[PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A)

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

* [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A)
@ 2026-04-09 17:21 Marek Czernohous
  2026-04-09 17:21 ` [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79) Marek Czernohous
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Marek Czernohous @ 2026-04-09 17:21 UTC (permalink / raw)
  To: nouveau; +Cc: dri-devel, Marek Czernohous

Hi,

this is my first patch series for the kernel. I'm running an Apple Mac
Mini with an NVAC (MCP79) chipset and DisplayPort output, and hit three
separate stability issues with the nouveau driver on kernel 6.18.

The patches are independent but together they bring the NVAC from
"needs NvMSI=0 workaround and can't run Wayland" to fully stable
operation with MSI enabled, working DPMS, and functional Wayland
(tested with Weston).

Patch 1: The MSI re-arm function for G94 uses memory-mapped register
access (nv40), which is unreliable on NVAC — the sister chipset NVAA
already has MSI disabled entirely for this reason. Switching to the
PCI config space method (nv46) fixes sporadic FIFO errors and hangs.

Patch 2: nv50_sor_atomic_disable() dereferences nv_encoder->crtc
unconditionally via container_of(). Under Wayland atomic modesetting,
the CRTC can become NULL between check and commit, causing a kernel
crash. Adding a NULL guard fixes Wayland session teardown.

Patch 3: Transient DisplayPort link glitches on NVAC cause unnecessary
disconnect/reconnect cycles. A single 100ms retry in the HPD IRQ
handler catches these without affecting real unplug detection.

All three patches were tested on the same machine (NVAC 0xac080b1,
DP output) over multiple days with DPMS cycling, VT switching, and
Wayland sessions.

Thanks,
Marek

Marek Czernohous (3):
  drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79)
  drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable
  drm/nouveau/dp: retry link check once on HPD IRQ before disconnect

 drivers/gpu/drm/nouveau/dispnv50/disp.c       | 8 +++++++-
 drivers/gpu/drm/nouveau/nouveau_display.c     | 4 ++++
 drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c | 2 +-
 3 files changed, 12 insertions(+), 2 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79)
  2026-04-09 17:21 [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Marek Czernohous
@ 2026-04-09 17:21 ` Marek Czernohous
  2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
  2026-04-09 17:21 ` [PATCH 2/3] drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable Marek Czernohous
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Marek Czernohous @ 2026-04-09 17:21 UTC (permalink / raw)
  To: nouveau; +Cc: dri-devel, Marek Czernohous

The G94 PCI function uses nv40_pci_msi_rearm(), which re-arms MSI
interrupts via memory-mapped register access. On NVAC (MCP79/MCP7A)
chipsets this method is unreliable, causing sporadic FIFO errors and
GPU hangs.

The closely related NVAA (MCP77/MCP73) chipset has MSI disabled
entirely in the driver (.msi_rearm is not set), with a comment marking
it as "reported broken". NVAC shares the same integrated GPU
architecture (both are NV50/Tesla family, memory-mapped through the
host bridge) but was not given the same treatment.

Switch to nv46_pci_msi_rearm(), which re-arms MSI via direct PCI
config space access (pci_write_config_byte at offset 0x68). This
method bypasses the memory-mapped register path that is problematic on
these integrated chipsets.

Tested on Apple Mac Mini (MCP79, NVAC 0xac080b1) with DisplayPort
output. System is stable with MSI enabled (no NvMSI=0 workaround
needed), zero FIFO errors observed over extended operation including
DPMS cycles.

Signed-off-by: Marek Czernohous <marek@czernohous.de>
---
 drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c b/drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c
index df745d069..9dc28a4bc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/pci/g94.c
@@ -28,7 +28,7 @@ g94_pci_func = {
 	.cfg = { .addr = 0x088000, .size = 0x1000 },

 	.init = g84_pci_init,
-	.msi_rearm = nv40_pci_msi_rearm,
+	.msi_rearm = nv46_pci_msi_rearm,

 	.pcie.init = g84_pcie_init,
 	.pcie.set_link = g84_pcie_set_link,
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable
  2026-04-09 17:21 [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Marek Czernohous
  2026-04-09 17:21 ` [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79) Marek Czernohous
@ 2026-04-09 17:21 ` Marek Czernohous
  2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
  2026-04-09 17:21 ` [PATCH 3/3] drm/nouveau/dp: retry link check once on HPD IRQ before disconnect Marek Czernohous
  2026-04-12  0:25 ` Claude review: drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Claude Code Review Bot
  3 siblings, 1 reply; 8+ messages in thread
From: Marek Czernohous @ 2026-04-09 17:21 UTC (permalink / raw)
  To: nouveau; +Cc: dri-devel, Marek Czernohous

nv50_sor_atomic_disable() calls nv50_head(nv_encoder->crtc) without
checking whether nv_encoder->crtc is NULL. Under Wayland compositors
that use atomic modesetting, a race condition can occur between
atomic_check and atomic_commit: the encoder's CRTC reference may
become NULL after validation but before the disable callback runs.

When this happens, nv50_head() receives a NULL drm_crtc pointer,
and container_of() produces a garbage pointer, leading to a kernel
crash (NULL pointer dereference or page fault).

Add an explicit NULL check for nv_encoder->crtc before dereferencing.
If the CRTC is already gone, release the output resource and return
early — there is nothing left to disable.

Note: checking nv50_head()'s return value would not work here because
container_of(NULL, ...) never returns NULL; the bogus pointer must be
prevented at the source.

Tested on NVAC (MCP79) with Weston --backend=drm. Without this patch,
switching VTs or closing a Wayland session triggers a kernel oops.
With this patch, Wayland session teardown and DPMS cycles are stable.

Signed-off-by: Marek Czernohous <marek@czernohous.de>
---
 drivers/gpu/drm/nouveau/dispnv50/disp.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index 6c3a8712d..42080874a 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -1565,7 +1565,13 @@ static void
 nv50_sor_atomic_disable(struct drm_encoder *encoder, struct drm_atomic_state *state)
 {
 	struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
-	struct nv50_head *head = nv50_head(nv_encoder->crtc);
+	struct nv50_head *head;
+
+	if (!nv_encoder->crtc) {
+		nvif_outp_release(&nv_encoder->outp);
+		return;
+	}
+	head = nv50_head(nv_encoder->crtc);
 #ifdef CONFIG_DRM_NOUVEAU_BACKLIGHT
 	struct nouveau_connector *nv_connector = nv50_outp_get_old_connector(state, nv_encoder);
 	struct nouveau_drm *drm = nouveau_drm(nv_encoder->base.base.dev);
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] drm/nouveau/dp: retry link check once on HPD IRQ before disconnect
  2026-04-09 17:21 [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Marek Czernohous
  2026-04-09 17:21 ` [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79) Marek Czernohous
  2026-04-09 17:21 ` [PATCH 2/3] drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable Marek Czernohous
@ 2026-04-09 17:21 ` Marek Czernohous
  2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
  2026-04-12  0:25 ` Claude review: drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Claude Code Review Bot
  3 siblings, 1 reply; 8+ messages in thread
From: Marek Czernohous @ 2026-04-09 17:21 UTC (permalink / raw)
  To: nouveau; +Cc: dri-devel, Marek Czernohous

On NV50/G94-family chipsets with DisplayPort, transient link glitches
can trigger an HPD IRQ where the first nouveau_dp_link_check() fails
momentarily. The driver then falls through to connector status
re-detection, treating it as a disconnect event. This causes a brief
display blackout followed by a re-plug, which is disruptive under
both X11 and Wayland.

Add a single retry with a 100ms delay before giving up on the link.
DisplayPort link training typically completes within a few
milliseconds; 100ms is generous enough to cover worst-case
re-negotiation on older hardware without adding perceptible latency
for genuine unplug events.

The retry is bounded (exactly one attempt) and only applies to the
IRQ path — plug/unplug events are not affected.

Tested on NVAC (MCP79) with DisplayPort output over extended DPMS
on/off cycles. Without this patch, occasional display flickers were
observed after DPMS resume. With this patch, all DPMS transitions are
clean.

Signed-off-by: Marek Czernohous <marek@czernohous.de>
---
 drivers/gpu/drm/nouveau/nouveau_display.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index d71dcfc6e..b6e40794b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -465,6 +465,10 @@ nouveau_display_hpd_work(struct work_struct *work)
 		if (bits & NVIF_CONN_EVENT_V0_IRQ) {
 			if (nouveau_dp_link_check(nv_connector))
 				continue;
+			/* Retry once after 100ms for transient DP glitches (NV50/G94) */
+			msleep(100);
+			if (nouveau_dp_link_check(nv_connector))
+				continue;
 		}

 		connector->status = drm_helper_probe_detect(connector, NULL, false);
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Claude review: drm/nouveau: stability fixes for NVAC (MCP79/MCP7A)
  2026-04-09 17:21 [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Marek Czernohous
                   ` (2 preceding siblings ...)
  2026-04-09 17:21 ` [PATCH 3/3] drm/nouveau/dp: retry link check once on HPD IRQ before disconnect Marek Czernohous
@ 2026-04-12  0:25 ` Claude Code Review Bot
  3 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:25 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/nouveau: stability fixes for NVAC (MCP79/MCP7A)
Author: Marek Czernohous <mczernohous@gmail.com>
Patches: 4
Reviewed: 2026-04-12T10:25:00.599272

---

This is a 3-patch series targeting stability on NVAC (MCP79) hardware, covering MSI re-arm, a NULL pointer guard in display teardown, and a DP link check retry. The author has clearly done real testing on hardware and the problems described are plausible. However, there are significant issues with accuracy in the commit messages and with the scope of the changes vs. what is described.

**Patch 1** is the strongest patch — it aligns g94 with its sibling chipsets (g84, g92) that already use the PCI config space method — but the commit message contains a factually incorrect claim about NVAA. **Patch 2** adds a defensive NULL check that prevents a crash, but the "race condition" explanation is questionable and the early-return path skips substantial cleanup. **Patch 3** is the most concerning — it adds a blanket 100ms sleep in a shared hot-plug work handler affecting all chipsets, not just the claimed NV50/G94.

The series would benefit from more accurate commit messages and, in at least one case, a narrower or better-justified fix.

---

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Claude review: drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79)
  2026-04-09 17:21 ` [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79) Marek Czernohous
@ 2026-04-12  0:25   ` Claude Code Review Bot
  0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:25 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Verdict: Likely correct fix, but commit message needs correction.**

The one-line change itself is sound:

```c
-.msi_rearm = nv40_pci_msi_rearm,
+.msi_rearm = nv46_pci_msi_rearm,
```

Looking at the two implementations:
- `nv40_pci_msi_rearm()` (nv40.c:29): `nvkm_pci_wr08(pci, 0x0068, 0xff)` — MMIO/PRI register write
- `nv46_pci_msi_rearm()` (nv46.c:36): `pci_write_config_byte(pdev, 0x68, 0xff)` — PCI config space write

The comment in nv46.c:28-30 explicitly states: *"MSI re-arm through the PRI appears to be broken on NV46/NV50/G84/G86/G92, so we access it via alternate PCI config space mechanisms."* Both g84 and g92 already use `nv46_pci_msi_rearm`, so g94 using `nv40_pci_msi_rearm` was arguably the odd one out in its family.

**Issue 1: Factually incorrect claim about NVAA.** The commit message states: *"The closely related NVAA (MCP77/MCP73) chipset has MSI disabled entirely in the driver (.msi_rearm is not set)."* This is **wrong**. Looking at `base.c:1209`, NVAA uses `.pci = { 0x00000001, g94_pci_new }` — the exact same PCI function as NVAC. NVAA has the same `msi_rearm` setting as every other g94-family chip. This claim should be removed or corrected.

**Issue 2: Scope wider than described.** The `g94_pci_func` is shared by approximately 10 chipsets (G94, G96, G98, NVAA, NVAC, NVAF/MCP89, and several GT2xx variants). The commit message frames this as an NVAC-specific fix, but it changes behavior for all of them. The change is likely safe (and arguably correct for the whole family given the nv46.c comment and g84/g92 precedent), but the commit message should acknowledge this broader scope.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Claude review: drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable
  2026-04-09 17:21 ` [PATCH 2/3] drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable Marek Czernohous
@ 2026-04-12  0:25   ` Claude Code Review Bot
  0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:25 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Verdict: Defensive fix that prevents a crash, but the explanation is suspect and the early-return may skip necessary cleanup.**

The change:
```c
+	if (!nv_encoder->crtc) {
+		nvif_outp_release(&nv_encoder->outp);
+		return;
+	}
+	head = nv50_head(nv_encoder->crtc);
```

**Issue 1: The race condition explanation is dubious.** The commit says: *"a race condition can occur between atomic_check and atomic_commit: the encoder's CRTC reference may become NULL after validation but before the disable callback runs."* In the DRM atomic framework, `atomic_commit` holds all the necessary modeset locks. The `nv_encoder->crtc` field is nouveau's own internal tracking, set at the end of `nv50_sor_atomic_disable` itself (line 1604: `nv_encoder->crtc = NULL`). If `crtc` is NULL entering this function, it is more likely that the encoder was never attached, or that there's a double-disable scenario — not an atomic check/commit race. The commit message should explain the actual root cause or at least describe it as "crtc may be NULL in certain teardown paths" without speculating about a race that the locking should prevent.

**Issue 2: The early return skips significant cleanup.** When `nv_encoder->crtc` is NULL, the function returns after only calling `nvif_outp_release()`. But the full path also does:
- Backlight disable via DPCD (`drm_edp_backlight_disable`)
- HDMI state cleanup (`nvif_outp_hdmi(..., false, ...)`)
- DP power-down (`nouveau_dp_power_down`)
- Display ID clearing (`head->func->display_id(head, 0)`)
- Encoder update and audio disable

Some of these (like DP power-down, backlight disable) operate on the encoder/connector, not the CRTC, and might still be needed even when the CRTC is gone. If the CRTC being NULL is truly expected, the commit should justify why none of that other cleanup matters.

**Issue 3: The note about container_of is correct** — `container_of(NULL, ...)` produces a bogus non-NULL pointer, so the check must be done before calling `nv50_head()`. This is well-explained.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Claude review: drm/nouveau/dp: retry link check once on HPD IRQ before disconnect
  2026-04-09 17:21 ` [PATCH 3/3] drm/nouveau/dp: retry link check once on HPD IRQ before disconnect Marek Czernohous
@ 2026-04-12  0:25   ` Claude Code Review Bot
  0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:25 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Verdict: Questionable approach — adds latency to a common hot-plug path for all chipsets.**

The change:
```c
 		if (bits & NVIF_CONN_EVENT_V0_IRQ) {
 			if (nouveau_dp_link_check(nv_connector))
 				continue;
+			/* Retry once after 100ms for transient DP glitches (NV50/G94) */
+			msleep(100);
+			if (nouveau_dp_link_check(nv_connector))
+				continue;
 		}
```

**Issue 1: Global impact despite chipset-specific comment.** The comment says "NV50/G94" but `nouveau_display_hpd_work()` is the HPD handler for **all** nouveau-supported hardware. Every DP connector on every supported GPU will now incur a 100ms sleep on any HPD IRQ where the first link check fails. This should either be gated on the chipset or the comment should acknowledge the broader scope.

**Issue 2: msleep(100) in a shared workqueue.** While sleeping in a workqueue is technically valid, this is `nouveau_display_hpd_work` which processes HPD events for potentially multiple connectors in a loop (the `drm_connector_list_iter` at the top). A 100ms stall blocks processing of other connectors' HPD events in the same work item. If multiple connectors fire IRQs simultaneously and all fail the first link check, the cumulative delay compounds.

**Issue 3: Treating the symptom, not the cause.** If the NVAC DP link is transiently failing after DPMS resume, the root cause may be that link training isn't waiting long enough or that the sink isn't ready. A retry loop at the HPD IRQ level is a blunt instrument — the DP link check function itself or the link training code would be a more targeted place to add resilience.

**Issue 4: No Fixes: tag.** If this fixes a user-visible regression or bug, it should include a `Fixes:` tag identifying when the behavior was introduced. Same applies to patches 1 and 2. For stable kernel backporting, `Cc: stable@vger.kernel.org` should be considered.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-12  0:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09 17:21 [PATCH 0/3] drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Marek Czernohous
2026-04-09 17:21 ` [PATCH 1/3] drm/nouveau/pci: use nv46 MSI rearm for G94 (NVAC/MCP79) Marek Czernohous
2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
2026-04-09 17:21 ` [PATCH 2/3] drm/nouveau/kms: add NULL check for CRTC in nv50_sor_atomic_disable Marek Czernohous
2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
2026-04-09 17:21 ` [PATCH 3/3] drm/nouveau/dp: retry link check once on HPD IRQ before disconnect Marek Czernohous
2026-04-12  0:25   ` Claude review: " Claude Code Review Bot
2026-04-12  0:25 ` Claude review: drm/nouveau: stability fixes for NVAC (MCP79/MCP7A) Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox