[PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

* [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla
@ 2026-05-13 17:50 Marek Czernohous
  2026-05-13 17:50 ` [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe Marek Czernohous
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Marek Czernohous @ 2026-05-13 17:50 UTC (permalink / raw)
  To: nouveau
  Cc: Lyude Paul, Danilo Krummrich, dri-devel, linux-kernel,
	Marek Czernohous

Hi all,

Two-patch series for the legacy nv04_fifo path covering Tesla
(MCP77/MCP79 and G80-GT218). Daily-driven on the reference NVAC
hardware (Apple Mac mini Late 2009, GeForce 9400M) since 2026-05-05.

Patch 1 demotes a benign CACHE_ERROR that fires once per Mesa session
start on Tesla GPUs. The Mesa NV50 userspace driver issues a method-
0x0060 / data-0xbeef02xx binding probe that recovers cleanly via
nv04_fifo_swmthd(), but currently logs at error level on every X or
Wayland session, dominating dmesg noise on this hardware class. This
clears the channel for patch 2 to identify real faults from noise.

Patch 2 adds a two-tier fault-recovery path for Tesla FIFO faults:

  Tier 1 (per fault). Look up the channel via nvkm_chan_get_chid,
  call nvkm_chan_error(chan, true), fire tracepoint
  nouveau:fifo_chan_killed. Idempotent through the existing
  chan->errored short-circuit.

  Tier 2 (sliding window). When the per-fifo fault count in a
  configurable window reaches the threshold, schedule a worker that
  calls drm_dev_wedged_event(drm, DRM_WEDGE_RECOVERY_REBIND, NULL)
  and fires tracepoint nouveau:fifo_dev_wedged. Worker context is
  needed because kobject_uevent_env may sleep.

Motivation: Fermi+ gets channel-kill and device-wedge automatically
through nvkm_runl_rc; Tesla was feature-frozen before the DRM wedge
uAPI existed. Three observable consequences on the reference
hardware:

  1. Silent state corruption (channel produces wrong output after a
     fault, no notice to userspace).
  2. Observability gap (no counters, tracepoints, or wedge event,
     only dmesg).
  3. Repeated-fault loop (the log-and-reset cycle repeats forever on
     a persistently faulting channel instead of killing it).

Validation. A debugfs fault-injector (kept on a separate
DO-NOT-MERGE branch, not part of this submission) was used to drive
both Tier-1 and Tier-2 paths through their full state space. Phases
1-5 of the test plan were exercised that way. Phase 6 (no manual
injection, real workload soak) ran 2026-05-05 through 2026-05-13:
one organic DRM_WEDGE_RECOVERY_REBIND event was captured on
2026-05-05 09:08; the rest of the soak was fault-free.

Companion userland tool nouveau-pstate-daemon v0.2.0 [1] subscribes
to the WEDGED=rebind uevent in log-only mode and was used to confirm
end-to-end propagation through udev.

Module parameters:

  nouveau.fifo_wedge_count     (uint, 0..32, default 10)
  nouveau.fifo_wedge_window_ms (uint, 100..600000, default 60000)

Setting fifo_wedge_count=0 disables Tier-2 entirely while keeping
Tier-1 channel-kill active.

A note on MAINTAINERS. The series adds a new file
drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c. The change is
covered by the existing nouveau MAINTAINERS section
(drivers/gpu/drm/nouveau/), so no MAINTAINERS update is included.
checkpatch.pl flags this as a hint; it is not load-bearing.

This is a follow-up to the April 9 NVAC stability series [2], which
is still awaiting review. The two patches here are independent of
that series and apply against current Linus master.

[1] https://github.com/hibbes/nouveau-pstate-daemon (v0.2.0)
[2] https://lore.kernel.org/dri-devel/20260409-nouveau-nvac-stability-series

Marek Czernohous (2):
  drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind
    probe
  drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher

 .../drm/nouveau/include/nvkm/engine/fifo.h    |  12 ++
 .../include/trace/events/nouveau_fifo.h       |  58 +++++++++
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  29 +++++
 .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild   |   1 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/base.c   |   3 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   |  29 ++++-
 .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h   |  10 ++
 .../drm/nouveau/nvkm/engine/fifo/recover.c    | 121 ++++++++++++++++++
 8 files changed, 257 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
 create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c

base-commit: 1f63dd8ca0dc05a8272bb8155f643c691d29bb11
-- 
2.53.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe
  2026-05-13 17:50 [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Marek Czernohous
@ 2026-05-13 17:50 ` Marek Czernohous
  2026-05-16  1:34   ` Claude review: " Claude Code Review Bot
  2026-05-13 17:50 ` [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher Marek Czernohous
  2026-05-16  1:34 ` Claude review: drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Claude Code Review Bot
  2 siblings, 1 reply; 6+ messages in thread
From: Marek Czernohous @ 2026-05-13 17:50 UTC (permalink / raw)
  To: nouveau
  Cc: Lyude Paul, Danilo Krummrich, dri-devel, linux-kernel,
	Marek Czernohous

The Mesa userspace driver issues a method-0x0060 / data-0xbeef02xx
binding probe on Tesla GPUs that ends up triggering CACHE_ERROR in the
PFIFO interrupt handler. The probe is harmless and recovers cleanly,
but it floods dmesg at error level on every X/Wayland session start.

Filter that specific pattern down to debug level so dmesg stays clean
while real CACHE_ERROR conditions are still logged at error level.

Tested on Apple Mac Mini (MCP79, NVAC 0xac080b1) and a G94: dmesg has
no CACHE_ERROR spam during normal operation, the previously visible
beef02xx pattern now only appears at debug level.

Signed-off-by: Marek Czernohous <marek@czernohous.de>
---
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   | 25 ++++++++++++++-----
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
index c4b8e567d86f..fa13cd55b593 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
@@ -327,12 +327,25 @@ nv04_fifo_intr_cache_error(struct nvkm_fifo *fifo, u32 chid, u32 get)
 
 	if (!(pull0 & 0x00000100) ||
 	    !nv04_fifo_swmthd(device, chid, mthd, data)) {
-		chan = nvkm_chan_get_chid(&fifo->engine, chid, &flags);
-		nvkm_error(subdev, "CACHE_ERROR - "
-			   "ch %d [%s] subc %d mthd %04x data %08x\n",
-			   chid, chan ? chan->name : "unknown",
-			   (mthd >> 13) & 7, mthd & 0x1ffc, data);
-		nvkm_chan_put(&chan, flags);
+		/*
+		 * Filter benign Mesa NV50 bind probe: mthd 0x0060 with
+		 * data 0xbeef02xx is a harmless userspace probe on Tesla
+		 * GPUs and does not indicate an actual error condition.
+		 * Demote to debug to keep dmesg clean while still catching
+		 * real CACHE_ERROR events.
+		 */
+		if ((mthd & 0x1ffc) == 0x0060 &&
+		    (data & 0xffffff00) == 0xbeef0200) {
+			nvkm_debug(subdev, "CACHE_ERROR - ch %d subc %d mthd %04x data %08x (benign, skipped)\n",
+				   chid, (mthd >> 13) & 7, mthd & 0x1ffc, data);
+		} else {
+			chan = nvkm_chan_get_chid(&fifo->engine, chid, &flags);
+			nvkm_error(subdev, "CACHE_ERROR - "
+				   "ch %d [%s] subc %d mthd %04x data %08x\n",
+				   chid, chan ? chan->name : "unknown",
+				   (mthd >> 13) & 7, mthd & 0x1ffc, data);
+			nvkm_chan_put(&chan, flags);
+		}
 	}
 
 	nvkm_wr32(device, NV04_PFIFO_CACHE1_DMA_PUSH, 0);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher
  2026-05-13 17:50 [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Marek Czernohous
  2026-05-13 17:50 ` [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe Marek Czernohous
@ 2026-05-13 17:50 ` Marek Czernohous
  2026-05-16  1:34   ` Claude review: " Claude Code Review Bot
  2026-05-16  1:34 ` Claude review: drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Claude Code Review Bot
  2 siblings, 1 reply; 6+ messages in thread
From: Marek Czernohous @ 2026-05-13 17:50 UTC (permalink / raw)
  To: nouveau
  Cc: Lyude Paul, Danilo Krummrich, dri-devel, linux-kernel,
	Marek Czernohous

On Tesla / NV50 family chipsets (nv50, g84, g94, g98, mcp77, mcp79),
FIFO fault handling in nv04_fifo_intr_cache_error and
nv04_fifo_intr_dma_pusher logs the fault and resets hardware registers
but leaves the offending channel running. Compared to Fermi+ which
calls nvkm_chan_error from nvkm_runl_rc, Tesla has no escalation:
silent state corruption is possible, no telemetry beyond dmesg, and
repeated faults on the same channel keep firing forever.

Add a shared recovery helper nv04_fifo_recover() that both intr
handlers call after the existing logging+reset sequence. It implements
two tiers:

  Tier-1 (per-fault): nvkm_chan_get_chid + nvkm_chan_error(chan, true).
  The atomic chan->errored short-circuit means re-faults on the same
  channel are no-op; other channels are unaffected.

  Tier-2 (sliding-window): per-fifo lock-protected ring of fault
  timestamps. When the count within fifo_wedge_window_ms reaches
  fifo_wedge_count, schedule a workqueue job that emits a
  drm_dev_wedged_event with DRM_WEDGE_RECOVERY_REBIND. The
  drm_dev_wedged_event call cannot run from IRQ context because
  kobject_uevent_env may sleep; the workqueue indirection handles this.

Tracepoints (nouveau:fifo_chan_killed, nouveau:fifo_dev_wedged) provide
zero-overhead telemetry consumable via perf or bpftrace.

Module parameters fifo_wedge_count (default 10, range 0..32, 0=Tier-2
disabled) and fifo_wedge_window_ms (default 60000, range 100..600000)
allow tuning without rebuild.

Validated on Apple Mac mini Late 2009 (NVAC, MCP79).

Signed-off-by: Marek Czernohous <marek@czernohous.de>
---
 .../drm/nouveau/include/nvkm/engine/fifo.h    |  12 ++
 .../include/trace/events/nouveau_fifo.h       |  58 +++++++++
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  29 +++++
 .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild   |   1 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/base.c   |   3 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   |   4 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h   |  10 ++
 .../drm/nouveau/nvkm/engine/fifo/recover.c    | 121 ++++++++++++++++++
 8 files changed, 238 insertions(+)
 create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
 create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c

diff --git a/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h b/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h
index 96c16cfccf16..7c27b4c8a212 100644
--- a/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h
+++ b/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h
@@ -55,6 +55,17 @@ void nvkm_chan_put(struct nvkm_chan **, unsigned long irqflags);
 
 struct nvkm_chan *nvkm_uchan_chan(struct nvkm_object *);
 
+#define NVKM_FIFO_WEDGE_RING_MAX 32
+
+struct nvkm_fifo_wedge {
+	spinlock_t       lock;
+	u32              count;                                  /* aktuelle Fenster-Tiefe */
+	ktime_t          ts[NVKM_FIFO_WEDGE_RING_MAX];          /* Ring von Timestamps */
+	u32              head;                                   /* Ring-Head */
+	struct work_struct work;                                /* schedules drm_dev_wedged_event */
+	atomic_t         wedged;                                /* Tier-2 already fired? */
+};
+
 struct nvkm_fifo {
 	const struct nvkm_fifo_func *func;
 	struct nvkm_engine engine;
@@ -86,6 +97,7 @@ struct nvkm_fifo {
 
 	spinlock_t lock;
 	struct mutex mutex;
+	struct nvkm_fifo_wedge wedge;
 };
 
 void nvkm_fifo_fault(struct nvkm_fifo *, struct nvkm_fault_data *);
diff --git a/drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h b/drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
new file mode 100644
index 000000000000..46d043a82850
--- /dev/null
+++ b/drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: MIT */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nouveau
+
+#if !defined(_TRACE_NOUVEAU_FIFO_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NOUVEAU_FIFO_H
+
+#include <linux/tracepoint.h>
+#include <drm/drm_device.h>
+
+TRACE_EVENT(nouveau_fifo_chan_killed,
+	TP_PROTO(struct drm_device *dev, u32 chid, u32 fault_type, u64 info),
+	TP_ARGS(dev, chid, fault_type, info),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev->dev))
+		__field(u32, chid)
+		__field(u32, fault_type)
+		__field(u64, info)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__entry->chid = chid;
+		__entry->fault_type = fault_type;
+		__entry->info = info;
+	),
+	TP_printk("dev=%s chid=%u fault=%s info=0x%llx",
+		__get_str(devname),
+		__entry->chid,
+		__entry->fault_type == 0 ? "CACHE_ERROR" : "DMA_PUSHER",
+		__entry->info)
+);
+
+TRACE_EVENT(nouveau_fifo_dev_wedged,
+	TP_PROTO(struct drm_device *dev, u32 fault_count, u32 window_ms),
+	TP_ARGS(dev, fault_count, window_ms),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev->dev))
+		__field(u32, fault_count)
+		__field(u32, window_ms)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__entry->fault_count = fault_count;
+		__entry->window_ms = window_ms;
+	),
+	TP_printk("dev=%s wedged after %u faults in %u ms",
+		__get_str(devname),
+		__entry->fault_count,
+		__entry->window_ms)
+);
+
+#endif /* _TRACE_NOUVEAU_FIFO_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/nouveau/include/trace/events
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE nouveau_fifo
+#include <trace/define_trace.h>
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 517ff2c31dce..c62b7fc3a1d3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -22,6 +22,8 @@
  * Authors: Ben Skeggs
  */
 
+#define CREATE_TRACE_POINTS
+
 #include <linux/aperture.h>
 #include <linux/delay.h>
 #include <linux/module.h>
@@ -74,6 +76,9 @@
 #include "nouveau_uvmm.h"
 #include "nouveau_sched.h"
 
+#include <engine/fifo.h>
+#include <trace/events/nouveau_fifo.h>
+
 DECLARE_DYNDBG_CLASSMAP(drm_debug_classes, DD_CLASS_TYPE_DISJOINT_BITS, 0,
 			"DRM_UT_CORE",
 			"DRM_UT_DRIVER",
@@ -111,6 +116,18 @@ MODULE_PARM_DESC(runpm, "disable (0), force enable (1), optimus only default (-1
 static int nouveau_runtime_pm = -1;
 module_param_named(runpm, nouveau_runtime_pm, int, 0400);
 
+MODULE_PARM_DESC(fifo_wedge_count,
+	"FIFO faults within window before drm_dev_wedged_event "
+	"(0=disable Tier-2, max 32, default 10)");
+unsigned int nouveau_fifo_wedge_count = 10;
+module_param_named(fifo_wedge_count, nouveau_fifo_wedge_count, uint, 0400);
+
+MODULE_PARM_DESC(fifo_wedge_window_ms,
+	"Sliding-window width in milliseconds for fifo_wedge_count "
+	"(default 60000)");
+unsigned int nouveau_fifo_wedge_window_ms = 60000;
+module_param_named(fifo_wedge_window_ms, nouveau_fifo_wedge_window_ms, uint, 0400);
+
 static struct drm_driver driver_stub;
 static struct drm_driver driver_pci;
 static struct drm_driver driver_platform;
@@ -1495,6 +1512,18 @@ nouveau_drm_init(void)
 	if (!nouveau_modeset)
 		return 0;
 
+	if (nouveau_fifo_wedge_count > NVKM_FIFO_WEDGE_RING_MAX) {
+		pr_warn("nouveau: fifo_wedge_count=%u exceeds max %u; clamping\n",
+			nouveau_fifo_wedge_count, NVKM_FIFO_WEDGE_RING_MAX);
+		nouveau_fifo_wedge_count = NVKM_FIFO_WEDGE_RING_MAX;
+	}
+	if (nouveau_fifo_wedge_window_ms < 100 ||
+	    nouveau_fifo_wedge_window_ms > 600000) {
+		pr_warn("nouveau: fifo_wedge_window_ms=%u out of range; resetting to 60000\n",
+			nouveau_fifo_wedge_window_ms);
+		nouveau_fifo_wedge_window_ms = 60000;
+	}
+
 	nouveau_module_debugfs_init();
 
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild
index 376e9c3bcb1a..1ff29753731d 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild
@@ -5,6 +5,7 @@ nvkm-y += nvkm/engine/fifo/chan.o
 nvkm-y += nvkm/engine/fifo/chid.o
 nvkm-y += nvkm/engine/fifo/runl.o
 nvkm-y += nvkm/engine/fifo/runq.o
+nvkm-y += nvkm/engine/fifo/recover.o
 
 nvkm-y += nvkm/engine/fifo/nv04.o
 nvkm-y += nvkm/engine/fifo/nv10.o
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
index 9dd924694306..a61183fa38af 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
@@ -337,6 +337,8 @@ nvkm_fifo_dtor(struct nvkm_engine *engine)
 	struct nvkm_runl *runl, *runt;
 	struct nvkm_runq *runq, *rtmp;
 
+	nv04_fifo_wedge_fini(fifo);
+
 	if (fifo->userd.bar1)
 		nvkm_vmm_put(nvkm_bar_bar1_vmm(engine->subdev.device), &fifo->userd.bar1);
 	nvkm_memory_unref(&fifo->userd.mem);
@@ -390,6 +392,7 @@ nvkm_fifo_new_(const struct nvkm_fifo_func *func, struct nvkm_device *device,
 	fifo->timeout.chan_msec = 10000;
 	spin_lock_init(&fifo->lock);
 	mutex_init(&fifo->mutex);
+	nv04_fifo_wedge_init(fifo);
 
 	return nvkm_engine_ctor(&nvkm_fifo, device, type, inst, true, &fifo->engine);
 }
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
index fa13cd55b593..cb81941ecccd 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c
@@ -345,6 +345,8 @@ nv04_fifo_intr_cache_error(struct nvkm_fifo *fifo, u32 chid, u32 get)
 				   chid, chan ? chan->name : "unknown",
 				   (mthd >> 13) & 7, mthd & 0x1ffc, data);
 			nvkm_chan_put(&chan, flags);
+			nv04_fifo_recover(fifo, chid, NV04_FAULT_CACHE_ERROR,
+					  ((u64)mthd << 32) | data);
 		}
 	}
 
@@ -410,6 +412,8 @@ nv04_fifo_intr_dma_pusher(struct nvkm_fifo *fifo, u32 chid)
 	}
 	nvkm_chan_put(&chan, flags);
 
+	nv04_fifo_recover(fifo, chid, NV04_FAULT_DMA_PUSHER, state);
+
 	nvkm_wr32(device, 0x003228, 0x00000000);
 	nvkm_wr32(device, 0x003220, 0x00000001);
 	nvkm_wr32(device, 0x002100, NV_PFIFO_INTR_DMA_PUSHER);
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h
index fff1428ef267..bf551906dcd4 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h
@@ -83,6 +83,16 @@ void nv04_chan_start(struct nvkm_chan *);
 void nv04_chan_stop(struct nvkm_chan *);
 void nv04_eobj_ramht_del(struct nvkm_chan *, int);
 
+/* Recovery helper for Tesla cache_error/dma_pusher (recover.c). */
+#define NV04_FAULT_CACHE_ERROR	0
+#define NV04_FAULT_DMA_PUSHER	1
+
+void nv04_fifo_recover(struct nvkm_fifo *fifo, u32 chid, u32 fault_type, u64 info);
+void nv04_fifo_wedge_init(struct nvkm_fifo *fifo);
+void nv04_fifo_wedge_fini(struct nvkm_fifo *fifo);
+extern unsigned int nouveau_fifo_wedge_count;
+extern unsigned int nouveau_fifo_wedge_window_ms;
+
 int nv10_fifo_chid_nr(struct nvkm_fifo *);
 
 int nv50_fifo_chid_nr(struct nvkm_fifo *);
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c
new file mode 100644
index 000000000000..14c0eebf7040
--- /dev/null
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: MIT
+/*
+ * nv04_fifo_recover - shared recovery helper for Tesla cache_error and
+ * dma_pusher fault paths.
+ *
+ * Tier-1: kill the offending channel via nvkm_chan_error.
+ * Tier-2: after a configurable burst of faults within a sliding time
+ *         window, request a device-wide drm_dev_wedged_event so userspace
+ *         can rebind the driver.
+ */
+
+#include "priv.h"
+#include "chan.h"
+
+#include <core/device.h>
+#include <subdev/timer.h>
+
+#include <linux/workqueue.h>
+#include <linux/jiffies.h>
+#include <linux/ktime.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_device.h>
+
+#include "nouveau_drv.h"
+#include <trace/events/nouveau_fifo.h>
+
+static struct drm_device *
+nv04_fifo_drm_device(struct nvkm_fifo *fifo)
+{
+	struct nvkm_device *device = fifo->engine.subdev.device;
+	struct nouveau_drm *drm = dev_get_drvdata(device->dev);
+
+	return (drm && drm->dev) ? drm->dev : NULL;
+}
+
+void
+nv04_fifo_recover(struct nvkm_fifo *fifo, u32 chid, u32 fault_type, u64 info)
+{
+	struct drm_device *drm_dev = nv04_fifo_drm_device(fifo);
+	struct nvkm_chan *chan;
+	unsigned long flags;
+	ktime_t now, cutoff;
+	u32 i, count;
+
+	if (drm_dev)
+		trace_nouveau_fifo_chan_killed(drm_dev, chid, fault_type, info);
+
+	chan = nvkm_chan_get_chid(&fifo->engine, chid, &flags);
+	if (chan) {
+		nvkm_chan_error(chan, true);
+		nvkm_chan_put(&chan, flags);
+	}
+
+	if (nouveau_fifo_wedge_count == 0)
+		return;
+
+	now = ktime_get();
+	cutoff = ktime_sub_ms(now, nouveau_fifo_wedge_window_ms);
+
+	spin_lock_irqsave(&fifo->wedge.lock, flags);
+
+	/* Insert current first, then purge expired and count survivors. */
+	fifo->wedge.ts[fifo->wedge.head] = now;
+	fifo->wedge.head = (fifo->wedge.head + 1) % NVKM_FIFO_WEDGE_RING_MAX;
+
+	count = 0;
+	for (i = 0; i < NVKM_FIFO_WEDGE_RING_MAX; i++) {
+		if (!ktime_to_ns(fifo->wedge.ts[i]))
+			continue;
+		if (ktime_before(fifo->wedge.ts[i], cutoff))
+			fifo->wedge.ts[i] = 0;
+		else
+			count++;
+	}
+	fifo->wedge.count = count;
+
+	if (count >= nouveau_fifo_wedge_count)
+		schedule_work(&fifo->wedge.work);
+
+	spin_unlock_irqrestore(&fifo->wedge.lock, flags);
+}
+
+static void
+nv04_fifo_wedge_work(struct work_struct *work)
+{
+	struct nvkm_fifo_wedge *w = container_of(work, struct nvkm_fifo_wedge, work);
+	struct nvkm_fifo *fifo = container_of(w, struct nvkm_fifo, wedge);
+	struct drm_device *drm_dev = nv04_fifo_drm_device(fifo);
+	u32 fault_count;
+
+	if (atomic_xchg(&w->wedged, 1) != 0)
+		return; /* already wedged this cycle */
+
+	if (!drm_dev)
+		return;
+
+	fault_count = w->count;
+
+	dev_info(drm_dev->dev,
+		 "nouveau: fifo wedged after %u faults in %u ms\n",
+		 fault_count, nouveau_fifo_wedge_window_ms);
+
+	trace_nouveau_fifo_dev_wedged(drm_dev, fault_count,
+				      nouveau_fifo_wedge_window_ms);
+
+	drm_dev_wedged_event(drm_dev, DRM_WEDGE_RECOVERY_REBIND, NULL);
+}
+
+void
+nv04_fifo_wedge_init(struct nvkm_fifo *fifo)
+{
+	spin_lock_init(&fifo->wedge.lock);
+	INIT_WORK(&fifo->wedge.work, nv04_fifo_wedge_work);
+	atomic_set(&fifo->wedge.wedged, 0);
+}
+
+void
+nv04_fifo_wedge_fini(struct nvkm_fifo *fifo)
+{
+	cancel_work_sync(&fifo->wedge.work);
+}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Claude review: drm/nouveau: nv04 FIFO cleanup + recovery for Tesla
  2026-05-13 17:50 [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Marek Czernohous
  2026-05-13 17:50 ` [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe Marek Czernohous
  2026-05-13 17:50 ` [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher Marek Czernohous
@ 2026-05-16  1:34 ` Claude Code Review Bot
  2 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  1:34 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/nouveau: nv04 FIFO cleanup + recovery for Tesla
Author: Marek Czernohous <mczernohous@gmail.com>
Patches: 3
Reviewed: 2026-05-16T11:34:49.332858

---

This is a well-structured two-patch series adding observability and recovery to the legacy nv04 FIFO path on Tesla GPUs. The cover letter is thorough, the design (Tier-1 channel-kill + Tier-2 sliding-window wedge) is sound, and the code follows existing nouveau patterns (e.g., the `nvkm_chan_get_chid` / `nvkm_chan_error` / `nvkm_chan_put` sequence mirrors Fermi+ paths in gf100.c, ga100.c, gv100.c).

The series is relatively clean. The main issues are:

1. **German comments in the public header** — the `struct nvkm_fifo_wedge` field comments are in German. Kernel code should use English.
2. **No unwedge/reset mechanism** — once `atomic_xchg(&w->wedged, 1)` fires, Tier-2 is permanently disabled for the lifetime of the driver. This is a design point worth an explicit note or a deliberate decision to document.
3. **Wedge init/fini runs for all GPU generations** — `nv04_fifo_wedge_init`/`fini` is called unconditionally from `nvkm_fifo_new_`/`nvkm_fifo_dtor`, even though `nv04_fifo_recover` is only called from the nv04 FIFO interrupt handler. Harmless but slightly untidy.

No blocking issues found. The locking is correct, `schedule_work` from IRQ context is fine, and the workqueue indirection for `drm_dev_wedged_event` (which may sleep via `kobject_uevent_env`) is the right approach.

---

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Claude review: drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe
  2026-05-13 17:50 ` [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe Marek Czernohous
@ 2026-05-16  1:34   ` Claude Code Review Bot
  0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  1:34 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Assessment: Good — minor nits only.**

The filter is narrowly scoped and well-explained. It checks for the exact Mesa NV50 bind probe signature (method 0x0060, data 0xbeef02xx) and demotes only that case to debug level.

**Positive:** By placing the filter *before* the `else` branch that patch 2 hooks into, benign probes correctly don't count toward the Tier-2 wedge window. This is the right ordering.

**Nit 1 — debug message omits channel name.** The error path prints `chan->name` but the debug path does not:

```c
nvkm_debug(subdev, "CACHE_ERROR - ch %d subc %d mthd %04x data %08x (benign, skipped)\n",
           chid, (mthd >> 13) & 7, mthd & 0x1ffc, data);
```

This is fine for a debug-level message, but for consistency with the error message format (`ch %d [%s]`), consider including the channel name. Very minor.

**Nit 2 — comment length.** The 5-line comment block is somewhat verbose for a simple filter. The kernel style generally prefers shorter inline comments. Not a blocker.

---

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Claude review: drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher
  2026-05-13 17:50 ` [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher Marek Czernohous
@ 2026-05-16  1:34   ` Claude Code Review Bot
  0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  1:34 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Assessment: Functionally correct, a few items to address.**

**Issue 1 (should fix) — German comments in public header.** `include/nvkm/engine/fifo.h` is a header shared across nouveau. The struct field comments are in German:

```c
u32              count;                                  /* aktuelle Fenster-Tiefe */
ktime_t          ts[NVKM_FIFO_WEDGE_RING_MAX];          /* Ring von Timestamps */
u32              head;                                   /* Ring-Head */
```

These should be in English ("current window depth", "timestamp ring", "ring head") or removed entirely, since the field names are already reasonably descriptive.

**Issue 2 (design question) — permanent wedge with no reset path.** Once Tier-2 fires:

```c
if (atomic_xchg(&w->wedged, 1) != 0)
    return; /* already wedged this cycle */
```

The `wedged` atomic is never reset to 0. If userspace performs the `DRM_WEDGE_RECOVERY_REBIND` and the driver is re-bound, the `nvkm_fifo` structure is re-allocated so this is implicitly reset. But if the uevent is *not* acted on (no listener, or userspace ignores it), the wedge event fires exactly once and then Tier-2 is silently dead while Tier-1 keeps running. This is probably intentional, but worth documenting in a comment or the commit message.

**Issue 3 (minor) — `w->count` read without lock in worker.**

```c
static void nv04_fifo_wedge_work(struct work_struct *work)
{
    ...
    fault_count = w->count;
```

`w->count` is updated under `fifo->wedge.lock` in the IRQ path but read here without the lock. This is a benign race since the value is only used for the log message and tracepoint (not control flow), but KCSAN may flag it. A `READ_ONCE()` would silence any tooling complaint and make the intent explicit.

**Issue 4 (minor) — `nv04_fifo_wedge_init`/`fini` called for all chipsets.** These are called from `nvkm_fifo_new_` and `nvkm_fifo_dtor` which run for every GPU generation, not just Tesla. The overhead is trivial (spinlock init + work init), but naming them `nv04_*` and initializing them globally is a bit misleading. Consider either: (a) renaming to something more generic, or (b) conditioning on chipset. Very minor.

**Issue 5 (cosmetic) — dev_info prefix redundancy.**

```c
dev_info(drm_dev->dev,
         "nouveau: fifo wedged after %u faults in %u ms\n",
         fault_count, nouveau_fifo_wedge_window_ms);
```

`dev_info` already prefixes the device name. The manual `"nouveau: "` prefix is redundant and will produce output like `pci 0000:02:00.0: nouveau: fifo wedged ...`. Drop the `"nouveau: "` prefix.

**Correctness notes (all fine):**
- `schedule_work` from IRQ context: safe by design.
- `nvkm_chan_error` from IRQ context: safe — other interrupt handlers (gf100, ga100, gv100) already call it this way.
- Reuse of `flags` variable in `nv04_fifo_recover` for both the chan get/put and the spinlock: correct, the chan is released before the spin_lock.
- `ktime_t` zero as sentinel: safe since `ktime_get()` will never return 0 in practice.
- Module params declared with perm `0400` (read-only): no runtime write race possible.
- `CREATE_TRACE_POINTS` placement in `nouveau_drm.c`: correct, and this is the first trace event header for nouveau.
- Tracepoint `TRACE_INCLUDE_PATH` uses the `../../` relative path pattern: this is the standard kernel convention for out-of-tree-style trace headers.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-05-16  1:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-13 17:50 [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Marek Czernohous
2026-05-13 17:50 ` [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe Marek Czernohous
2026-05-16  1:34   ` Claude review: " Claude Code Review Bot
2026-05-13 17:50 ` [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher Marek Czernohous
2026-05-16  1:34   ` Claude review: " Claude Code Review Bot
2026-05-16  1:34 ` Claude review: drm/nouveau: nv04 FIFO cleanup + recovery for Tesla Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox