From: Marek Czernohous <mczernohous@gmail.com>
To: nouveau@lists.freedesktop.org
Cc: Lyude Paul <lyude@redhat.com>, Danilo Krummrich <dakr@kernel.org>,
dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
Marek Czernohous <marek@czernohous.de>
Subject: [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla
Date: Wed, 13 May 2026 19:50:11 +0200
Message-ID: <20260513175014.96599-1-marek@czernohous.de>
Hi all,
Two-patch series for the legacy nv04_fifo path covering Tesla
(MCP77/MCP79 and G80-GT218). Daily-driven on the reference NVAC
hardware (Apple Mac mini Late 2009, GeForce 9400M) since 2026-05-05.
Patch 1 demotes a benign CACHE_ERROR that fires once per Mesa session
start on Tesla GPUs. The Mesa NV50 userspace driver issues a method-
0x0060 / data-0xbeef02xx binding probe that recovers cleanly via
nv04_fifo_swmthd(), but is currently logged at error level on every X or
Wayland session start and dominates dmesg on this hardware class.
Demoting it lets patch 2 distinguish real faults from noise.
Patch 2 adds a two-tier fault-recovery path for Tesla FIFO faults:
Tier 1 (per fault). Look up the channel via nvkm_chan_get_chid,
call nvkm_chan_error(chan, true), fire tracepoint
nouveau:fifo_chan_killed. Idempotent through the existing
chan->errored short-circuit.
Tier 2 (sliding window). When the per-fifo fault count in a
configurable window reaches the threshold, schedule a worker that
calls drm_dev_wedged_event(drm, DRM_WEDGE_RECOVERY_REBIND, NULL)
and fires tracepoint nouveau:fifo_dev_wedged. Worker context is
needed because kobject_uevent_env may sleep.
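For readers who want the Tier-2 logic in concrete form, here is a
minimal userspace sketch of a sliding-window fault counter. It is not
the code from recover.c: struct fault_window, fault_window_init() and
fault_window_record() are hypothetical names, timestamps are plain
milliseconds supplied by the caller and assumed monotonic, and a bool
return stands in for scheduling the worker. Locking is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Upper bound of fifo_wedge_count as stated in this cover letter. */
#define WEDGE_COUNT_MAX 32

/* Hypothetical state; the layout used by the actual patch is not shown here. */
struct fault_window {
	uint64_t stamp_ms[WEDGE_COUNT_MAX]; /* ring of recent fault timestamps */
	unsigned int head;                  /* next slot to overwrite */
	unsigned int filled;                /* number of valid slots */
	unsigned int threshold;             /* 0 disables Tier 2 */
	uint64_t window_ms;                 /* sliding window length */
};

static void fault_window_init(struct fault_window *w, unsigned int threshold,
			      uint64_t window_ms)
{
	memset(w, 0, sizeof(*w));
	w->threshold = threshold > WEDGE_COUNT_MAX ? WEDGE_COUNT_MAX : threshold;
	w->window_ms = window_ms;
}

/*
 * Record one fault at now_ms (assumed monotonic). Returns true when the
 * number of faults inside the window has reached the threshold, i.e. when
 * the caller would schedule the wedge worker.
 */
static bool fault_window_record(struct fault_window *w, uint64_t now_ms)
{
	unsigned int i, recent = 0;

	if (w->threshold == 0)
		return false;

	w->stamp_ms[w->head] = now_ms;
	w->head = (w->head + 1) % WEDGE_COUNT_MAX;
	if (w->filled < WEDGE_COUNT_MAX)
		w->filled++;

	/* Count the stored faults that are still inside the window. */
	for (i = 0; i < w->filled; i++)
		if (now_ms - w->stamp_ms[i] < w->window_ms)
			recent++;

	if (recent < w->threshold)
		return false;

	/* Start over so that one burst yields one wedge event. */
	w->head = 0;
	w->filled = 0;
	return true;
}
```

With a threshold of 3 and a 1000 ms window, faults at 0, 100 and 200 ms
trip on the third call, while faults at 0, 2000 and 4000 ms never do.
Resetting the ring after a trip is a design choice of this sketch, not
something the series is stated to do.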
Motivation: Fermi+ gets channel-kill and device-wedge automatically
through nvkm_runl_rc; Tesla was feature-frozen before the DRM wedge
uAPI existed. Three observable consequences on the reference
hardware:
1. Silent state corruption (channel produces wrong output after a
fault, no notice to userspace).
2. Observability gap (no counters, tracepoints, or wedge event,
only dmesg).
3. Repeated-fault loop (the log-and-reset cycle repeats forever on
a persistently faulting channel instead of killing it).
Validation. A debugfs fault-injector (kept on a separate
DO-NOT-MERGE branch, not part of this submission) was used to drive
both Tier-1 and Tier-2 paths through their full state space. Phases
1-5 of the test plan were exercised that way. Phase 6 (no manual
injection, real workload soak) ran 2026-05-05 through 2026-05-13:
one organic DRM_WEDGE_RECOVERY_REBIND event was captured on
2026-05-05 09:08; the rest of the soak was fault-free.
Companion userland tool nouveau-pstate-daemon v0.2.0 [1] subscribes
to the WEDGED=rebind uevent in log-only mode and was used to confirm
end-to-end propagation through udev.
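As an illustration of what a listener has to match, the sketch below
scans a kernel uevent payload (NUL-separated KEY=VALUE records) for a
WEDGED= record that lists a given recovery method. It is not taken from
nouveau-pstate-daemon; uevent_has_wedge_method() is a hypothetical
helper, and the handling of a comma-separated method list is an
assumption based on the generic DRM wedge uevent format. Receiving the
payload (netlink socket or libudev) is out of scope here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the NUL-separated uevent payload in buf[0..len) contains
 * a "WEDGED=" record whose comma-separated value lists `method` exactly.
 */
static bool uevent_has_wedge_method(const char *buf, size_t len,
				    const char *method)
{
	static const char key[] = "WEDGED=";
	const size_t klen = sizeof(key) - 1;
	const size_t mlen = strlen(method);
	size_t off = 0;

	while (off < len) {
		const char *rec = buf + off;
		const char *nul = memchr(rec, '\0', len - off);
		size_t rlen = nul ? (size_t)(nul - rec) : len - off;

		if (rlen > klen && memcmp(rec, key, klen) == 0) {
			const char *p = rec + klen;
			const char *end = rec + rlen;

			/* Walk the comma-separated method list. */
			while (p < end) {
				const char *comma = memchr(p, ',', (size_t)(end - p));
				size_t tok = comma ? (size_t)(comma - p)
						   : (size_t)(end - p);

				if (tok == mlen && memcmp(p, method, mlen) == 0)
					return true;
				if (!comma)
					break;
				p = comma + 1;
			}
		}
		off += rlen + 1; /* skip the record and its terminator */
	}
	return false;
}
```

A log-only consumer would call this with "rebind" on each payload
received for the DRM device and emit a log line on a match. Comparing
whole tokens keeps a value such as "rebinding" from matching "rebind".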
Module parameters:
nouveau.fifo_wedge_count (uint, 0..32, default 10)
nouveau.fifo_wedge_window_ms (uint, 100..600000, default 60000)
Setting fifo_wedge_count=0 disables Tier-2 entirely while keeping
Tier-1 channel-kill active.
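The documented ranges and the count=0 special case can be restated as a
small check. This is only a sketch of the semantics described above;
struct wedge_params and both helpers are hypothetical names, and how the
series itself validates or clamps out-of-range values is not stated
here.

```c
#include <stdbool.h>

/* Hypothetical container for the two module parameters. */
struct wedge_params {
	unsigned int count;     /* nouveau.fifo_wedge_count, 0..32 */
	unsigned int window_ms; /* nouveau.fifo_wedge_window_ms, 100..600000 */
};

/* True if both values lie inside the ranges given in the cover letter. */
static bool wedge_params_valid(const struct wedge_params *p)
{
	if (p->count > 32)
		return false;
	if (p->window_ms < 100 || p->window_ms > 600000)
		return false;
	return true;
}

/* Tier 2 is active only for valid parameters with a non-zero count. */
static bool wedge_tier2_enabled(const struct wedge_params *p)
{
	return wedge_params_valid(p) && p->count != 0;
}
```

Under this reading the defaults (10 faults in 60000 ms) enable Tier 2,
and count=0 with any valid window leaves only Tier-1 channel-kill.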
A note on MAINTAINERS. The series adds a new file
drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c. The change is
covered by the existing nouveau MAINTAINERS section
(drivers/gpu/drm/nouveau/), so no MAINTAINERS update is included.
checkpatch.pl warns about the new file without a MAINTAINERS change;
the warning is informational only.
This is a follow-up to the April 9 NVAC stability series [2], which
is still awaiting review. The two patches here are independent of
that series and apply against current Linus master.
[1] https://github.com/hibbes/nouveau-pstate-daemon (v0.2.0)
[2] https://lore.kernel.org/dri-devel/20260409-nouveau-nvac-stability-series
Marek Czernohous (2):
  drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind
    probe
  drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher

 .../drm/nouveau/include/nvkm/engine/fifo.h    |  12 ++
 .../include/trace/events/nouveau_fifo.h       |  58 +++++++++
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  29 +++++
 .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild   |   1 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/base.c   |   3 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   |  29 ++++-
 .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h   |  10 ++
 .../drm/nouveau/nvkm/engine/fifo/recover.c    | 121 ++++++++++++++++++
 8 files changed, 257 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
 create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c

base-commit: 1f63dd8ca0dc05a8272bb8155f643c691d29bb11
--
2.53.0