[PATCH] drm: msm: adreno: attempt to recover from ringbuffer drain timeout

* [PATCH] drm: msm: adreno: attempt to recover from ringbuffer drain timeout
@ 2026-03-08 21:12 Paul Sajna
  2026-03-08 21:37 ` Claude review: " Claude Code Review Bot
  2026-03-08 21:37 ` Claude Code Review Bot
  0 siblings, 2 replies; 3+ messages in thread
From: Paul Sajna @ 2026-03-08 21:12 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Akhil P Oommen,
	Dmitry Baryshkov, Abhinav Kumar, Jessica Zhang, Marijn Suijten,
	David Airlie, Simona Vetter, Alexey Minnekhanov
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, phone-devel,
	~postmarketos/upstreaming, Paul Sajna

I found a 13-year-old TODO while debugging GPU stalls on sdm6xx/a5xx
and thought I might as well try to implement it. It doesn't fully
resolve all stalls in the driver, but it's a start.

[drm:adreno_idle [msm]] *ERROR* 5.0.9.0: timeout waiting to drain ringbuffer 0 rptr/wptr = 32C/C
msm_dpu c901000.display-controller: CP | opcode error | possible opcode=0x00000000
msm_dpu c901000.display-controller: [drm:a5xx_irq [msm]] *ERROR* gpu fault ring 0 fence 29 status 800001C1 rb 0380/000c ib1 0000000001898000/0000 ib2 000000000366D000/0000
[drm:adreno_idle [msm]] *ERROR* 5.0.9.0: timeout waiting to drain ringbuffer 0 rptr/wptr = 32C/C
msm_dpu c901000.display-controller: [drm:a5xx_irq [msm]] *ERROR* gpu fault ring 0 fence 29 status 800001C1 rb 000c/000c ib1 0000000001898000/0000 ib2 000000000366D000/0000
[drm:adreno_idle [msm]] *ERROR* 5.0.9.0: timeout waiting to drain ringbuffer 0 rptr/wptr = 32C/C
msm_dpu c901000.display-controller: [drm:a5xx_irq [msm]] *ERROR* gpu fault ring 0 fence 29 status 800001C1 rb 0051/000c ib1 0000000001898000/0000 ib2 000000000366D000/0000
[drm:adreno_idle [msm]] *ERROR* 5.0.9.0: timeout waiting to drain ringbuffer 0 rptr/wptr = 32C/C
msm_dpu c901000.display-controller: [drm:recover_worker [msm]] *ERROR* 5.0.9.0: hangcheck recover!
msm_dpu c901000.display-controller: [drm:a5xx_irq [msm]] *ERROR* gpu fault ring 0 fence 29 status 800001C1 rb 000c/000c ib1 0000000001898000/0000 ib2 000000000366D000/0000
msm_dpu c901000.display-controller: [drm:recover_worker [msm]] *ERROR* 5.0.9.0: offending task: sway (sway -c /home/user/.config/sxmo/sway)
watchdog: CPU1: Watchdog detected hard LOCKUP on cpu 2

Signed-off-by: Paul Sajna <sajattack@postmarketos.org>
---
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index d5fe6f6f0dec..77cda368eba1 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -742,10 +742,11 @@ bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 	if (!spin_until(get_rptr(adreno_gpu, ring) == wptr))
 		return true;
 
-	/* TODO maybe we need to reset GPU here to recover from hang? */
 	DRM_ERROR("%s: timeout waiting to drain ringbuffer %d rptr/wptr = %X/%X\n",
 		gpu->name, ring->id, get_rptr(adreno_gpu, ring), wptr);
 
+	adreno_gpu->funcs->base.recover(gpu);
+
 	return false;
 }
 

---
base-commit: 52584178a10aa82d80aadda690f4bbc76d92ddda
change-id: 20260308-adreno-ringbuffer-drain-timeout-recovery-617ea69813fc

Best regards,
-- 
Paul Sajna <sajattack@postmarketos.org>



* Claude review: drm: msm: adreno: attempt to recover from ringbuffer drain timeout
  2026-03-08 21:12 [PATCH] drm: msm: adreno: attempt to recover from ringbuffer drain timeout Paul Sajna
  2026-03-08 21:37 ` Claude review: " Claude Code Review Bot
@ 2026-03-08 21:37 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-03-08 21:37 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm: msm: adreno: attempt to recover from ringbuffer drain timeout
Author: Paul Sajna <sajattack@postmarketos.org>
Patches: 1
Reviewed: 2026-03-09T07:37:27.592702

---

This is a single patch that attempts to address a 13-year-old TODO in the adreno GPU driver by calling `recover()` when a ringbuffer drain timeout occurs in `adreno_idle()`. While the intent is reasonable, the implementation has **serious problems** that would likely cause kernel crashes, deadlocks, or other undefined behavior. The `recover()` path is designed to be called from `recover_worker()` under specific locking and state management conditions that are not met when calling it directly from `adreno_idle()`.

**Not suitable for merging as-is.** The approach needs significant rethinking.

---
Generated by Claude Code Patch Reviewer


* Claude review: drm: msm: adreno: attempt to recover from ringbuffer drain timeout
  2026-03-08 21:12 [PATCH] drm: msm: adreno: attempt to recover from ringbuffer drain timeout Paul Sajna
@ 2026-03-08 21:37 ` Claude Code Review Bot
  2026-03-08 21:37 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-03-08 21:37 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Critical issues:**

1. **Missing locking context.** The `recover_worker()` in `msm_gpu.c:463` takes `mutex_lock(&gpu->lock)` before calling `gpu->funcs->recover(gpu)`. The new call in `adreno_idle()` calls `recover()` without taking `gpu->lock`. Callers of `adreno_idle()` (e.g., during `hw_init`, `pm_suspend`, CP register writes) do not necessarily hold this lock, while others may already hold it. The result is missing synchronization on the former paths, and simply taking the lock inside `adreno_idle()` would deadlock on the latter.

2. **Bypasses critical recovery bookkeeping.** The `recover_worker()` does substantial work before calling `gpu->funcs->recover()`:
   - It finds the faulting submit (`find_submit()`)
   - Increments fault counters
   - Captures crash state (`msm_gpu_crashstate_capture()`)
   - Advances fences to skip the hung submit (`ring->memptrs->fence = ++fence`)
   - Retires completed submits (`retire_submits()`)
   - Replays remaining submits after recovery

   The patch skips all of this. Without fence advancement and submit retirement, the GPU will likely re-execute the same faulting command after recovery, causing an infinite hang-recover loop.

3. **Calling recover from within idle is architecturally wrong.** `adreno_idle()` is called from many contexts including `hw_init` paths. For example, `a5xx_gpu.c:975` calls `a5xx_idle()` during `a5xx_hw_init()`. The `adreno_recover()` function itself calls `msm_gpu_hw_init()` (`adreno_gpu.c:709`), which re-enters the init path and calls idle again. If the ring still does not drain, that idle call times out and triggers another recovery, creating **unbounded recursion**.

4. **No pm_runtime handling.** The `recover_worker` does `pm_runtime_get_sync()` before calling recover and `pm_runtime_put()` after. The patch has no such protection.

5. **The actual call:**
   ```c
   adreno_gpu->funcs->base.recover(gpu);
   ```
   This calls the per-generation recover (e.g., `a5xx_recover`), not the base `adreno_recover()`. The per-gen recover functions (e.g., `a4xx_recover` at `a4xx_gpu.c:350`) do things like dump registers, which may have their own locking assumptions.
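
As an illustration of point 2, the following is a minimal userspace model, not kernel code, of why the fence has to move past the hung submit before anything is replayed. Every name in it (`fence_ring`, `fence_replay`, `fence_demo_final_fence`) is hypothetical, and it assumes, as described above, that the hung submit is the first unretired one and that the worker-style path bumps the fence by one before replaying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FENCE_MAX_SUBMITS 8

/* Hypothetical stand-in for a ring; submits are identified by fence seqno. */
struct fence_ring {
	uint32_t completed_fence;            /* last seqno the modelled GPU signalled */
	uint32_t submits[FENCE_MAX_SUBMITS]; /* queued seqnos, oldest first */
	size_t nr_submits;
	uint32_t hung_fence;                 /* seqno of the submit that hangs */
};

/*
 * Re-run every submit that is not yet retired, oldest first.
 * Returns false as soon as the hanging submit is executed again.
 */
static bool fence_replay(struct fence_ring *ring)
{
	assert(ring->nr_submits <= FENCE_MAX_SUBMITS);

	for (size_t i = 0; i < ring->nr_submits; i++) {
		uint32_t seqno = ring->submits[i];

		if (seqno <= ring->completed_fence)
			continue; /* treated as already retired */
		if (seqno == ring->hung_fence)
			return false; /* the same hang again */
		ring->completed_fence = seqno;
	}
	return true;
}

/*
 * with_bookkeeping == false models calling recover directly;
 * with_bookkeeping == true models advancing the fence first.
 * Returns the last completed seqno after one recovery attempt.
 */
uint32_t fence_demo_final_fence(bool with_bookkeeping)
{
	struct fence_ring ring = {
		.completed_fence = 28,
		.submits = { 29, 30, 31 },
		.nr_submits = 3,
		.hung_fence = 29,
	};

	if (with_bookkeeping)
		ring.completed_fence += 1; /* skip the hung submit */

	fence_replay(&ring);
	return ring.completed_fence;
}
```

With submits 29, 30 and 31 queued and 29 hanging, `fence_demo_final_fence(false)` stays at 28 because the replay runs into the same submit again, while `fence_demo_final_fence(true)` reaches 31.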

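Point 3 can be pictured with a small userspace model (again not kernel code, and all `reentry_*` names are hypothetical). It assumes the call chain described above, from idle to recover to hw_init and back to idle, and a hang that survives the reset. `REENTRY_DEPTH_LIMIT` exists only so the model terminates; the patched code has no such cap.

```c
#include <assert.h>
#include <stdbool.h>

#define REENTRY_DEPTH_LIMIT 16 /* artificial cap so the model terminates */

struct reentry_gpu {
	bool hung;            /* the ring never drains while this is true */
	bool recover_in_idle; /* true models the patched idle function */
	int depth;            /* current nesting of idle calls */
	int max_depth;        /* deepest nesting observed */
};

static bool reentry_idle(struct reentry_gpu *gpu);

/* Models an init path that waits for the GPU to go idle. */
static bool reentry_hw_init(struct reentry_gpu *gpu)
{
	return reentry_idle(gpu);
}

/* Models a recover path that re-runs hardware init after the reset. */
static void reentry_recover(struct reentry_gpu *gpu)
{
	/* Assumption: the reset does not clear the hang. */
	reentry_hw_init(gpu);
}

static bool reentry_idle(struct reentry_gpu *gpu)
{
	bool drained = !gpu->hung;

	assert(gpu->depth < REENTRY_DEPTH_LIMIT);
	gpu->depth++;
	if (gpu->depth > gpu->max_depth)
		gpu->max_depth = gpu->depth;

	/* On timeout the patched variant recovers inline, re-entering idle. */
	if (!drained && gpu->recover_in_idle &&
	    gpu->depth < REENTRY_DEPTH_LIMIT)
		reentry_recover(gpu);

	gpu->depth--;
	return drained;
}

/* Returns the deepest idle nesting reached during one init attempt. */
int reentry_max_idle_depth(bool hung, bool recover_in_idle)
{
	struct reentry_gpu gpu = {
		.hung = hung,
		.recover_in_idle = recover_in_idle,
	};

	reentry_hw_init(&gpu);
	return gpu.max_depth;
}
```

For a GPU that stays hung, `reentry_max_idle_depth(true, true)` runs into the artificial limit of 16, whereas without the inline recover call the nesting depth stays at 1.
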
**Recommendation:** The right approach is to schedule recovery through the existing `recover_worker` mechanism rather than calling `recover()` directly:

```c
kthread_queue_work(gpu->worker, &gpu->recover_work);
```

This is what the fault/hang IRQ handlers already do (see `a8xx_gpu.c:939`, `a5xx_preempt.c:91`, etc.). It ensures proper locking, fence management, crashstate capture, and submit replay all happen correctly. However, even that needs care — `adreno_idle()` callers currently expect synchronous behavior (they check the return value), so queuing async recovery would require adjusting the callers too.
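
A rough userspace sketch of that deferred shape follows. It is not driver code: the `deferred_*` names are hypothetical, the `recover_pending` flag stands in for a queued `recover_work` item, and the model assumes the reset clears the hang. It relies on one property of `kthread_queue_work()`, namely that queueing a work item that is already pending returns false instead of queueing it twice.

```c
#include <assert.h>
#include <stdbool.h>

struct deferred_gpu {
	bool hung;            /* the ring does not drain while this is true */
	bool recover_pending; /* stands in for a queued recover_work item */
	bool lock_held;       /* stands in for gpu->lock */
	int recoveries;       /* how many times recovery actually ran */
};

/* Request recovery; returns false if a request is already pending. */
static bool deferred_queue_recover(struct deferred_gpu *gpu)
{
	if (gpu->recover_pending)
		return false;
	gpu->recover_pending = true;
	return true;
}

/* Idle never recovers inline: it requests recovery and reports failure. */
static bool deferred_idle(struct deferred_gpu *gpu)
{
	if (!gpu->hung)
		return true;
	deferred_queue_recover(gpu);
	return false;
}

/* Runs later, in its own context, and owns the locking. */
static void deferred_recover_worker(struct deferred_gpu *gpu)
{
	if (!gpu->recover_pending)
		return;
	gpu->recover_pending = false;

	assert(!gpu->lock_held);
	gpu->lock_held = true;
	gpu->recoveries++;
	gpu->hung = false; /* assumption: the reset clears the hang */
	gpu->lock_held = false;
}

/* Number of recoveries run after `timeouts` failed idle waits. */
int deferred_recoveries(int timeouts)
{
	struct deferred_gpu gpu = { .hung = true };

	for (int i = 0; i < timeouts; i++)
		deferred_idle(&gpu);
	deferred_recover_worker(&gpu);
	return gpu.recoveries;
}
```

In this model `deferred_recoveries(3)` returns 1: three timeouts before the worker runs still produce a single recovery. The idle function itself only reports failure, which is why callers that depend on its synchronous return value would need adjusting.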

---
Generated by Claude Code Patch Reviewer
