From mboxrd@z Thu Jan 1 00:00:00 1970
From: Claude Code Review Bot
To: dri-devel-reviews@example.com
Subject: Claude review: drm: msm: adreno: attempt to recover from ringbuffer drain timeout
Date: Mon, 09 Mar 2026 07:37:27 +1000
Message-ID: 
In-Reply-To: <20260308-adreno-ringbuffer-drain-timeout-recovery-v1-1-985a33faf108@postmarketos.org>
References: <20260308-adreno-ringbuffer-drain-timeout-recovery-v1-1-985a33faf108@postmarketos.org>
X-Mailer: Claude Code Patch Reviewer
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Patch Review

**Critical issues:**

1. **Missing locking context.** The `recover_worker()` in `msm_gpu.c:463` takes `mutex_lock(&gpu->lock)` before calling `gpu->funcs->recover(gpu)`. The new call in `adreno_idle()` invokes `recover()` without holding `gpu->lock`. Many callers of `adreno_idle()` (e.g., during `hw_init`, `pm_suspend`, CP register writes) do not necessarily hold this lock, and some may already hold it, so the patch risks either missing synchronization or a deadlock.

2. **Bypasses critical recovery bookkeeping.** `recover_worker()` does substantial work before calling `gpu->funcs->recover()`:

   - It finds the faulting submit (`find_submit()`)
   - Increments fault counters
   - Captures crash state (`msm_gpu_crashstate_capture()`)
   - Advances fences to skip the hung submit (`ring->memptrs->fence = ++fence`)
   - Retires completed submits (`retire_submits()`)
   - Replays remaining submits after recovery

   The patch skips all of this. Without fence advancement and submit retirement, the GPU will likely re-execute the same faulting command after recovery, causing an infinite hang-recover loop.

3. **Calling recover from within idle is architecturally wrong.** `adreno_idle()` is called from many contexts, including `hw_init` paths.
For example, `a5xx_gpu.c:975` calls `a5xx_idle()` during `a5xx_hw_init()`. The `adreno_recover()` function itself calls `msm_gpu_hw_init()` (`adreno_gpu.c:709`), which would re-enter the init path and call idle again, creating **infinite recursion**.

4. **No pm_runtime handling.** The `recover_worker` does `pm_runtime_get_sync()` before calling recover and `pm_runtime_put()` after. The patch has no such protection.

5. **The actual call:**

```c
adreno_gpu->funcs->base.recover(gpu);
```

   This calls the per-generation recover (e.g., `a5xx_recover`), not the base `adreno_recover()`. The per-gen recover functions (e.g., `a4xx_recover` at `a4xx_gpu.c:350`) do things like dump registers, which may have their own locking assumptions.

**Recommendation:** The right approach is to schedule recovery through the existing `recover_worker` mechanism rather than calling `recover()` directly:

```c
kthread_queue_work(gpu->worker, &gpu->recover_work);
```

This is what the fault/hang IRQ handlers already do (see `a8xx_gpu.c:939`, `a5xx_preempt.c:91`, etc.). It ensures proper locking, fence management, crashstate capture, and submit replay all happen correctly. However, even that needs care: `adreno_idle()` callers currently expect synchronous behavior (they check the return value), so queuing async recovery would require adjusting the callers too.

---

Generated by Claude Code Patch Reviewer