[PATCH] drm/msm: always recover the gpu

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

* [PATCH] drm/msm: always recover the gpu
@ 2026-02-10 16:29 Anna Maniscalco
  2026-02-11  6:22 ` Claude review: " Claude Code Review Bot
  2026-02-11  6:22 ` Claude Code Review Bot
  0 siblings, 2 replies; 3+ messages in thread
From: Anna Maniscalco @ 2026-02-10 16:29 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Akhil P Oommen,
	Dmitry Baryshkov, Abhinav Kumar, Jessica Zhang, Marijn Suijten,
	David Airlie, Simona Vetter
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel,
	Anna Maniscalco, stable

Previously, in case there was no more work to do, recover worker
wouldn't trigger recovery and would instead rely on the gpu going to
sleep and then resuming when more work is submitted.

Recover_worker will first increment the fence of the hung ring so, if
there's only one job submitted to a ring and that causes an hang, it
will early out.

There's no guarantee that the gpu will suspend and resume before more
work is submitted and if the gpu is in a hung state it will stay in that
state and probably trigger a timeout again.

Just stop checking and always recover the gpu.

Signed-off-by: Anna Maniscalco <anna.maniscalco2000@gmail.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/msm/msm_gpu.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 995549d0bbbc..ea3e79670f75 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -547,32 +547,30 @@ static void recover_worker(struct kthread_work *work)
 		msm_update_fence(ring->fctx, fence);
 	}
 
-	if (msm_gpu_active(gpu)) {
-		/* retire completed submits, plus the one that hung: */
-		retire_submits(gpu);
+	/* retire completed submits, plus the one that hung: */
+	retire_submits(gpu);
 
-		gpu->funcs->recover(gpu);
+	gpu->funcs->recover(gpu);
 
-		/*
-		 * Replay all remaining submits starting with highest priority
-		 * ring
-		 */
-		for (i = 0; i < gpu->nr_rings; i++) {
-			struct msm_ringbuffer *ring = gpu->rb[i];
-			unsigned long flags;
+	/*
+	 * Replay all remaining submits starting with highest priority
+	 * ring
+	 */
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
+		unsigned long flags;
 
-			spin_lock_irqsave(&ring->submit_lock, flags);
-			list_for_each_entry(submit, &ring->submits, node) {
-				/*
-				 * If the submit uses an unusable vm make sure
-				 * we don't actually run it
-				 */
-				if (to_msm_vm(submit->vm)->unusable)
-					submit->nr_cmds = 0;
-				gpu->funcs->submit(gpu, submit);
-			}
-			spin_unlock_irqrestore(&ring->submit_lock, flags);
+		spin_lock_irqsave(&ring->submit_lock, flags);
+		list_for_each_entry(submit, &ring->submits, node) {
+			/*
+			 * If the submit uses an unusable vm make sure
+			 * we don't actually run it
+			 */
+			if (to_msm_vm(submit->vm)->unusable)
+				submit->nr_cmds = 0;
+			gpu->funcs->submit(gpu, submit);
 		}
+		spin_unlock_irqrestore(&ring->submit_lock, flags);
 	}
 
 	pm_runtime_put(&gpu->pdev->dev);

---
base-commit: 50c4a49f7292b33b454ea1a16c4f77d6965405dc
change-id: 20260210-recovery_suspend_fix-77a65932fa4c

Best regards,
-- 
Anna Maniscalco <anna.maniscalco2000@gmail.com>



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Claude review: drm/msm: always recover the gpu
  2026-02-10 16:29 [PATCH] drm/msm: always recover the gpu Anna Maniscalco
@ 2026-02-11  6:22 ` Claude Code Review Bot
  2026-02-11  6:22 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-02-11  6:22 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Commit Message Analysis:**

The commit message describes the issue but has some clarity problems:

```
Recover_worker will first increment the fence of the hung ring so, if
there's only one job submitted to a ring and that causes an hang, it
will early out.
```

This sentence is unclear. I believe you're saying that after incrementing the fence in the earlier code (lines 137-138 in context), the GPU may appear to have no active work, causing `msm_gpu_active()` to return false. Please clarify this logic flow more explicitly.

```
There's no guarantee that the gpu will suspend and resume before more
work is submitted and if the gpu is in a hung state it will stay in that
state and probably trigger a timeout again.
```

This is the core issue but needs evidence: Have you observed this behavior? What testing demonstrated this? The claim that "it will stay in that state" is critical to justifying this change.

**Technical Analysis:**

The change removes 28 lines (a conditional wrapper) and the code just gets de-indented. Let's examine the implications:

```c
-	if (msm_gpu_active(gpu)) {
-		/* retire completed submits, plus the one that hung: */
-		retire_submits(gpu);
+	/* retire completed submits, plus the one that hung: */
+	retire_submits(gpu);

-		gpu->funcs->recover(gpu);
+	gpu->funcs->recover(gpu);
```

**Questions/Concerns:**

1. **Power management impact:** The original code appeared to rely on suspend/resume to clear the GPU state when idle. By removing this optimization:
   - Will this cause unnecessary recovery operations on idle GPUs?
   - What's the power consumption impact of calling `gpu->funcs->recover()` unconditionally?
   - Does `recover()` do expensive operations even when the GPU is idle?

2. **Race condition analysis:** The commit message mentions "no guarantee that the gpu will suspend and resume before more work is submitted." However:
   - What prevents the GPU from suspending *after* you check `msm_gpu_active()` but *before* calling `recover()`?
   - If work can arrive at any time, wouldn't the same race exist with the original code?
   - Is there locking that prevents work submission during recovery?

3. **Historical context missing:**
   - Why was the `msm_gpu_active()` check added originally? 
   - Was it a power optimization, or was it trying to avoid some other issue?
   - Have you checked `git blame` and the original commit message?

4. **Testing evidence:**
   - What specific test case triggers this bug?
   - Have you verified this fixes the issue?
   - What regression testing has been done to ensure this doesn't break suspend/resume?
   - CC'ing stable suggests this is a real bug - what's the user-visible impact?

**Code-level observations:**

```c
	pm_runtime_put(&gpu->pdev->dev);
```

At line 188 (after your changes), there's a pm_runtime_put(). This pairs with a pm_runtime_get earlier in the function. By always running the recovery path, you're ensuring the GPU stays powered up longer. Is this intentional and acceptable?

The code flow after your change:
1. Increment fences (marks work as complete)
2. Retire submits (may remove all pending work)
3. **ALWAYS** call `gpu->funcs->recover()` (even if GPU now appears idle)
4. Replay all submits from all rings (may be empty lists)
5. Power management put

**Alternative approaches to consider:**

1. Could you check if there are actually pending submits *before* deciding to skip recovery?
2. Should the fence increment happen *after* the active check, not before?
3. Is there a more targeted fix that addresses the specific race without removing the optimization entirely?

**Style/formatting:** The code change itself is clean - pure de-indentation with no logic changes beyond removing the conditional.

**Stability tag:** The `Cc: stable@vger.kernel.org` tag is present but no `Fixes:` tag. If this fixes a specific commit, please add:
```
Fixes: <commit-id> ("original commit title")
```

**Verdict:** The change may be correct, but the justification is insufficient. Before this can be accepted, please provide:

1. More detailed explanation of the race condition or state persistence issue
2. Evidence from testing (logs, traces, reproducers)
3. Analysis of the power management impact
4. Historical context on why the check existed
5. A `Fixes:` tag if this regresses a specific commit

The core insight (that recovery should run even if the GPU appears idle after fence increment) seems reasonable, but needs stronger supporting evidence and impact analysis.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Claude review: drm/msm: always recover the gpu
  2026-02-10 16:29 [PATCH] drm/msm: always recover the gpu Anna Maniscalco
  2026-02-11  6:22 ` Claude review: " Claude Code Review Bot
@ 2026-02-11  6:22 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-02-11  6:22 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/msm: always recover the gpu
Author: Anna Maniscalco <anna.maniscalco2000@gmail.com>
Patches: 1
Reviewed: 2026-02-11T16:22:11.633679

---

This is a single-patch series addressing a GPU recovery bug in the MSM DRM driver. The patch removes a conditional check (`if (msm_gpu_active(gpu))`) that was preventing GPU recovery from executing when the GPU appeared idle after a hang.

**Summary:**
- **Problem identified:** The recover_worker could skip recovery if the GPU appeared idle after incrementing fences, creating a state where the GPU remains hung
- **Solution approach:** Remove the `msm_gpu_active()` check to ensure recovery always executes
- **Risk assessment:** Medium - changes critical error recovery path
- **Testing concerns:** Needs validation across suspend/resume cycles and various workload scenarios

**Key concerns:**
1. The patch removes a power management optimization but doesn't explain all implications
2. Missing analysis of why `msm_gpu_active()` check was originally added
3. Needs testing evidence to support the claim about GPU state persistence
4. Power consumption impact not addressed

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-02-11  6:22 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-10 16:29 [PATCH] drm/msm: always recover the gpu Anna Maniscalco
2026-02-11  6:22 ` Claude review: " Claude Code Review Bot
2026-02-11  6:22 ` Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox