public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH 0/2] drm/imagination: Drain interrupts before suspending the GPU
@ 2026-03-10 11:41 Alessio Belle
  2026-03-10 11:41 ` [PATCH 1/2] drm/imagination: Synchronize " Alessio Belle
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Alessio Belle @ 2026-03-10 11:41 UTC (permalink / raw)
  To: Frank Binns, Matt Coster, Brajesh Gupta, Alexandru Dadu,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter
  Cc: dri-devel, linux-kernel, Alessio Belle, stable

The first commit is the actual fix and will need backporting to stable
branches, though some of the code being removed didn't exist prior to
6.16 so the patch will need adaptations.

The second commit tries to prevent similar issues and doesn't need
backporting.

Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
---
Alessio Belle (2):
      drm/imagination: Synchronize interrupts before suspending the GPU
      drm/imagination: Disable interrupts before suspending the GPU

 drivers/gpu/drm/imagination/pvr_device.c | 17 --------------
 drivers/gpu/drm/imagination/pvr_power.c  | 40 +++++++++++++++++++++++---------
 2 files changed, 29 insertions(+), 28 deletions(-)
---
base-commit: d2e20c8951e4bb5f4a828aed39813599980353b6
change-id: 20260309-drain-irqs-before-suspend-8f9c656516f4

Best regards,
-- 
Alessio Belle <alessio.belle@imgtec.com>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] drm/imagination: Synchronize interrupts before suspending the GPU
  2026-03-10 11:41 [PATCH 0/2] drm/imagination: Drain interrupts before suspending the GPU Alessio Belle
@ 2026-03-10 11:41 ` Alessio Belle
  2026-03-11  3:25   ` Claude review: " Claude Code Review Bot
  2026-03-10 11:41 ` [PATCH 2/2] drm/imagination: Disable " Alessio Belle
  2026-03-11  3:25 ` Claude review: drm/imagination: Drain " Claude Code Review Bot
  2 siblings, 1 reply; 6+ messages in thread
From: Alessio Belle @ 2026-03-10 11:41 UTC (permalink / raw)
  To: Frank Binns, Matt Coster, Brajesh Gupta, Alexandru Dadu,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter
  Cc: dri-devel, linux-kernel, Alessio Belle, stable

The runtime PM suspend callback doesn't know whether the IRQ handler is
in progress on a different CPU core and doesn't wait for it to finish.

Depending on timing, the IRQ handler could be running while the GPU is
suspended, leading to kernel crashes when trying to access GPU
registers. See example signature below.

In a power off sequence initiated by the runtime PM suspend callback,
wait for any IRQ handlers in progress on other CPU cores to finish, by
calling synchronize_irq().

At the same time, remove the runtime PM resume/put calls in the threaded
IRQ handler. On top of not being the right approach to begin with, and
being at the wrong place as they should have wrapped all GPU register
accesses, the driver would hit a deadlock between synchronize_irq()
being called from a runtime PM suspend callback, holding the device
power lock, and the resume callback requiring the same.

Example crash signature on a TI AM68 SK platform:

  [  337.241218] SError Interrupt on CPU0, code 0x00000000bf000000 -- SError
  [  337.241239] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G   M                6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
  [  337.241246] Tainted: [M]=MACHINE_CHECK
  [  337.241249] Hardware name: Texas Instruments AM68 SK (DT)
  [  337.241252] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [  337.241256] pc : pvr_riscv_irq_pending+0xc/0x24
  [  337.241277] lr : pvr_device_irq_thread_handler+0x64/0x310
  [  337.241282] sp : ffff800085b0bd30
  [  337.241284] x29: ffff800085b0bd50 x28: ffff0008070d9eab x27: ffff800083a5ce10
  [  337.241291] x26: ffff000806e48f80 x25: ffff0008070d9eac x24: 0000000000000000
  [  337.241296] x23: ffff0008068e9bf0 x22: ffff0008068e9bd0 x21: ffff800085b0bd30
  [  337.241301] x20: ffff0008070d9e00 x19: ffff0008068e9000 x18: 0000000000000001
  [  337.241305] x17: 637365645f656c70 x16: 0000000000000000 x15: ffff000b7df9ff40
  [  337.241310] x14: 0000a585fe3c0d0e x13: 000000999704f060 x12: 000000000002771a
  [  337.241314] x11: 00000000000000c0 x10: 0000000000000af0 x9 : ffff800085b0bd00
  [  337.241318] x8 : ffff0008071175d0 x7 : 000000000000b955 x6 : 0000000000000003
  [  337.241323] x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000
  [  337.241327] x2 : ffff800080e39d20 x1 : ffff800080e3fc48 x0 : 0000000000000000
  [  337.241333] Kernel panic - not syncing: Asynchronous SError Interrupt
  [  337.241337] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G   M                6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
  [  337.241342] Tainted: [M]=MACHINE_CHECK
  [  337.241343] Hardware name: Texas Instruments AM68 SK (DT)
  [  337.241345] Call trace:
  [  337.241348]  show_stack+0x18/0x24 (C)
  [  337.241357]  dump_stack_lvl+0x60/0x80
  [  337.241364]  dump_stack+0x18/0x24
  [  337.241368]  vpanic+0x124/0x2ec
  [  337.241373]  abort+0x0/0x4
  [  337.241377]  add_taint+0x0/0xbc
  [  337.241384]  arm64_serror_panic+0x70/0x80
  [  337.241389]  do_serror+0x3c/0x74
  [  337.241392]  el1h_64_error_handler+0x30/0x48
  [  337.241400]  el1h_64_error+0x6c/0x70
  [  337.241404]  pvr_riscv_irq_pending+0xc/0x24 (P)
  [  337.241410]  irq_thread_fn+0x2c/0xb0
  [  337.241416]  irq_thread+0x170/0x334
  [  337.241421]  kthread+0x12c/0x210
  [  337.241428]  ret_from_fork+0x10/0x20
  [  337.241434] SMP: stopping secondary CPUs
  [  337.241451] Kernel Offset: disabled
  [  337.241453] CPU features: 0x040000,02002800,20002001,0400421b
  [  337.241456] Memory Limit: none
  [  337.457921] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

Fixes: cc1aeedb98ad ("drm/imagination: Implement firmware infrastructure and META FW support")
Fixes: 96822d38ff57 ("drm/imagination: Handle Rogue safety event IRQs")
Cc: <stable@vger.kernel.org> # see patch description, needs adjustments for < 6.16
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>

---

96822d38ff57 only appeared in 6.16, so the change in pvr_device.c
isn't needed before that, but the rest of the fix is applicable to
earlier stable branches which contain cc1aeedb98ad.
---
 drivers/gpu/drm/imagination/pvr_device.c | 17 -----------------
 drivers/gpu/drm/imagination/pvr_power.c  | 11 ++++++++---
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/imagination/pvr_device.c b/drivers/gpu/drm/imagination/pvr_device.c
index f58bb66a6327..dbb6f5a8ded1 100644
--- a/drivers/gpu/drm/imagination/pvr_device.c
+++ b/drivers/gpu/drm/imagination/pvr_device.c
@@ -225,29 +225,12 @@ static irqreturn_t pvr_device_irq_thread_handler(int irq, void *data)
 	}
 
 	if (pvr_dev->has_safety_events) {
-		int err;
-
-		/*
-		 * Ensure the GPU is powered on since some safety events (such
-		 * as ECC faults) can happen outside of job submissions, which
-		 * are otherwise the only time a power reference is held.
-		 */
-		err = pvr_power_get(pvr_dev);
-		if (err) {
-			drm_err_ratelimited(drm_dev,
-					    "%s: could not take power reference (%d)\n",
-					    __func__, err);
-			return ret;
-		}
-
 		while (pvr_device_safety_irq_pending(pvr_dev)) {
 			pvr_device_safety_irq_clear(pvr_dev);
 			pvr_device_handle_safety_events(pvr_dev);
 
 			ret = IRQ_HANDLED;
 		}
-
-		pvr_power_put(pvr_dev);
 	}
 
 	return ret;
diff --git a/drivers/gpu/drm/imagination/pvr_power.c b/drivers/gpu/drm/imagination/pvr_power.c
index 7a8765c0c1ed..f50cdea30680 100644
--- a/drivers/gpu/drm/imagination/pvr_power.c
+++ b/drivers/gpu/drm/imagination/pvr_power.c
@@ -90,7 +90,7 @@ pvr_power_request_pwr_off(struct pvr_device *pvr_dev)
 }
 
 static int
-pvr_power_fw_disable(struct pvr_device *pvr_dev, bool hard_reset)
+pvr_power_fw_disable(struct pvr_device *pvr_dev, bool hard_reset, bool rpm_suspend)
 {
 	if (!hard_reset) {
 		int err;
@@ -106,6 +106,11 @@ pvr_power_fw_disable(struct pvr_device *pvr_dev, bool hard_reset)
 			return err;
 	}
 
+	if (rpm_suspend) {
+		/* Wait for late processing of GPU or firmware IRQs in other cores */
+		synchronize_irq(pvr_dev->irq);
+	}
+
 	return pvr_fw_stop(pvr_dev);
 }
 
@@ -361,7 +366,7 @@ pvr_power_device_suspend(struct device *dev)
 		return -EIO;
 
 	if (pvr_dev->fw_dev.booted) {
-		err = pvr_power_fw_disable(pvr_dev, false);
+		err = pvr_power_fw_disable(pvr_dev, false, true);
 		if (err)
 			goto err_drm_dev_exit;
 	}
@@ -518,7 +523,7 @@ pvr_power_reset(struct pvr_device *pvr_dev, bool hard_reset)
 			queues_disabled = true;
 		}
 
-		err = pvr_power_fw_disable(pvr_dev, hard_reset);
+		err = pvr_power_fw_disable(pvr_dev, hard_reset, false);
 		if (!err) {
 			if (hard_reset) {
 				pvr_dev->fw_dev.booted = false;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] drm/imagination: Disable interrupts before suspending the GPU
  2026-03-10 11:41 [PATCH 0/2] drm/imagination: Drain interrupts before suspending the GPU Alessio Belle
  2026-03-10 11:41 ` [PATCH 1/2] drm/imagination: Synchronize " Alessio Belle
@ 2026-03-10 11:41 ` Alessio Belle
  2026-03-11  3:25   ` Claude review: " Claude Code Review Bot
  2026-03-11  3:25 ` Claude review: drm/imagination: Drain " Claude Code Review Bot
  2 siblings, 1 reply; 6+ messages in thread
From: Alessio Belle @ 2026-03-10 11:41 UTC (permalink / raw)
  To: Frank Binns, Matt Coster, Brajesh Gupta, Alexandru Dadu,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter
  Cc: dri-devel, linux-kernel, Alessio Belle

This is an additional safety layer to ensure no accesses to the GPU
registers can be made while it is powered off.

While we can disable IRQ generation from GPU, META firmware, MIPS
firmware and for safety events, we cannot do the same for the RISC-V
firmware.
To keep a unified approach, once the firmware has completed its power
off sequence, disable IRQs for the while GPU at the kernel level
instead.

Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
---
 drivers/gpu/drm/imagination/pvr_power.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/imagination/pvr_power.c b/drivers/gpu/drm/imagination/pvr_power.c
index f50cdea30680..3f22b07e3301 100644
--- a/drivers/gpu/drm/imagination/pvr_power.c
+++ b/drivers/gpu/drm/imagination/pvr_power.c
@@ -92,9 +92,9 @@ pvr_power_request_pwr_off(struct pvr_device *pvr_dev)
 static int
 pvr_power_fw_disable(struct pvr_device *pvr_dev, bool hard_reset, bool rpm_suspend)
 {
-	if (!hard_reset) {
-		int err;
+	int err;
 
+	if (!hard_reset) {
 		cancel_delayed_work_sync(&pvr_dev->watchdog.work);
 
 		err = pvr_power_request_idle(pvr_dev);
@@ -107,33 +107,46 @@ pvr_power_fw_disable(struct pvr_device *pvr_dev, bool hard_reset, bool rpm_suspe
 	}
 
 	if (rpm_suspend) {
-		/* Wait for late processing of GPU or firmware IRQs in other cores */
-		synchronize_irq(pvr_dev->irq);
+		/* This also waits for late processing of GPU or firmware IRQs in other cores */
+		disable_irq(pvr_dev->irq);
 	}
 
-	return pvr_fw_stop(pvr_dev);
+	err = pvr_fw_stop(pvr_dev);
+	if (err && rpm_suspend)
+		enable_irq(pvr_dev->irq);
+
+	return err;
 }
 
 static int
-pvr_power_fw_enable(struct pvr_device *pvr_dev)
+pvr_power_fw_enable(struct pvr_device *pvr_dev, bool rpm_resume)
 {
 	int err;
 
+	if (rpm_resume)
+		enable_irq(pvr_dev->irq);
+
 	err = pvr_fw_start(pvr_dev);
 	if (err)
-		return err;
+		goto out;
 
 	err = pvr_wait_for_fw_boot(pvr_dev);
 	if (err) {
 		drm_err(from_pvr_device(pvr_dev), "Firmware failed to boot\n");
 		pvr_fw_stop(pvr_dev);
-		return err;
+		goto out;
 	}
 
 	queue_delayed_work(pvr_dev->sched_wq, &pvr_dev->watchdog.work,
 			   msecs_to_jiffies(WATCHDOG_TIME_MS));
 
 	return 0;
+
+out:
+	if (rpm_resume)
+		disable_irq(pvr_dev->irq);
+
+	return err;
 }
 
 bool
@@ -396,7 +409,7 @@ pvr_power_device_resume(struct device *dev)
 		goto err_drm_dev_exit;
 
 	if (pvr_dev->fw_dev.booted) {
-		err = pvr_power_fw_enable(pvr_dev);
+		err = pvr_power_fw_enable(pvr_dev, true);
 		if (err)
 			goto err_power_off;
 	}
@@ -546,7 +559,7 @@ pvr_power_reset(struct pvr_device *pvr_dev, bool hard_reset)
 
 			pvr_fw_irq_clear(pvr_dev);
 
-			err = pvr_power_fw_enable(pvr_dev);
+			err = pvr_power_fw_enable(pvr_dev, false);
 		}
 
 		if (err && hard_reset)

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Claude review: drm/imagination: Drain interrupts before suspending the GPU
  2026-03-10 11:41 [PATCH 0/2] drm/imagination: Drain interrupts before suspending the GPU Alessio Belle
  2026-03-10 11:41 ` [PATCH 1/2] drm/imagination: Synchronize " Alessio Belle
  2026-03-10 11:41 ` [PATCH 2/2] drm/imagination: Disable " Alessio Belle
@ 2026-03-11  3:25 ` Claude Code Review Bot
  2 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-03-11  3:25 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/imagination: Drain interrupts before suspending the GPU
Author: Alessio Belle <alessio.belle@imgtec.com>
Patches: 3
Reviewed: 2026-03-11T13:25:23.778248

---

This is a well-structured 2-patch series fixing a real crash where the GPU threaded IRQ handler can race with runtime PM suspend, accessing GPU registers after the GPU is powered off. The crash signature (SError on ARM64 in `pvr_riscv_irq_pending`) clearly demonstrates the problem. The fix is sound: patch 1 adds `synchronize_irq()` as a minimal stable-backportable fix, and patch 2 strengthens it to `disable_irq()` for a more robust approach. The series is correctly split for stable backporting purposes.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Claude review: drm/imagination: Synchronize interrupts before suspending the GPU
  2026-03-10 11:41 ` [PATCH 1/2] drm/imagination: Synchronize " Alessio Belle
@ 2026-03-11  3:25   ` Claude Code Review Bot
  0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-03-11  3:25 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Correctness: Good**

The core fix is adding `synchronize_irq()` in `pvr_power_fw_disable()` when called from the runtime PM suspend path (`rpm_suspend=true`). This ensures any in-progress threaded IRQ handler completes before `pvr_fw_stop()` shuts down the firmware and the GPU is powered off.

The removal of `pvr_power_get()`/`pvr_power_put()` from the safety event handling in the threaded IRQ handler is correct and necessary. The commit message explains this well:

1. The threaded IRQ handler is already running because an actual hardware interrupt occurred, so the GPU must be powered on at that point.
2. The `synchronize_irq()` in the suspend path blocks the suspend callback until the threaded handler finishes, so the GPU stays powered throughout the handler's execution.
3. Keeping `pvr_power_get()` in the handler would deadlock: `synchronize_irq()` is called from the runtime PM suspend callback (which holds the PM lock), and `pvr_power_get()` → `pm_runtime_resume_and_get()` requires that same lock.

**Minor observation**: The `rpm_suspend` boolean parameter added to `pvr_power_fw_disable` is a reasonable approach, though an alternative would have been to call `synchronize_irq()` directly in `pvr_power_device_suspend()`. The chosen approach keeps the synchronization close to the firmware stop, which is arguably the right abstraction level.

**Note on the hardirq handler**: The primary hardirq handler (`pvr_device_irq_handler`) also accesses GPU registers via `pvr_fw_irq_pending()` and `pvr_device_safety_irq_pending()`. There is a small theoretical window between `synchronize_irq()` returning and `pvr_fw_stop()` completing where a new hardirq could fire and access registers. Patch 2 addresses this properly.

**Stable backport note**: The Fixes/Cc-stable tags and the below-the-fold explanation about which commit affects which kernel version are well done, making backporting easier for the stable maintainers.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Claude review: drm/imagination: Disable interrupts before suspending the GPU
  2026-03-10 11:41 ` [PATCH 2/2] drm/imagination: Disable " Alessio Belle
@ 2026-03-11  3:25   ` Claude Code Review Bot
  0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-03-11  3:25 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Correctness: Good, with a minor issue**

This patch upgrades `synchronize_irq()` to `disable_irq()` in the suspend path, which is strictly stronger: it both waits for in-progress handlers AND prevents new IRQs from being delivered. The corresponding `enable_irq()` is added to the resume path. This closes the theoretical window noted above where a new hardirq could fire after `synchronize_irq()` returns.

**IRQ disable depth accounting**: I verified the interaction with `pvr_power_reset()`, which independently calls `disable_irq()`/`enable_irq()` around its reset loop. During hard reset, it calls `pvr_power_device_suspend()` and `pvr_power_device_resume()`, but because `pvr_dev->fw_dev.booted` is set to `false` before those calls, both functions skip the `pvr_power_fw_disable()`/`pvr_power_fw_enable()` calls, avoiding double disable/enable. The accounting is correct.

**Error path in `pvr_power_fw_enable`**: The error rollback is correct — if `pvr_fw_start()` or `pvr_wait_for_fw_boot()` fails during resume, `disable_irq()` is called to re-disable the IRQ, keeping the state consistent.

**Error path in `pvr_power_fw_disable`**: Similarly, if `pvr_fw_stop()` fails during suspend, `enable_irq()` is called to restore the previous state.

**Typo in commit message**:
> disable IRQs for the while GPU at the kernel level

Should be "for the **whole** GPU".

**Minor style nit**: The `int err;` declaration is hoisted out of the `if (!hard_reset)` block to the function scope. This is necessary for the new code after the `if` block, but the `err` variable inside the `if (!hard_reset)` block is now used before the new assignment after the block. This is fine functionally, just noting the change in scoping was needed and is correct.

**Overall**: Both patches are well-motivated, correctly implemented, and the series split makes sense for backporting. The only actionable item is the "while" → "whole" typo in patch 2's commit message.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-11  3:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-10 11:41 [PATCH 0/2] drm/imagination: Drain interrupts before suspending the GPU Alessio Belle
2026-03-10 11:41 ` [PATCH 1/2] drm/imagination: Synchronize " Alessio Belle
2026-03-11  3:25   ` Claude review: " Claude Code Review Bot
2026-03-10 11:41 ` [PATCH 2/2] drm/imagination: Disable " Alessio Belle
2026-03-11  3:25   ` Claude review: " Claude Code Review Bot
2026-03-11  3:25 ` Claude review: drm/imagination: Drain " Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox