BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

* BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb)
@ 2026-05-13  7:36 w15303746062
  2026-05-13  7:50 ` Thomas Zimmermann
  0 siblings, 1 reply; 4+ messages in thread
From: w15303746062 @ 2026-05-13  7:36 UTC (permalink / raw)
  To: Dave Airlie, Thomas Zimmermann, Jocelyn Falempe
  Cc: Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mingyu Wang

From: Mingyu Wang <25181214217@stu.xidian.edu.cn>

Hi all,

While analyzing the AST driver's hardware interactions using our custom device emulation and fuzzing framework (DevGen), we observed a severe soft lockup in the DRM driver. 

Although the underlying trigger in our environment was an incomplete emulation of the ASPEED AHB bridge, this highlighted a critical defensive programming gap in the driver itself: a complete lack of timeout mechanisms in the hardware polling loops.

This issue causes a complete system hang (CPU stuck for 143s+) and leads to subsequent I/O starvation and system paralysis (e.g., jbd2 and systemd-journald blocked).

### Crash Log Snippet

watchdog: BUG: soft lockup - CPU#3 stuck for 143s! [systemd-udevd:162]
RIP: 0010:ioread32+0x0/0xa0
Call Trace:
 <TASK>
 __ast_read32
 __ast_mindwm+0x4d/0x80 [ast]
 ast_2500_patch_ahb+0x9d/0x120 [ast]
 ast_detect_chip 
 ast_pci_probe+0xa1a/0xb70 [ast]
 ...

### Vulnerability Analysis

Even on bare-metal hardware, an unresponsive ASPEED chip, a PCIe bus fault, or unexpected hardware states can cause the kernel to hang forever in the following loops during initialization:

1. Inside `ast_2500_patch_ahb()`:
    do {
        __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
        data = __ast_mindwm(regs, 0x1e6e2000);
    } while (data != 1); // <--- Infinite loop if hardware doesn't respond properly

2. Inside the underlying I/O accessors `__ast_mindwm()` and `__ast_moutdwm()`:
    do {
        data = __ast_read32(regs, 0xf004) & 0xffff0000;
    } while (data != (r & 0xffff0000)); // <--- Infinite loop

### Proposed Fix Direction

To prevent the kernel from hanging indefinitely and to gracefully abort the probe (`-ENODEV`) upon hardware failure, these loops must implement a timeout mechanism (e.g., using `readx_poll_timeout` or a loop counter with `udelay`).

Interestingly, other functions in the same file (e.g., `mmc_test`) correctly implement a timeout counter (`if (++timeout > TIMEOUT) return false;`), but the initialization paths mentioned above blindly trust the hardware state.

We are reporting this defect so that the maintainers can decide the appropriate timeout thresholds and implement a safe fallback mechanism across the `ast` driver.

Please let us know if you need more information or the full dmesg log.

Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>

Best regards,
Mingyu Wang

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb)
  2026-05-13  7:36 BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb) w15303746062
@ 2026-05-13  7:50 ` Thomas Zimmermann
  2026-05-13 11:39   ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Zimmermann @ 2026-05-13  7:50 UTC (permalink / raw)
  To: w15303746062, Dave Airlie, Jocelyn Falempe
  Cc: Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mingyu Wang

Hi

Am 13.05.26 um 09:36 schrieb w15303746062@163.com:
> From: Mingyu Wang <25181214217@stu.xidian.edu.cn>
>
> Hi all,
>
> While analyzing the AST driver's hardware interactions using our custom device emulation and fuzzing framework (DevGen), we observed a severe soft lockup in the DRM driver.
>
> Although the underlying trigger in our environment was an incomplete emulation of the ASPEED AHB bridge, this highlighted a critical defensive programming gap in the driver itself: a complete lack of timeout mechanisms in the hardware polling loops.
>
> This issue causes a complete system hang (CPU stuck for 143s+) and leads to subsequent I/O starvation and system paralysis (e.g., jbd2 and systemd-journald blocked).
>
> ### Crash Log Snippet
>
> watchdog: BUG: soft lockup - CPU#3 stuck for 143s! [systemd-udevd:162]
> RIP: 0010:ioread32+0x0/0xa0
> Call Trace:
>   <TASK>
>   __ast_read32
>   __ast_mindwm+0x4d/0x80 [ast]
>   ast_2500_patch_ahb+0x9d/0x120 [ast]
>   ast_detect_chip
>   ast_pci_probe+0xa1a/0xb70 [ast]
>   ...
>
> ### Vulnerability Analysis
>
> Even on bare-metal hardware, an unresponsive ASPEED chip, a PCIe bus fault, or unexpected hardware states can cause the kernel to hang forever in the following loops during initialization:
>
> 1. Inside `ast_2500_patch_ahb()`:
>      do {
>          __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
>          data = __ast_mindwm(regs, 0x1e6e2000);
>      } while (data != 1); // <--- Infinite loop if hardware doesn't respond properly
>
> 2. Inside the underlying I/O accessors `__ast_mindwm()` and `__ast_moutdwm()`:
>      do {
>          data = __ast_read32(regs, 0xf004) & 0xffff0000;
>      } while (data != (r & 0xffff0000)); // <--- Infinite loop
>
> ### Proposed Fix Direction
>
> To prevent the kernel from hanging indefinitely and to gracefully abort the probe (`-ENODEV`) upon hardware failure, these loops must implement a timeout mechanism (e.g., using `readx_poll_timeout` or a loop counter with `udelay`).
>
> Interestingly, other functions in the same file (e.g., `mmc_test`) correctly implement a timeout counter (`if (++timeout > TIMEOUT) return false;`), but the initialization paths mentioned above blindly trust the hardware state.
>
> We are reporting this defect so that the maintainers can decide the appropriate timeout thresholds and implement a safe fallback mechanism across the `ast` driver.
>
> Please let us know if you need more information or the full dmesg log.

Thanks for reporting. Will you send a patch?

Best regards
Thomas

>
> Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
>
> Best regards,
> Mingyu Wang
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups
  2026-05-13  7:50 ` Thomas Zimmermann
@ 2026-05-13 11:39   ` w15303746062
  2026-05-16  2:09     ` Claude review: " Claude Code Review Bot
  0 siblings, 1 reply; 4+ messages in thread
From: w15303746062 @ 2026-05-13 11:39 UTC (permalink / raw)
  To: tzimmermann, airlied, jfalempe
  Cc: maarten.lankhorst, mripard, airlied, simona, dri-devel,
	linux-kernel, Mingyu Wang

From: Mingyu Wang <25181214217@stu.xidian.edu.cn>

While validating the driver using DevGen (a framework that synthesizes
virtual device models directly from driver source code via LLM guidance),
a severe soft lockup was observed.

The hardware polling loops in `__ast_mindwm`, `__ast_moutdwm`, and
`ast_2500_patch_ahb` lack a timeout mechanism. On bare-metal systems,
if the ASPEED chip becomes unresponsive or a PCIe bus fault occurs,
the CPU will spin indefinitely in these loops. This results in a system
hang, triggering the watchdog soft lockup and causing subsequent I/O
starvation (e.g., blocking jbd2).

Fix this by introducing a bounded loop with a safe timeout of
approximately 100ms using `udelay(10)`. Using `udelay()` ensures
that the fix remains safe even if these accessors are called from
an atomic context or while holding spinlocks. If the hardware fails
to respond, the loop breaks and emits a `WARN_ONCE`, allowing the
kernel to degrade gracefully and preventing complete system paralysis.

Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
---
Hi Thomas,

Thanks for the prompt response and confirmation!

Instead of just waiting for threshold suggestions, I have drafted this 
patch to address the soft lockup. Since changing the return types of 
`__ast_mindwm` and `__ast_moutdwm` to propagate error codes (e.g., 
`-ETIMEDOUT`) would require an intrusive refactoring across the entire 
AST driver, I took a more defensive, minimal-invasive approach.

To avoid the risk of sleeping in an atomic context (in case these 
low-level I/O accessors are ever called under a spinlock), I used 
a bounded loop with `udelay(10)` and a maximum of 10000 iterations 
(approx. 100ms total timeout). If the ASPEED hardware completely 
fails to respond, it breaks the infinite loop, emits a `WARN_ONCE`, 
and prevents the CPU from halting the entire system.

Please let me know if you think the 100ms threshold and the `udelay` 
approach are appropriate for this specific AHB/SCU hardware sequence.

 drivers/gpu/drm/ast/ast_2500.c |  8 +++++++-
 drivers/gpu/drm/ast/ast_post.c | 16 ++++++++++++++--
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_2500.c b/drivers/gpu/drm/ast/ast_2500.c
index 2a52af0ded56..08d18f90201a 100644
--- a/drivers/gpu/drm/ast/ast_2500.c
+++ b/drivers/gpu/drm/ast/ast_2500.c
@@ -107,6 +107,7 @@ static const u32 ast2500_ddr4_1600_timing_table[REGTBL_NUM] = {
 void ast_2500_patch_ahb(void __iomem *regs)
 {
 	u32 data;
+	int retries = 10000; /* ~100ms timeout */

 	/* Clear bus lock condition */
 	__ast_moutdwm(regs, 0x1e600000, 0xAEED1A03);
@@ -136,7 +137,12 @@ void ast_2500_patch_ahb(void __iomem *regs)
 	do {
 		__ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
 		data = __ast_mindwm(regs, 0x1e6e2000);
-	} while (data != 1);
+		if (data == 1)
+			break;
+		udelay(10);
+	} while (--retries);
+
+	WARN_ONCE(!retries, "ast: timeout waiting for AHB patch\n");

 	__ast_moutdwm(regs, 0x1e6e207c, 0x08000000); /* clear fast reset */
 }
diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index b72914dbed38..66eb80925e27 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -37,13 +37,19 @@
 u32 __ast_mindwm(void __iomem *regs, u32 r)
 {
 	u32 data;
+	int retries = 10000; /* ~100ms timeout */

 	__ast_write32(regs, 0xf004, r & 0xffff0000);
 	__ast_write32(regs, 0xf000, 0x1);

 	do {
 		data = __ast_read32(regs, 0xf004) & 0xffff0000;
-	} while (data != (r & 0xffff0000));
+		if (data == (r & 0xffff0000))
+			break;
+		udelay(10);
+	} while (--retries);
+
+	WARN_ONCE(!retries, "ast: timeout reading from AHB/SCU\n");

 	return __ast_read32(regs, 0x10000 + (r & 0x0000ffff));
 }
@@ -51,13 +57,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r)
 void __ast_moutdwm(void __iomem *regs, u32 r, u32 v)
 {
 	u32 data;
+	int retries = 10000; /* ~100ms timeout */

 	__ast_write32(regs, 0xf004, r & 0xffff0000);
 	__ast_write32(regs, 0xf000, 0x1);

 	do {
 		data = __ast_read32(regs, 0xf004) & 0xffff0000;
-	} while (data != (r & 0xffff0000));
+		if (data == (r & 0xffff0000))
+			break;
+		udelay(10);
+	} while (--retries);
+
+	WARN_ONCE(!retries, "ast: timeout writing to AHB/SCU\n");

 	__ast_write32(regs, 0x10000 + (r & 0x0000ffff), v);
 }
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Claude review: drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups
  2026-05-13 11:39   ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062
@ 2026-05-16  2:09     ` Claude Code Review Bot
  0 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  2:09 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups
Author: <w15303746062@163.com>
Patches: 1
Reviewed: 2026-05-16T12:09:52.113846

---

**NAK — Patch is against a stale tree and the underlying code has been refactored away.**

This single patch targets two files (`ast_post.c` and `ast_2500.c`) to add timeouts to polling loops in `__ast_mindwm`, `__ast_moutdwm`, and `ast_2500_patch_ahb`. However, the code has been substantially refactored in the current drm-next tree:

1. **`__ast_mindwm` and `__ast_moutdwm` no longer contain polling loops.** They were moved from `ast_post.c` to `ast_drv.c` and rewritten to use `__ast_selseg` / `__ast_rdseg32` / `__ast_wrseg32` helpers. The old segment-selection polling loop (`do { data = __ast_read32(...) & 0xffff0000; } while (data != ...)`) now lives in `__ast_selseg()` at `ast_drv.c:63-67`.

2. **`ast_2500_patch_ahb` was also refactored** — it now uses symbolic register names (`AST_REG_SCU000`, `AST_REG_SCU000_PROTECTION_KEY`, etc.) instead of raw hex addresses (`0x1e6e2000`, `0x1688A8A8`). The patch's context lines reference the old raw-hex code and won't apply.

3. **There is also `ast_moutdwm_poll`** at `ast_drv.c:105-115` which has yet another unbounded polling loop using `cpu_relax()` — the patch doesn't address this one.

The underlying concern (unbounded polling loops that can cause soft lockups) remains valid, but a new patch would need to target the current code structure, addressing loops in:
- `__ast_selseg()` (`ast_drv.c:63-67`)
- `ast_moutdwm_poll()` (`ast_drv.c:112-114`)
- `ast_2500_patch_ahb()` (`ast_2500.c:136-139`)

---

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-05-16  2:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-13  7:36 BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb) w15303746062
2026-05-13  7:50 ` Thomas Zimmermann
2026-05-13 11:39   ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062
2026-05-16  2:09     ` Claude review: " Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox