* BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb)
@ 2026-05-13 7:36 w15303746062
2026-05-13 7:50 ` Thomas Zimmermann
0 siblings, 1 reply; 4+ messages in thread
From: w15303746062 @ 2026-05-13 7:36 UTC (permalink / raw)
To: Dave Airlie, Thomas Zimmermann, Jocelyn Falempe
Cc: Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter,
dri-devel, linux-kernel, Mingyu Wang
From: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Hi all,
While analyzing the AST driver's hardware interactions using our custom device emulation and fuzzing framework (DevGen), we observed a severe soft lockup in the DRM driver.
Although the underlying trigger in our environment was an incomplete emulation of the ASPEED AHB bridge, this highlighted a critical defensive programming gap in the driver itself: a complete lack of timeout mechanisms in the hardware polling loops.
This issue causes a complete system hang (CPU stuck for 143s+) and leads to subsequent I/O starvation and system paralysis (e.g., jbd2 and systemd-journald blocked).
### Crash Log Snippet
watchdog: BUG: soft lockup - CPU#3 stuck for 143s! [systemd-udevd:162]
RIP: 0010:ioread32+0x0/0xa0
Call Trace:
<TASK>
__ast_read32
__ast_mindwm+0x4d/0x80 [ast]
ast_2500_patch_ahb+0x9d/0x120 [ast]
ast_detect_chip
ast_pci_probe+0xa1a/0xb70 [ast]
...
### Vulnerability Analysis
Even on bare-metal hardware, an unresponsive ASPEED chip, a PCIe bus fault, or unexpected hardware states can cause the kernel to hang forever in the following loops during initialization:
1. Inside `ast_2500_patch_ahb()`:
do {
__ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
data = __ast_mindwm(regs, 0x1e6e2000);
} while (data != 1); // <--- Infinite loop if hardware doesn't respond properly
2. Inside the underlying I/O accessors `__ast_mindwm()` and `__ast_moutdwm()`:
do {
data = __ast_read32(regs, 0xf004) & 0xffff0000;
} while (data != (r & 0xffff0000)); // <--- Infinite loop
### Proposed Fix Direction
To prevent the kernel from hanging indefinitely and to gracefully abort the probe (`-ENODEV`) upon hardware failure, these loops must implement a timeout mechanism (e.g., using `readx_poll_timeout` or a loop counter with `udelay`).
Interestingly, other functions in the same file (e.g., `mmc_test`) correctly implement a timeout counter (`if (++timeout > TIMEOUT) return false;`), but the initialization paths mentioned above blindly trust the hardware state.
We are reporting this defect so that the maintainers can decide the appropriate timeout thresholds and implement a safe fallback mechanism across the `ast` driver.
Please let us know if you need more information or the full dmesg log.
Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Best regards,
Mingyu Wang
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb) 2026-05-13 7:36 BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb) w15303746062 @ 2026-05-13 7:50 ` Thomas Zimmermann 2026-05-13 11:39 ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062 0 siblings, 1 reply; 4+ messages in thread From: Thomas Zimmermann @ 2026-05-13 7:50 UTC (permalink / raw) To: w15303746062, Dave Airlie, Jocelyn Falempe Cc: Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter, dri-devel, linux-kernel, Mingyu Wang Hi Am 13.05.26 um 09:36 schrieb w15303746062@163.com: > From: Mingyu Wang <25181214217@stu.xidian.edu.cn> > > Hi all, > > While analyzing the AST driver's hardware interactions using our custom device emulation and fuzzing framework (DevGen), we observed a severe soft lockup in the DRM driver. > > Although the underlying trigger in our environment was an incomplete emulation of the ASPEED AHB bridge, this highlighted a critical defensive programming gap in the driver itself: a complete lack of timeout mechanisms in the hardware polling loops. > > This issue causes a complete system hang (CPU stuck for 143s+) and leads to subsequent I/O starvation and system paralysis (e.g., jbd2 and systemd-journald blocked). > > ### Crash Log Snippet > > watchdog: BUG: soft lockup - CPU#3 stuck for 143s! [systemd-udevd:162] > RIP: 0010:ioread32+0x0/0xa0 > Call Trace: > <TASK> > __ast_read32 > __ast_mindwm+0x4d/0x80 [ast] > ast_2500_patch_ahb+0x9d/0x120 [ast] > ast_detect_chip > ast_pci_probe+0xa1a/0xb70 [ast] > ... > > ### Vulnerability Analysis > > Even on bare-metal hardware, an unresponsive ASPEED chip, a PCIe bus fault, or unexpected hardware states can cause the kernel to hang forever in the following loops during initialization: > > 1. Inside `ast_2500_patch_ahb()`: > do { > __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8); > data = __ast_mindwm(regs, 0x1e6e2000); > } while (data != 1); // <--- Infinite loop if hardware doesn't respond properly > > 2. Inside the underlying I/O accessors `__ast_mindwm()` and `__ast_moutdwm()`: > do { > data = __ast_read32(regs, 0xf004) & 0xffff0000; > } while (data != (r & 0xffff0000)); // <--- Infinite loop > > ### Proposed Fix Direction > > To prevent the kernel from hanging indefinitely and to gracefully abort the probe (`-ENODEV`) upon hardware failure, these loops must implement a timeout mechanism (e.g., using `readx_poll_timeout` or a loop counter with `udelay`). > > Interestingly, other functions in the same file (e.g., `mmc_test`) correctly implement a timeout counter (`if (++timeout > TIMEOUT) return false;`), but the initialization paths mentioned above blindly trust the hardware state. > > We are reporting this defect so that the maintainers can decide the appropriate timeout thresholds and implement a safe fallback mechanism across the `ast` driver. > > Please let us know if you need more information or the full dmesg log. Thanks for reporting. Will you send a patch? Best regards Thomas > > Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> > > Best regards, > Mingyu Wang > -- -- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg) ^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups 2026-05-13 7:50 ` Thomas Zimmermann @ 2026-05-13 11:39 ` w15303746062 2026-05-16 2:09 ` Claude review: " Claude Code Review Bot 0 siblings, 1 reply; 4+ messages in thread From: w15303746062 @ 2026-05-13 11:39 UTC (permalink / raw) To: tzimmermann, airlied, jfalempe Cc: maarten.lankhorst, mripard, airlied, simona, dri-devel, linux-kernel, Mingyu Wang From: Mingyu Wang <25181214217@stu.xidian.edu.cn> While validating the driver using DevGen (a framework that synthesizes virtual device models directly from driver source code via LLM guidance), a severe soft lockup was observed. The hardware polling loops in `__ast_mindwm`, `__ast_moutdwm`, and `ast_2500_patch_ahb` lack a timeout mechanism. On bare-metal systems, if the ASPEED chip becomes unresponsive or a PCIe bus fault occurs, the CPU will spin indefinitely in these loops. This results in a system hang, triggering the watchdog soft lockup and causing subsequent I/O starvation (e.g., blocking jbd2). Fix this by introducing a bounded loop with a safe timeout of approximately 100ms using `udelay(10)`. Using `udelay()` ensures that the fix remains safe even if these accessors are called from an atomic context or while holding spinlocks. If the hardware fails to respond, the loop breaks and emits a `WARN_ONCE`, allowing the kernel to degrade gracefully and preventing complete system paralysis. Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> --- Hi Thomas, Thanks for the prompt response and confirmation! Instead of just waiting for threshold suggestions, I have drafted this patch to address the soft lockup. Since changing the return types of `__ast_mindwm` and `__ast_moutdwm` to propagate error codes (e.g., `-ETIMEDOUT`) would require an intrusive refactoring across the entire AST driver, I took a more defensive, minimal-invasive approach. To avoid the risk of sleeping in an atomic context (in case these low-level I/O accessors are ever called under a spinlock), I used a bounded loop with `udelay(10)` and a maximum of 10000 iterations (approx. 100ms total timeout). If the ASPEED hardware completely fails to respond, it breaks the infinite loop, emits a `WARN_ONCE`, and prevents the CPU from halting the entire system. Please let me know if you think the 100ms threshold and the `udelay` approach are appropriate for this specific AHB/SCU hardware sequence. drivers/gpu/drm/ast/ast_2500.c | 8 +++++++- drivers/gpu/drm/ast/ast_post.c | 16 ++++++++++++++-- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/ast/ast_2500.c b/drivers/gpu/drm/ast/ast_2500.c index 2a52af0ded56..08d18f90201a 100644 --- a/drivers/gpu/drm/ast/ast_2500.c +++ b/drivers/gpu/drm/ast/ast_2500.c @@ -107,6 +107,7 @@ static const u32 ast2500_ddr4_1600_timing_table[REGTBL_NUM] = { void ast_2500_patch_ahb(void __iomem *regs) { u32 data; + int retries = 10000; /* ~100ms timeout */ /* Clear bus lock condition */ __ast_moutdwm(regs, 0x1e600000, 0xAEED1A03); @@ -136,7 +137,12 @@ void ast_2500_patch_ahb(void __iomem *regs) do { __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8); data = __ast_mindwm(regs, 0x1e6e2000); - } while (data != 1); + if (data == 1) + break; + udelay(10); + } while (--retries); + + WARN_ONCE(!retries, "ast: timeout waiting for AHB patch\n"); __ast_moutdwm(regs, 0x1e6e207c, 0x08000000); /* clear fast reset */ } diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c index b72914dbed38..66eb80925e27 100644 --- a/drivers/gpu/drm/ast/ast_post.c +++ b/drivers/gpu/drm/ast/ast_post.c @@ -37,13 +37,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r) { u32 data; + int retries = 10000; /* ~100ms timeout */ __ast_write32(regs, 0xf004, r & 0xffff0000); __ast_write32(regs, 0xf000, 0x1); do { data = __ast_read32(regs, 0xf004) & 0xffff0000; - } while (data != (r & 0xffff0000)); + if (data == (r & 0xffff0000)) + break; + udelay(10); + } while (--retries); + + WARN_ONCE(!retries, "ast: timeout reading from AHB/SCU\n"); return __ast_read32(regs, 0x10000 + (r & 0x0000ffff)); } @@ -51,13 +57,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r) void __ast_moutdwm(void __iomem *regs, u32 r, u32 v) { u32 data; + int retries = 10000; /* ~100ms timeout */ __ast_write32(regs, 0xf004, r & 0xffff0000); __ast_write32(regs, 0xf000, 0x1); do { data = __ast_read32(regs, 0xf004) & 0xffff0000; - } while (data != (r & 0xffff0000)); + if (data == (r & 0xffff0000)) + break; + udelay(10); + } while (--retries); + + WARN_ONCE(!retries, "ast: timeout writing to AHB/SCU\n"); __ast_write32(regs, 0x10000 + (r & 0x0000ffff), v); } -- 2.34.1 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Claude review: drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups 2026-05-13 11:39 ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062 @ 2026-05-16 2:09 ` Claude Code Review Bot 0 siblings, 0 replies; 4+ messages in thread From: Claude Code Review Bot @ 2026-05-16 2:09 UTC (permalink / raw) To: dri-devel-reviews Overall Series Review Subject: drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups Author: <w15303746062@163.com> Patches: 1 Reviewed: 2026-05-16T12:09:52.113846 --- **NAK — Patch is against a stale tree and the underlying code has been refactored away.** This single patch targets two files (`ast_post.c` and `ast_2500.c`) to add timeouts to polling loops in `__ast_mindwm`, `__ast_moutdwm`, and `ast_2500_patch_ahb`. However, the code has been substantially refactored in the current drm-next tree: 1. **`__ast_mindwm` and `__ast_moutdwm` no longer contain polling loops.** They were moved from `ast_post.c` to `ast_drv.c` and rewritten to use `__ast_selseg` / `__ast_rdseg32` / `__ast_wrseg32` helpers. The old segment-selection polling loop (`do { data = __ast_read32(...) & 0xffff0000; } while (data != ...)`) now lives in `__ast_selseg()` at `ast_drv.c:63-67`. 2. **`ast_2500_patch_ahb` was also refactored** — it now uses symbolic register names (`AST_REG_SCU000`, `AST_REG_SCU000_PROTECTION_KEY`, etc.) instead of raw hex addresses (`0x1e6e2000`, `0x1688A8A8`). The patch's context lines reference the old raw-hex code and won't apply. 3. **There is also `ast_moutdwm_poll`** at `ast_drv.c:105-115` which has yet another unbounded polling loop using `cpu_relax()` — the patch doesn't address this one. The underlying concern (unbounded polling loops that can cause soft lockups) remains valid, but a new patch would need to target the current code structure, addressing loops in: - `__ast_selseg()` (`ast_drv.c:63-67`) - `ast_moutdwm_poll()` (`ast_drv.c:112-114`) - `ast_2500_patch_ahb()` (`ast_2500.c:136-139`) --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-05-16 2:09 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-13 7:36 BUG: drm/ast: soft lockup due to missing timeout in hardware polling (ast_2500_patch_ahb) w15303746062 2026-05-13 7:50 ` Thomas Zimmermann 2026-05-13 11:39 ` [PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups w15303746062 2026-05-16 2:09 ` Claude review: " Claude Code Review Bot
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox