From mboxrd@z Thu Jan 1 00:00:00 1970
From: Claude Code Review Bot
To: dri-devel-reviews@example.com
Subject: Claude review: accel/amdxdna: Check for device hang on job timeout
Date: Sun, 12 Apr 2026 10:41:19 +1000
Message-ID: 
In-Reply-To: <20260409175826.195665-1-lizhi.hou@amd.com>
References: <20260409175826.195665-1-lizhi.hou@amd.com>
X-Mailer: Claude Code Patch Reviewer
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0

Patch Review

**Module parameter type mismatch:**

```c
uint tdr_timeout_ms = 2000;
module_param(tdr_timeout_ms, int, 0400);
```

The variable is declared as `uint` but registered with `module_param()` as `int`; it should be `module_param(tdr_timeout_ms, uint, 0400)`. The mismatch defeats the compile-time type check that `module_param()` performs, and because the `int` param ops parse and print signed values, large values would read back negative through sysfs and negative input would be accepted if the permissions ever allowed writes.

**Missing `static` keyword:**

```c
uint tdr_timeout_ms = 2000;
```

This global variable should be `static` to avoid polluting the kernel symbol namespace. All other module parameters in this file (e.g. `force_cmdlist` at the original line 26) are declared `static`.

**Global tdr_status for per-context schedulers (fundamental design flaw):**

The `tdr_status` field is placed on `struct amdxdna_dev_hdl` (a per-device structure), but each `amdxdna_hwctx` has its own `drm_gpu_scheduler` instance. The signal path:

```c
static inline void aie2_tdr_signal(struct amdxdna_dev *xdna)
{
	WRITE_ONCE(xdna->dev_handle->tdr_status, AIE2_TDR_SIGNALED);
}
```

is called from both `aie2_sched_notify()` (job completion) and `aie2_sched_job_run()` (job submission). Any job completing or being submitted on *any* context sets the device-wide `tdr_status` to `AIE2_TDR_SIGNALED`.
Then when a *different* context's job times out, `aie2_tdr_detect()` will see `AIE2_TDR_SIGNALED` and conclude forward progress was made, even though that specific context's job is genuinely hung. A busy device with many active contexts can therefore never detect a per-context hang. The `tdr_status` should be per-hwctx or per-scheduler, not per-device.

**Missing tdr_status initialization:**

The `tdr_status` field in `amdxdna_dev_hdl` is never explicitly initialized. Because the struct is allocated with `devm_kzalloc()`, it is zero-initialized to `AIE2_TDR_WAIT` (value 0), so the very first timeout is treated as a hang regardless of whether jobs have completed. This works only because `AIE2_TDR_WAIT == 0`; relying on the implicit zero is fragile, since reordering the enum silently changes the behavior. It would be clearer to explicitly initialize the field to `AIE2_TDR_SIGNALED` during device init so the first timeout gets the benefit of a forward-progress check.

**Race condition in the forward-progress protocol:**

The protocol is: `aie2_tdr_detect()` sets the status to `AIE2_TDR_WAIT`, then on the next timeout checks whether it is still `AIE2_TDR_WAIT`. But the signal side (`aie2_tdr_signal()`) uses only `WRITE_ONCE()` with no paired memory barrier. `READ_ONCE()`/`WRITE_ONCE()` prevent tearing, but there is no acquire/release ordering guaranteeing that the `WRITE_ONCE(..., AIE2_TDR_WAIT)` in `aie2_tdr_detect()` is visible to the CPU running `aie2_tdr_signal()` before the next job submission or completion. In practice this is likely fine on x86 (TSO), but it is not portable; an `smp_mb()` or `smp_store_release()`/`smp_load_acquire()` pairing would be correct.

**Timeout value change buried in a functionality patch:**

The patch silently changes the scheduler timeout from 60 seconds (`HWCTX_MAX_TIMEOUT`) to 2 seconds (`tdr_timeout_ms = 2000`).
The commit message says "the timeout interval is changed to 2 seconds which meets the userspace requirement", but a 30x reduction deserves more explanation. A 2-second timeout combined with the forward-progress mechanism means the timeout handler fires very aggressively on a busy device, which could add overhead. The relationship between the new value and the forward-progress mechanism should be clearly documented: is 2 s chosen because a healthy device is expected to complete at least one job every 2 seconds?

**Moved `job->job_timeout = true` changes semantics:**

In the original code:

```c
job->job_timeout = true;
// ...
mutex_lock(&xdna->dev_lock);
```

The patch moves the assignment to after the forward-progress check:

```c
guard(mutex)(&xdna->dev_lock);
if (!aie2_tdr_detect(xdna))
	return DRM_GPU_SCHED_STAT_NO_HANG;
// ...
reset_hwctx:
	job->job_timeout = true;
```

When `DRM_GPU_SCHED_STAT_NO_HANG` is returned, `job->job_timeout` now stays `false`. That is the correct behavior (the job has not actually timed out), but it is worth verifying that no other code path reads `job_timeout` in a way that depends on the old behavior.

**guard(mutex) conversion is mixed into the functional change:**

The patch also converts explicit `mutex_lock()`/`mutex_unlock()` calls to `guard(mutex)()`. The cleanup is welcome, and the early-return path (`return DRM_GPU_SCHED_STAT_NO_HANG;`) does benefit from guard-based scoping, but mixing mechanical refactoring with behavioral changes makes review harder; splitting it into a separate patch would be cleaner.

**tdr_timeout_ms = 0 disables the timeout but is not validated:**

The `MODULE_PARM_DESC` says `(0 = disable)`, and `msecs_to_jiffies(0)` would set the scheduler timeout to 0, which `drm_sched_init()` treats as `MAX_SCHEDULE_TIMEOUT` (infinite timeout).
This should be documented more carefully: with the timeout disabled, the TDR mechanism becomes useless and a truly hung device will never be detected.

---
Generated by Claude Code Patch Reviewer