From mboxrd@z Thu Jan 1 00:00:00 1970
From: Claude Code Review Bot
To: dri-devel-reviews@example.com
Subject: Claude review: accel/amdxdna: Check for device hang on job timeout
Date: Sun, 12 Apr 2026 10:41:19 +1000
Message-ID: 
In-Reply-To: <20260409175826.195665-1-lizhi.hou@amd.com>
References: <20260409175826.195665-1-lizhi.hou@amd.com>
X-Mailer: Claude Code Patch Reviewer
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0

Patch Review

**Module parameter type mismatch:**

```c
uint tdr_timeout_ms = 2000;
module_param(tdr_timeout_ms, int, 0400);
```

The variable is declared as `uint` but registered with `module_param()` as `int`; it should be `module_param(tdr_timeout_ms, uint, 0400)`. The mismatch defeats the compile-time type check that `module_param()` performs, and because the `int` param ops parse and print signed values, large values would read back negative through sysfs and negative input would be accepted if the permissions ever allowed writes.

**Missing `static` keyword:**

```c
uint tdr_timeout_ms = 2000;
```

This global variable should be `static` to avoid polluting the kernel symbol namespace. All other module parameters in this file (e.g. `force_cmdlist` at the original line 26) are declared `static`.

**Global tdr_status for per-context schedulers (fundamental design flaw):**

The `tdr_status` field is placed on `struct amdxdna_dev_hdl` (a per-device structure), but each `amdxdna_hwctx` has its own `drm_gpu_scheduler` instance. The signal path:

```c
static inline void aie2_tdr_signal(struct amdxdna_dev *xdna)
{
	WRITE_ONCE(xdna->dev_handle->tdr_status, AIE2_TDR_SIGNALED);
}
```

is called from both `aie2_sched_notify()` (job completion) and `aie2_sched_job_run()` (job submission). Any job completing or being submitted on *any* context sets the device-wide `tdr_status` to `AIE2_TDR_SIGNALED`.
Then when a *different* context's job times out, `aie2_tdr_detect()` will see `AIE2_TDR_SIGNALED` and conclude forward progress was made, even though that specific context's job is genuinely hung. A busy device with many active contexts can therefore never detect a per-context hang. The `tdr_status` should be per-hwctx or per-scheduler, not per-device.

**Missing tdr_status initialization:**

The `tdr_status` field in `amdxdna_dev_hdl` is never explicitly initialized. Because the struct is allocated with `devm_kzalloc()`, it is zero-initialized to `AIE2_TDR_WAIT` (value 0), so the very first timeout is treated as a hang regardless of whether jobs have completed. This works only because `AIE2_TDR_WAIT == 0`; relying on the implicit zero is fragile, since reordering the enum silently changes the behavior. It would be clearer to explicitly initialize the field to `AIE2_TDR_SIGNALED` during device init so the first timeout gets the benefit of a forward-progress check.

**Race condition in the forward-progress protocol:**

The protocol is: `aie2_tdr_detect()` sets the status to `AIE2_TDR_WAIT`, then on the next timeout checks whether it is still `AIE2_TDR_WAIT`. But the signal side (`aie2_tdr_signal()`) uses only `WRITE_ONCE()` with no paired memory barrier. `READ_ONCE()`/`WRITE_ONCE()` prevent tearing, but there is no acquire/release ordering guaranteeing that the `WRITE_ONCE(..., AIE2_TDR_WAIT)` in `aie2_tdr_detect()` is visible to the CPU running `aie2_tdr_signal()` before the next job submission or completion. In practice this is likely fine on x86 (TSO), but it is not portable; an `smp_mb()` or `smp_store_release()`/`smp_load_acquire()` pairing would be correct.

**Timeout value change buried in a functionality patch:**

The patch silently changes the scheduler timeout from 60 seconds (`HWCTX_MAX_TIMEOUT`) to 2 seconds (`tdr_timeout_ms = 2000`).
The commit message says "the timeout interval is changed to 2 seconds which meets the userspace requirement", but a 30x reduction deserves more explanation. A 2-second timeout combined with the forward-progress mechanism means the timeout handler fires very aggressively on a busy device, which could add overhead. The relationship between the new value and the forward-progress mechanism should be clearly documented: is 2 s chosen because a healthy device is expected to complete at least one job every 2 seconds?

**Moved `job->job_timeout = true` changes semantics:**

In the original code:

```c
job->job_timeout = true;
// ...
mutex_lock(&xdna->dev_lock);
```

The patch moves the assignment to after the forward-progress check:

```c
guard(mutex)(&xdna->dev_lock);
if (!aie2_tdr_detect(xdna))
	return DRM_GPU_SCHED_STAT_NO_HANG;
// ...
reset_hwctx:
	job->job_timeout = true;
```

When `DRM_GPU_SCHED_STAT_NO_HANG` is returned, `job->job_timeout` now stays `false`. That is the correct behavior (the job has not actually timed out), but it is worth verifying that no other code path reads `job_timeout` in a way that depends on the old behavior.

**guard(mutex) conversion is mixed into the functional change:**

The patch also converts explicit `mutex_lock()`/`mutex_unlock()` calls to `guard(mutex)()`. The cleanup is welcome, and the early-return path (`return DRM_GPU_SCHED_STAT_NO_HANG;`) does benefit from guard-based scoping, but mixing mechanical refactoring with behavioral changes makes review harder; splitting it into a separate patch would be cleaner.

**tdr_timeout_ms = 0 disables the timeout but is not validated:**

The `MODULE_PARM_DESC` says `(0 = disable)`, and `msecs_to_jiffies(0)` would set the scheduler timeout to 0, which `drm_sched_init()` treats as `MAX_SCHEDULE_TIMEOUT` (infinite timeout).
This should be documented more carefully: with the timeout disabled, the TDR mechanism becomes useless and a truly hung device will never be detected.

---
Generated by Claude Code Patch Reviewer