public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH V1] accel/amdxdna: Check for device hang on job timeout
@ 2026-04-09 17:58 Lizhi Hou
  2026-04-10 20:37 ` Mario Limonciello
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Lizhi Hou @ 2026-04-09 17:58 UTC (permalink / raw)
  To: ogabbay, quic_jhugo, dri-devel, mario.limonciello,
	maciej.falkowski
  Cc: Lizhi Hou, linux-kernel, max.zhen, sonal.santan

A job timeout does not necessarily indicate that the device is hung, as
it may still be processing other jobs.

Track whether any jobs have been successfully submitted or completed,
and use this information to determine if the device is making forward
progress. If so, return DRM_GPU_SCHED_STAT_NO_HANG instead of treating
the timeout as a device hang.

Meanwhile, the timeout interval is changed to 2 seconds, which meets
the userspace requirement.

Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
---
 drivers/accel/amdxdna/aie2_ctx.c | 36 +++++++++++++++++++++++++++-----
 drivers/accel/amdxdna/aie2_pci.h |  6 ++++++
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/amdxdna/aie2_ctx.c b/drivers/accel/amdxdna/aie2_ctx.c
index f97755d60fa3..ddcf06a6b80c 100644
--- a/drivers/accel/amdxdna/aie2_ctx.c
+++ b/drivers/accel/amdxdna/aie2_ctx.c
@@ -27,7 +27,9 @@ static bool force_cmdlist = true;
 module_param(force_cmdlist, bool, 0600);
 MODULE_PARM_DESC(force_cmdlist, "Force use command list (Default true)");
 
-#define HWCTX_MAX_TIMEOUT	60000 /* milliseconds */
+uint tdr_timeout_ms = 2000;
+module_param(tdr_timeout_ms, int, 0400);
+MODULE_PARM_DESC(tdr_timeout_ms, "TDR (Timeout Detection and Recovery) timeout in milliseconds (0 = disable)");
 
 struct aie2_ctx_health {
 	struct amdxdna_ctx_health header;
@@ -39,6 +41,24 @@ struct aie2_ctx_health {
 	u32 fatal_error_app_module;
 };
 
+static inline void aie2_tdr_signal(struct amdxdna_dev *xdna)
+{
+	WRITE_ONCE(xdna->dev_handle->tdr_status, AIE2_TDR_SIGNALED);
+}
+
+static bool aie2_tdr_detect(struct amdxdna_dev *xdna)
+{
+	struct amdxdna_dev_hdl *ndev = xdna->dev_handle;
+
+	if (READ_ONCE(ndev->tdr_status) == AIE2_TDR_WAIT) {
+		XDNA_ERR(xdna, "TDR timeout detected");
+		return true;
+	}
+
+	WRITE_ONCE(ndev->tdr_status, AIE2_TDR_WAIT);
+	return false;
+}
+
 static void aie2_job_release(struct kref *ref)
 {
 	struct amdxdna_sched_job *job;
@@ -177,6 +197,7 @@ aie2_sched_notify(struct amdxdna_sched_job *job)
 
 	trace_xdna_job(&job->base, job->hwctx->name, "signaled fence", job->seq);
 
+	aie2_tdr_signal(job->hwctx->client->xdna);
 	job->hwctx->priv->completed++;
 	dma_fence_signal(fence);
 
@@ -385,6 +406,8 @@ aie2_sched_job_run(struct drm_sched_job *sched_job)
 		aie2_job_put(job);
 		mmput(job->mm);
 		fence = ERR_PTR(ret);
+	} else {
+		aie2_tdr_signal(hwctx->client->xdna);
 	}
 	trace_xdna_job(sched_job, hwctx->name, "sent to device", job->seq);
 
@@ -415,9 +438,12 @@ aie2_sched_job_timedout(struct drm_sched_job *sched_job)
 
 	xdna = hwctx->client->xdna;
 	trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq);
-	job->job_timeout = true;
 
-	mutex_lock(&xdna->dev_lock);
+	guard(mutex)(&xdna->dev_lock);
+
+	if (!aie2_tdr_detect(xdna))
+		return DRM_GPU_SCHED_STAT_NO_HANG;
+
 	report = kzalloc_obj(*report);
 	if (!report)
 		goto reset_hwctx;
@@ -429,10 +455,10 @@ aie2_sched_job_timedout(struct drm_sched_job *sched_job)
 		job->aie2_job_health = report;
 
 reset_hwctx:
+	job->job_timeout = true;
 	aie2_hwctx_stop(xdna, hwctx, sched_job);
 
 	aie2_hwctx_restart(xdna, hwctx);
-	mutex_unlock(&xdna->dev_lock);
 
 	return DRM_GPU_SCHED_STAT_RESET;
 }
@@ -608,7 +634,7 @@ int aie2_hwctx_init(struct amdxdna_hwctx *hwctx)
 		.ops = &sched_ops,
 		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
 		.credit_limit = HWCTX_MAX_CMDS,
-		.timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT),
+		.timeout = msecs_to_jiffies(tdr_timeout_ms),
 		.name = "amdxdna_js",
 		.dev = xdna->ddev.dev,
 	};
diff --git a/drivers/accel/amdxdna/aie2_pci.h b/drivers/accel/amdxdna/aie2_pci.h
index 7c308672b5fe..81564483cb16 100644
--- a/drivers/accel/amdxdna/aie2_pci.h
+++ b/drivers/accel/amdxdna/aie2_pci.h
@@ -165,6 +165,11 @@ struct aie2_exec_msg_ops {
 	u32 (*get_chain_msg_op)(u32 cmd_op);
 };
 
+enum aie2_tdr_status {
+	AIE2_TDR_WAIT,
+	AIE2_TDR_SIGNALED,
+};
+
 struct amdxdna_dev_hdl {
 	struct aie_device		aie;
 	const struct amdxdna_dev_priv	*priv;
@@ -197,6 +202,7 @@ struct amdxdna_dev_hdl {
 	u32				hwctx_num;
 
 	struct amdxdna_async_error	last_async_err;
+	enum aie2_tdr_status		tdr_status;
 };
 
 struct aie2_hw_ops {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH V1] accel/amdxdna: Check for device hang on job timeout
  2026-04-09 17:58 [PATCH V1] accel/amdxdna: Check for device hang on job timeout Lizhi Hou
@ 2026-04-10 20:37 ` Mario Limonciello
  2026-04-12  0:41 ` Claude review: " Claude Code Review Bot
  2026-04-12  0:41 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Mario Limonciello @ 2026-04-10 20:37 UTC (permalink / raw)
  To: Lizhi Hou, ogabbay, quic_jhugo, dri-devel, maciej.falkowski
  Cc: linux-kernel, max.zhen, sonal.santan



On 4/9/26 12:58, Lizhi Hou wrote:
> A job timeout does not necessarily indicate that the device is hung, as
> it may still be processing other jobs.
> 
> Track whether any jobs have been successfully submitted or completed,
> and use this information to determine if the device is making forward
> progress. If so, return DRM_GPU_SCHED_STAT_NO_HANG instead of treating
> the timeout as a device hang.
> 
> In the meanwhile the timeout interval is changed to 2 seconds which meets
> the userspace requirement.
> 
> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>

> ---
>   drivers/accel/amdxdna/aie2_ctx.c | 36 +++++++++++++++++++++++++++-----
>   drivers/accel/amdxdna/aie2_pci.h |  6 ++++++
>   2 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/accel/amdxdna/aie2_ctx.c b/drivers/accel/amdxdna/aie2_ctx.c
> index f97755d60fa3..ddcf06a6b80c 100644
> --- a/drivers/accel/amdxdna/aie2_ctx.c
> +++ b/drivers/accel/amdxdna/aie2_ctx.c
> @@ -27,7 +27,9 @@ static bool force_cmdlist = true;
>   module_param(force_cmdlist, bool, 0600);
>   MODULE_PARM_DESC(force_cmdlist, "Force use command list (Default true)");
>   
> -#define HWCTX_MAX_TIMEOUT	60000 /* milliseconds */
> +uint tdr_timeout_ms = 2000;
> +module_param(tdr_timeout_ms, int, 0400);
> +MODULE_PARM_DESC(tdr_timeout_ms, "TDR (Timeout Detection and Recovery) timeout in milliseconds (0 = disable)");
>   
>   struct aie2_ctx_health {
>   	struct amdxdna_ctx_health header;
> @@ -39,6 +41,24 @@ struct aie2_ctx_health {
>   	u32 fatal_error_app_module;
>   };
>   
> +static inline void aie2_tdr_signal(struct amdxdna_dev *xdna)
> +{
> +	WRITE_ONCE(xdna->dev_handle->tdr_status, AIE2_TDR_SIGNALED);
> +}
> +
> +static bool aie2_tdr_detect(struct amdxdna_dev *xdna)
> +{
> +	struct amdxdna_dev_hdl *ndev = xdna->dev_handle;
> +
> +	if (READ_ONCE(ndev->tdr_status) == AIE2_TDR_WAIT) {
> +		XDNA_ERR(xdna, "TDR timeout detected");
> +		return true;
> +	}
> +
> +	WRITE_ONCE(ndev->tdr_status, AIE2_TDR_WAIT);
> +	return false;
> +}
> +
>   static void aie2_job_release(struct kref *ref)
>   {
>   	struct amdxdna_sched_job *job;
> @@ -177,6 +197,7 @@ aie2_sched_notify(struct amdxdna_sched_job *job)
>   
>   	trace_xdna_job(&job->base, job->hwctx->name, "signaled fence", job->seq);
>   
> +	aie2_tdr_signal(job->hwctx->client->xdna);
>   	job->hwctx->priv->completed++;
>   	dma_fence_signal(fence);
>   
> @@ -385,6 +406,8 @@ aie2_sched_job_run(struct drm_sched_job *sched_job)
>   		aie2_job_put(job);
>   		mmput(job->mm);
>   		fence = ERR_PTR(ret);
> +	} else {
> +		aie2_tdr_signal(hwctx->client->xdna);
>   	}
>   	trace_xdna_job(sched_job, hwctx->name, "sent to device", job->seq);
>   
> @@ -415,9 +438,12 @@ aie2_sched_job_timedout(struct drm_sched_job *sched_job)
>   
>   	xdna = hwctx->client->xdna;
>   	trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq);
> -	job->job_timeout = true;
>   
> -	mutex_lock(&xdna->dev_lock);
> +	guard(mutex)(&xdna->dev_lock);
> +
> +	if (!aie2_tdr_detect(xdna))
> +		return DRM_GPU_SCHED_STAT_NO_HANG;
> +
>   	report = kzalloc_obj(*report);
>   	if (!report)
>   		goto reset_hwctx;
> @@ -429,10 +455,10 @@ aie2_sched_job_timedout(struct drm_sched_job *sched_job)
>   		job->aie2_job_health = report;
>   
>   reset_hwctx:
> +	job->job_timeout = true;
>   	aie2_hwctx_stop(xdna, hwctx, sched_job);
>   
>   	aie2_hwctx_restart(xdna, hwctx);
> -	mutex_unlock(&xdna->dev_lock);
>   
>   	return DRM_GPU_SCHED_STAT_RESET;
>   }
> @@ -608,7 +634,7 @@ int aie2_hwctx_init(struct amdxdna_hwctx *hwctx)
>   		.ops = &sched_ops,
>   		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
>   		.credit_limit = HWCTX_MAX_CMDS,
> -		.timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT),
> +		.timeout = msecs_to_jiffies(tdr_timeout_ms),
>   		.name = "amdxdna_js",
>   		.dev = xdna->ddev.dev,
>   	};
> diff --git a/drivers/accel/amdxdna/aie2_pci.h b/drivers/accel/amdxdna/aie2_pci.h
> index 7c308672b5fe..81564483cb16 100644
> --- a/drivers/accel/amdxdna/aie2_pci.h
> +++ b/drivers/accel/amdxdna/aie2_pci.h
> @@ -165,6 +165,11 @@ struct aie2_exec_msg_ops {
>   	u32 (*get_chain_msg_op)(u32 cmd_op);
>   };
>   
> +enum aie2_tdr_status {
> +	AIE2_TDR_WAIT,
> +	AIE2_TDR_SIGNALED,
> +};
> +
>   struct amdxdna_dev_hdl {
>   	struct aie_device		aie;
>   	const struct amdxdna_dev_priv	*priv;
> @@ -197,6 +202,7 @@ struct amdxdna_dev_hdl {
>   	u32				hwctx_num;
>   
>   	struct amdxdna_async_error	last_async_err;
> +	enum aie2_tdr_status		tdr_status;
>   };
>   
>   struct aie2_hw_ops {


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Claude review: accel/amdxdna: Check for device hang on job timeout
  2026-04-09 17:58 [PATCH V1] accel/amdxdna: Check for device hang on job timeout Lizhi Hou
  2026-04-10 20:37 ` Mario Limonciello
  2026-04-12  0:41 ` Claude review: " Claude Code Review Bot
@ 2026-04-12  0:41 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:41 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: accel/amdxdna: Check for device hang on job timeout
Author: Lizhi Hou <lizhi.hou@amd.com>
Patches: 2
Reviewed: 2026-04-12T10:41:19.407531

---

This is a single patch that adds Timeout Detection and Recovery (TDR) forward-progress checking to the amdxdna accelerator driver. The core idea is sound: instead of treating every scheduler timeout as a device hang, the driver checks whether the device has made forward progress (submitted or completed any job) since the last timeout check. If it has, the timeout handler returns `DRM_GPU_SCHED_STAT_NO_HANG` rather than resetting the context.

However, the implementation has several significant issues that should be addressed before merging:

1. **Global-level tracking for per-context schedulers** — the `tdr_status` field lives on the global `amdxdna_dev_hdl`, but each hardware context has its own `drm_gpu_scheduler`. A job completing in context A will prevent context B from ever detecting a hang.
2. **Module parameter type mismatch** — declared `uint` but registered as `int`.
3. **Non-static global variable** pollutes the module namespace.
4. **Missing initialization** of `tdr_status`.
5. **The timeout change from 60s to 2s is smuggled into this patch** with minimal justification.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Claude review: accel/amdxdna: Check for device hang on job timeout
  2026-04-09 17:58 [PATCH V1] accel/amdxdna: Check for device hang on job timeout Lizhi Hou
  2026-04-10 20:37 ` Mario Limonciello
@ 2026-04-12  0:41 ` Claude Code Review Bot
  2026-04-12  0:41 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  0:41 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Module parameter type mismatch:**
```c
uint tdr_timeout_ms = 2000;
module_param(tdr_timeout_ms, int, 0400);
```
The variable is declared as `uint` but registered with `module_param()` as `int`. This is a type mismatch — it should be `module_param(tdr_timeout_ms, uint, 0400)`. Using `int` with a `uint` variable will cause the sysfs read to interpret the value incorrectly on some architectures and also means negative values could be written if permissions allowed it.

**Missing `static` keyword:**
```c
uint tdr_timeout_ms = 2000;
```
This global variable should be `static` to avoid polluting the kernel symbol namespace. All other module parameters in this file (e.g. `force_cmdlist` at the original line 26) are declared `static`.

**Global tdr_status for per-context schedulers — fundamental design flaw:**
The `tdr_status` field is placed on `struct amdxdna_dev_hdl` (a per-device structure), but each `amdxdna_hwctx` has its own `drm_gpu_scheduler` instance. The signal path:
```c
static inline void aie2_tdr_signal(struct amdxdna_dev *xdna)
{
    WRITE_ONCE(xdna->dev_handle->tdr_status, AIE2_TDR_SIGNALED);
}
```
is called from both `aie2_sched_notify()` (job completion) and `aie2_sched_job_run()` (job submission). Any job completing or being submitted on *any* context will set the global `tdr_status` to `AIE2_TDR_SIGNALED`. Then when a *different* context's job times out, `aie2_tdr_detect()` will see `AIE2_TDR_SIGNALED` and conclude forward progress was made — even though that specific context's job is truly hung. This means a busy device with many active contexts can never detect a per-context hang. The `tdr_status` should either be per-hwctx or per-scheduler, not per-device.

**Missing tdr_status initialization:**
The `tdr_status` field in `amdxdna_dev_hdl` is not explicitly initialized. Since the struct is allocated with `devm_kzalloc()`, it will be zero-initialized to `AIE2_TDR_WAIT` (value 0), which means the very first timeout will be treated as a hang regardless of whether jobs have completed. While it works out since `AIE2_TDR_WAIT = 0`, relying on this implicit initialization is fragile — if someone reorders the enum, the behavior silently changes. It would be clearer to explicitly initialize it to `AIE2_TDR_SIGNALED` during device init so the first timeout gets the benefit of a forward-progress check.

**Race condition in the forward-progress protocol:**
The protocol is: `aie2_tdr_detect()` sets status to `AIE2_TDR_WAIT`, then on the next timeout checks if it's still `AIE2_TDR_WAIT`. But the signal side (`aie2_tdr_signal`) uses only `WRITE_ONCE` with no memory barrier pairing. The `READ_ONCE`/`WRITE_ONCE` in `aie2_tdr_detect` ensure no tearing, but there's no acquire/release ordering to guarantee that the `WRITE_ONCE(... AIE2_TDR_WAIT)` in `aie2_tdr_detect` is visible to the CPU running `aie2_tdr_signal` before the next job submission/completion. In practice this is likely fine on x86 (TSO), but is not portable — a `smp_mb()` or use of `smp_store_release()`/`smp_load_acquire()` would be correct.

**Timeout value change buried in a functionality patch:**
The patch silently changes the scheduler timeout from 60 seconds (`HWCTX_MAX_TIMEOUT`) to 2 seconds (`tdr_timeout_ms = 2000`). The commit message says "the timeout interval is changed to 2 seconds which meets the userspace requirement" but this is a 30x reduction that deserves more explanation. A 2-second timeout with the forward-progress mechanism means the driver will poll via timeouts very aggressively, which could increase overhead. The relationship between this new timeout value and the forward-progress mechanism should be clearly documented — is 2s chosen because the device is expected to complete at least one job every 2s when healthy?

**Moved `job->job_timeout = true` changes semantics:**
In the original code:
```c
job->job_timeout = true;
// ...
mutex_lock(&xdna->dev_lock);
```
The patch moves this to after the forward-progress check:
```c
guard(mutex)(&xdna->dev_lock);

if (!aie2_tdr_detect(xdna))
    return DRM_GPU_SCHED_STAT_NO_HANG;
// ...
reset_hwctx:
    job->job_timeout = true;
```
This means that when `DRM_GPU_SCHED_STAT_NO_HANG` is returned, `job->job_timeout` remains `false`. This is correct behavior (the job hasn't actually timed out), but worth verifying that no other code path checks `job_timeout` in a way that depends on the old behavior.

**guard(mutex) conversion is mixed into the functional change:**
The patch converts from explicit `mutex_lock()`/`mutex_unlock()` to `guard(mutex)()`. While this is a good cleanup, mixing mechanical refactoring with behavioral changes makes review harder. The early return path (`return DRM_GPU_SCHED_STAT_NO_HANG`) does benefit from the guard-based scoping, so it's somewhat justified, but separating it would be cleaner.

**tdr_timeout_ms = 0 disables timeout but is not validated:**
The `MODULE_PARM_DESC` says `(0 = disable)`, and `msecs_to_jiffies(0)` would set the scheduler timeout to 0, which for `drm_sched_init` means `MAX_SCHEDULE_TIMEOUT` (infinite timeout). This should be documented more carefully — with timeout disabled, the TDR mechanism becomes useless and a truly hung device will never be detected.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-04-12  0:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-09 17:58 [PATCH V1] accel/amdxdna: Check for device hang on job timeout Lizhi Hou
2026-04-10 20:37 ` Mario Limonciello
2026-04-12  0:41 ` Claude review: " Claude Code Review Bot
2026-04-12  0:41 ` Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox