From: Claude Code Review Bot <claude-review@example.com>
To: dri-devel-reviews@example.com
Subject: Claude review: accel/amdxdna: Fix runtime suspend deadlock when there is pending job
Date: Wed, 11 Mar 2026 13:04:35 +1000 [thread overview]
Message-ID: <review-patch1-20260310180058.336348-1-lizhi.hou@amd.com> (raw)
In-Reply-To: <20260310180058.336348-1-lizhi.hou@amd.com>
Patch Review
**Deadlock analysis is correct.** The described scenario — suspend path waiting for job completion while job execution path waits for PM resume — is a real deadlock and this approach of moving PM management to submit/cleanup boundaries is the right structural fix.
**PM reference imbalance on error paths in `aie2_sched_job_run()`:**
In the original code, when `aie2_sched_job_run()` succeeds, the PM reference acquired at the start is released later by `aie2_sched_notify()`. With this patch, the `amdxdna_pm_suspend_put()` call is removed from `aie2_sched_notify()` (line 168 in the current tree) and instead placed in `amdxdna_sched_job_cleanup()`, which runs when the job refcount reaches zero via `aie2_job_release()`.
However, looking at the error path in `aie2_sched_job_run()`:
```c
out:
if (ret) {
- amdxdna_pm_suspend_put(hwctx->client->xdna);
dma_fence_put(job->fence);
aie2_job_put(job); /* drops refcnt, may call cleanup */
mmput(job->mm);
fence = ERR_PTR(ret);
}
```
When `aie2_sched_job_run()` fails (ret != 0), `aie2_job_put()` will decrement the refcount. If this drops it to zero, `aie2_job_release()` → `amdxdna_sched_job_cleanup()` will call `amdxdna_pm_suspend_put()`. But the refcount was incremented by `kref_get()` earlier in `aie2_sched_job_run()` (at the line `kref_get(&job->refcnt)`), and the initial refcount was set at submit time via `kref_init()`. So `aie2_job_put()` here drops the extra ref taken in `run`, leaving the initial ref still held. The job cleanup (with the PM put) won't happen until the DRM scheduler framework drops the final reference. This should be fine for balance — the PM get happened at submit, the PM put happens at final cleanup.
**But consider this error path:** When `aie2_sched_job_run()` returns `NULL` (not an error pointer):
```c
if (!hwctx->priv->mbox_chann)
return NULL;
```
Returning `NULL` from `run` means the scheduler treats it as "no fence" but not an error. The job won't go through the normal completion path (`aie2_sched_notify()`), but the PM reference taken at submit time will only be released when the job is eventually freed through cleanup. This seems acceptable since the scheduler will still call `free_job` eventually.
**The early `return NULL` for PM resume failure is removed but needs attention:**
In the original code:
```c
ret = amdxdna_pm_resume_get(hwctx->client->xdna);
if (ret)
return NULL;
```
Returning `NULL` here was already questionable (no PM reference to worry about since get failed), and now it's simply removed since PM is handled at submit time. This is correct.
**Error path label ordering in `amdxdna_cmd_submit()`:**
```c
+ ret = amdxdna_pm_resume_get(xdna);
+ if (ret) {
+ XDNA_ERR(xdna, "Resume failed, ret %d", ret);
+ goto put_bos;
+ }
...
unlock_srcu:
srcu_read_unlock(&client->hwctx_srcu, idx);
+ amdxdna_pm_suspend_put(xdna);
+put_bos:
amdxdna_arg_bos_put(job);
```
The error unwinding looks correct: on success the PM reference is held (released later in `amdxdna_sched_job_cleanup()`); on failure after PM resume, `amdxdna_pm_suspend_put()` is called before falling through to `put_bos`. If PM resume itself fails, we jump directly to `put_bos`, skipping the `suspend_put` — correct, since the `resume_get` failed.
**Potential concern — PM reference held too long:** Previously the PM reference was held from `run` to `notify` (execution window). Now it's held from `submit` to job cleanup (entire job lifetime including time waiting in the scheduler queue). This means the autosuspend timer effectively can never fire if there are queued jobs, even if the hardware is idle. For a device with a 5-second autosuspend delay (`AMDXDNA_AUTOSUSPEND_DELAY`), this could keep the device awake longer than necessary. This is a tradeoff for correctness, which is reasonable, but worth noting.
**`use after put` concern in error path:** In `aie2_sched_job_run()` error path:
```c
if (ret) {
dma_fence_put(job->fence);
aie2_job_put(job);
mmput(job->mm); /* <-- accessing job->mm after potential free */
fence = ERR_PTR(ret);
}
```
This is a pre-existing issue (not introduced by this patch), but `aie2_job_put()` could theoretically free `job` if the refcount drops to zero, making the subsequent `mmput(job->mm)` a use-after-free. In practice the scheduler likely holds an additional reference, but it's worth noting.
**Summary:** The fix is correct for the stated deadlock. The PM reference lifetime change (submit-to-cleanup vs run-to-notify) is a reasonable tradeoff. The patch is clean, well-described, and the error paths in the new code are properly handled.
---
Generated by Claude Code Patch Reviewer
next prev parent reply other threads:[~2026-03-11 3:04 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-10 18:00 [PATCH V2] accel/amdxdna: Fix runtime suspend deadlock when there is pending job Lizhi Hou
2026-03-10 18:33 ` Mario Limonciello
2026-03-10 18:52 ` Lizhi Hou
2026-03-11 3:04 ` Claude Code Review Bot [this message]
2026-03-11 3:04 ` Claude review: " Claude Code Review Bot
-- strict thread matches above, loose matches on Subject: below --
2026-03-10 17:49 [PATCH V1] " Lizhi Hou
2026-03-11 3:06 ` Claude review: " Claude Code Review Bot
2026-03-11 3:06 ` Claude Code Review Bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=review-patch1-20260310180058.336348-1-lizhi.hou@amd.com \
--to=claude-review@example.com \
--cc=dri-devel-reviews@example.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox