public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed
From: Claude Code Review Bot <claude-review@example.com>
To: dri-devel-reviews@example.com
Subject: Claude review: accel/amdxdna: Fix runtime suspend deadlock when there is pending job
Date: Wed, 11 Mar 2026 13:06:24 +1000	[thread overview]
Message-ID: <review-patch1-20260310174936.335616-1-lizhi.hou@amd.com> (raw)
In-Reply-To: <20260310174936.335616-1-lizhi.hou@amd.com>

Patch Review

**PM reference leak on successful job submission:**

In the patched `amdxdna_cmd_submit()`, on success (line 492 in the original), the function returns 0 *without* calling `amdxdna_pm_suspend_put()`. The intent is that the matching `suspend_put` will be called later in `amdxdna_sched_job_cleanup()`. However, `amdxdna_sched_job_cleanup()` is only called from `aie2_job_release()` (via `kref_put`), and `kref_init(&job->refcnt)` is at line 477 in the current code. The `kref_get` in `aie2_sched_job_run()` increments the refcount, and `aie2_job_put` in the notify/error paths decrements it. On the success path, the flow is:

1. `amdxdna_cmd_submit` → `resume_get` (ref=1)
2. `aie2_sched_job_run` → `kref_get` (refcnt=2), then on success, returns fence
3. `aie2_sched_notify` → `aie2_job_put` (refcnt=1) — but **no longer calls `suspend_put`** since that line is removed
4. Eventually `aie2_job_release` → `amdxdna_sched_job_cleanup` → `suspend_put` (ref=0)

This looks correct — the `suspend_put` moved into cleanup fires once per job lifetime. Good.

**Missing `suspend_put` when `aie2_sched_job_run` returns NULL for `!mbox_chann`:**

```c
-	if (!hwctx->priv->mbox_chann) {
-		amdxdna_pm_suspend_put(hwctx->client->xdna);
-		return NULL;
-	}
+	if (!hwctx->priv->mbox_chann)
+		return NULL;
```

When this returns NULL, the DRM scheduler treats it as the job not needing a fence and the job is effectively dropped. In this path the PM reference was already acquired in `amdxdna_cmd_submit()`. Returning NULL here means the scheduler's `free_job` callback will eventually be called, which should lead to `aie2_job_release` → `amdxdna_sched_job_cleanup` → `suspend_put`. **This is fine** as long as the DRM scheduler does call free_job for jobs whose run callback returned NULL. This should be verified — if the scheduler doesn't call `free_job` in this case, the PM reference will leak.

**Error path in `amdxdna_cmd_submit` — `put_bos` label ordering:**

```c
unlock_srcu:
	srcu_read_unlock(&client->hwctx_srcu, idx);
+	amdxdna_pm_suspend_put(xdna);
+put_bos:
	amdxdna_arg_bos_put(job);
```

The `put_bos` label is placed *after* the `suspend_put` call. When the `amdxdna_arg_bos_lookup()` call fails (jumping to `put_bos`), `suspend_put` is **not** called. But `resume_get` was called just above in the patched code:

```c
ret = amdxdna_pm_resume_get(xdna);
if (ret) {
    XDNA_ERR(xdna, "Resume failed, ret %d", ret);
    goto put_bos;
}
```

Wait — if `resume_get` fails, the code jumps to `put_bos`, which skips `suspend_put`. That's **correct** because the resume failed so there's no reference to drop.

However, if `resume_get` succeeds but a later step (e.g., `hwctx` lookup, `fence_create`, `cmd_submit`) fails, the error path goes to `unlock_srcu`, which calls `suspend_put` before `put_bos`. That is also **correct**.

**Double `suspend_put` on `aie2_sched_job_run` error path:**

In the patched `aie2_sched_job_run`, the error path at the `out:` label no longer calls `suspend_put`:

```c
out:
	if (ret) {
-		amdxdna_pm_suspend_put(hwctx->client->xdna);
		dma_fence_put(job->fence);
		aie2_job_put(job);
```

`aie2_job_put` will eventually lead to `amdxdna_sched_job_cleanup` → `suspend_put`. This is correct — one `resume_get` at submit, one `suspend_put` at cleanup.

**Concern — `suspend_put` in `aie2_sched_notify` removed:**

```c
-	amdxdna_pm_suspend_put(job->hwctx->client->xdna);
 	job->hwctx->priv->completed++;
```

This is the normal success path. The `suspend_put` is now deferred to `amdxdna_sched_job_cleanup()` via the kref release. This changes the semantics: the device stays "resumed" slightly longer (until the job object is freed rather than when the job completes). This is functionally correct but means the device won't autosuspend as quickly. For a deadlock fix, this is an acceptable trade-off.

**Minor style note:** The commit message says `[PATCH V1]` — the `V1` is unnecessary for a first posting (it's the default).

**Summary:** The core logic of the fix is sound. The PM reference is taken at submit time (user context, safe to sleep) and released when the job is cleaned up. The key risk to verify is: does the DRM scheduler guarantee that `free_job` is called (and thus `aie2_job_release` triggers) for every job whose `run` callback returned NULL? If not, the `!mbox_chann` early-return path would leak a PM reference. A comment at that return site explaining why it's safe would be valuable.

---
Generated by Claude Code Patch Reviewer

  parent reply	other threads:[~2026-03-11  3:06 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-10 17:49 [PATCH V1] accel/amdxdna: Fix runtime suspend deadlock when there is pending job Lizhi Hou
2026-03-10 17:50 ` Mario Limonciello
2026-03-10 18:00   ` Lizhi Hou
2026-03-11  3:06 ` Claude review: " Claude Code Review Bot
2026-03-11  3:06 ` Claude Code Review Bot [this message]
  -- strict thread matches above, loose matches on Subject: below --
2026-03-10 18:00 [PATCH V2] " Lizhi Hou
2026-03-11  3:04 ` Claude review: " Claude Code Review Bot
2026-03-11  3:04 ` Claude Code Review Bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=review-patch1-20260310174936.335616-1-lizhi.hou@amd.com \
    --to=claude-review@example.com \
    --cc=dri-devel-reviews@example.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox