From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: accel/amdxdna: Fix runtime suspend deadlock when there is pending job Date: Wed, 11 Mar 2026 13:06:24 +1000 Message-ID: In-Reply-To: <20260310174936.335616-1-lizhi.hou@amd.com> References: <20260310174936.335616-1-lizhi.hou@amd.com> <20260310174936.335616-1-lizhi.hou@amd.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review **PM reference leak on successful job submission:** In the patched `amdxdna_cmd_submit()`, on success (line 492 in the original= ), the function returns 0 *without* calling `amdxdna_pm_suspend_put()`. The= intent is that the matching `suspend_put` will be called later in `amdxdna= _sched_job_cleanup()`. However, `amdxdna_sched_job_cleanup()` is only calle= d from `aie2_job_release()` (via `kref_put`), and `kref_init(&job->refcnt)`= is at line 477 in the current code. The `kref_get` in `aie2_sched_job_run(= )` increments the refcount, and `aie2_job_put` in the notify/error paths de= crements it. On the success path, the flow is: 1. `amdxdna_cmd_submit` =E2=86=92 `resume_get` (ref=3D1) 2. `aie2_sched_job_run` =E2=86=92 `kref_get` (refcnt=3D2), then on success,= returns fence 3. `aie2_sched_notify` =E2=86=92 `aie2_job_put` (refcnt=3D1) =E2=80=94 but = **no longer calls `suspend_put`** since that line is removed 4. Eventually `aie2_job_release` =E2=86=92 `amdxdna_sched_job_cleanup` =E2= =86=92 `suspend_put` (ref=3D0) This looks correct =E2=80=94 the `suspend_put` moved into cleanup fires onc= e per job lifetime. Good. **Missing `suspend_put` when `aie2_sched_job_run` returns NULL for `!mbox_c= hann`:** ```c - if (!hwctx->priv->mbox_chann) { - amdxdna_pm_suspend_put(hwctx->client->xdna); - return NULL; - } + if (!hwctx->priv->mbox_chann) + return NULL; ``` When this returns NULL, the DRM scheduler treats it as the job not needing = a fence and the job is effectively dropped. In this path the PM reference w= as already acquired in `amdxdna_cmd_submit()`. Returning NULL here means th= e scheduler's `free_job` callback will eventually be called, which should l= ead to `aie2_job_release` =E2=86=92 `amdxdna_sched_job_cleanup` =E2=86=92 `= suspend_put`. **This is fine** as long as the DRM scheduler does call free_= job for jobs whose run callback returned NULL. This should be verified =E2= =80=94 if the scheduler doesn't call `free_job` in this case, the PM refere= nce will leak. **Error path in `amdxdna_cmd_submit` =E2=80=94 `put_bos` label ordering:** ```c unlock_srcu: srcu_read_unlock(&client->hwctx_srcu, idx); + amdxdna_pm_suspend_put(xdna); +put_bos: amdxdna_arg_bos_put(job); ``` The `put_bos` label is placed *after* the `suspend_put` call. When the `amd= xdna_arg_bos_lookup()` call fails (jumping to `put_bos`), `suspend_put` is = **not** called. But `resume_get` was called just above in the patched code: ```c ret =3D amdxdna_pm_resume_get(xdna); if (ret) { XDNA_ERR(xdna, "Resume failed, ret %d", ret); goto put_bos; } ``` Wait =E2=80=94 if `resume_get` fails, the code jumps to `put_bos`, which sk= ips `suspend_put`. That's **correct** because the resume failed so there's = no reference to drop. However, if `resume_get` succeeds but a later step (e.g., `hwctx` lookup, `= fence_create`, `cmd_submit`) fails, the error path goes to `unlock_srcu`, w= hich calls `suspend_put` before `put_bos`. That is also **correct**. **Double `suspend_put` on `aie2_sched_job_run` error path:** In the patched `aie2_sched_job_run`, the error path at the `out:` label no = longer calls `suspend_put`: ```c out: if (ret) { - amdxdna_pm_suspend_put(hwctx->client->xdna); dma_fence_put(job->fence); aie2_job_put(job); ``` `aie2_job_put` will eventually lead to `amdxdna_sched_job_cleanup` =E2=86= =92 `suspend_put`. This is correct =E2=80=94 one `resume_get` at submit, on= e `suspend_put` at cleanup. **Concern =E2=80=94 `suspend_put` in `aie2_sched_notify` removed:** ```c - amdxdna_pm_suspend_put(job->hwctx->client->xdna); job->hwctx->priv->completed++; ``` This is the normal success path. The `suspend_put` is now deferred to `amdx= dna_sched_job_cleanup()` via the kref release. This changes the semantics: = the device stays "resumed" slightly longer (until the job object is freed r= ather than when the job completes). This is functionally correct but means = the device won't autosuspend as quickly. For a deadlock fix, this is an acc= eptable trade-off. **Minor style note:** The commit message says `[PATCH V1]` =E2=80=94 the `V= 1` is unnecessary for a first posting (it's the default). **Summary:** The core logic of the fix is sound. The PM reference is taken = at submit time (user context, safe to sleep) and released when the job is c= leaned up. The key risk to verify is: does the DRM scheduler guarantee that= `free_job` is called (and thus `aie2_job_release` triggers) for every job = whose `run` callback returned NULL? If not, the `!mbox_chann` early-return = path would leak a PM reference. A comment at that return site explaining wh= y it's safe would be valuable. --- Generated by Claude Code Patch Reviewer