From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: accel/amdxdna: Fix runtime suspend deadlock when there is pending job Date: Wed, 11 Mar 2026 13:04:35 +1000 Message-ID: In-Reply-To: <20260310180058.336348-1-lizhi.hou@amd.com> References: <20260310180058.336348-1-lizhi.hou@amd.com> <20260310180058.336348-1-lizhi.hou@amd.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review **Deadlock analysis is correct.** The described scenario =E2=80=94 suspend = path waiting for job completion while job execution path waits for PM resum= e =E2=80=94 is a real deadlock and this approach of moving PM management to= submit/cleanup boundaries is the right structural fix. **PM reference imbalance on error paths in `aie2_sched_job_run()`:** In the original code, when `aie2_sched_job_run()` succeeds, the PM referenc= e acquired at the start is released later by `aie2_sched_notify()`. With th= is patch, the `amdxdna_pm_suspend_put()` call is removed from `aie2_sched_n= otify()` (line 168 in the current tree) and instead placed in `amdxdna_sche= d_job_cleanup()`, which runs when the job refcount reaches zero via `aie2_j= ob_release()`. However, looking at the error path in `aie2_sched_job_run()`: ```c out: if (ret) { - amdxdna_pm_suspend_put(hwctx->client->xdna); dma_fence_put(job->fence); aie2_job_put(job); /* drops refcnt, may call cleanup */ mmput(job->mm); fence =3D ERR_PTR(ret); } ``` When `aie2_sched_job_run()` fails (ret !=3D 0), `aie2_job_put()` will decre= ment the refcount. If this drops it to zero, `aie2_job_release()` =E2=86=92= `amdxdna_sched_job_cleanup()` will call `amdxdna_pm_suspend_put()`. But th= e refcount was incremented by `kref_get()` earlier in `aie2_sched_job_run()= ` (at the line `kref_get(&job->refcnt)`), and the initial refcount was set = at submit time via `kref_init()`. So `aie2_job_put()` here drops the extra = ref taken in `run`, leaving the initial ref still held. The job cleanup (wi= th the PM put) won't happen until the DRM scheduler framework drops the fin= al reference. This should be fine for balance =E2=80=94 the PM get happened= at submit, the PM put happens at final cleanup. **But consider this error path:** When `aie2_sched_job_run()` returns `NULL= ` (not an error pointer): ```c if (!hwctx->priv->mbox_chann) return NULL; ``` Returning `NULL` from `run` means the scheduler treats it as "no fence" but= not an error. The job won't go through the normal completion path (`aie2_s= ched_notify()`), but the PM reference taken at submit time will only be rel= eased when the job is eventually freed through cleanup. This seems acceptab= le since the scheduler will still call `free_job` eventually. **The early `return NULL` for PM resume failure is removed but needs attent= ion:** In the original code: ```c ret =3D amdxdna_pm_resume_get(hwctx->client->xdna); if (ret) return NULL; ``` Returning `NULL` here was already questionable (no PM reference to worry ab= out since get failed), and now it's simply removed since PM is handled at s= ubmit time. This is correct. **Error path label ordering in `amdxdna_cmd_submit()`:** ```c + ret =3D amdxdna_pm_resume_get(xdna); + if (ret) { + XDNA_ERR(xdna, "Resume failed, ret %d", ret); + goto put_bos; + } ... unlock_srcu: srcu_read_unlock(&client->hwctx_srcu, idx); + amdxdna_pm_suspend_put(xdna); +put_bos: amdxdna_arg_bos_put(job); ``` The error unwinding looks correct: on success the PM reference is held (rel= eased later in `amdxdna_sched_job_cleanup()`); on failure after PM resume, = `amdxdna_pm_suspend_put()` is called before falling through to `put_bos`. I= f PM resume itself fails, we jump directly to `put_bos`, skipping the `susp= end_put` =E2=80=94 correct, since the `resume_get` failed. **Potential concern =E2=80=94 PM reference held too long:** Previously the = PM reference was held from `run` to `notify` (execution window). Now it's h= eld from `submit` to job cleanup (entire job lifetime including time waitin= g in the scheduler queue). This means the autosuspend timer effectively can= never fire if there are queued jobs, even if the hardware is idle. For a d= evice with a 5-second autosuspend delay (`AMDXDNA_AUTOSUSPEND_DELAY`), this= could keep the device awake longer than necessary. This is a tradeoff for = correctness, which is reasonable, but worth noting. **`use after put` concern in error path:** In `aie2_sched_job_run()` error = path: ```c if (ret) { dma_fence_put(job->fence); aie2_job_put(job); mmput(job->mm); /* <-- accessing job->mm after potential free */ fence =3D ERR_PTR(ret); } ``` This is a pre-existing issue (not introduced by this patch), but `aie2_job_= put()` could theoretically free `job` if the refcount drops to zero, making= the subsequent `mmput(job->mm)` a use-after-free. In practice the schedule= r likely holds an additional reference, but it's worth noting. **Summary:** The fix is correct for the stated deadlock. The PM reference l= ifetime change (submit-to-cleanup vs run-to-notify) is a reasonable tradeof= f. The patch is clean, well-described, and the error paths in the new code = are properly handled. --- Generated by Claude Code Patch Reviewer