From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: accel/amdxdna: Check for device hang on job timeout Date: Sun, 12 Apr 2026 10:41:19 +1000 Message-ID: In-Reply-To: <20260409175826.195665-1-lizhi.hou@amd.com> References: <20260409175826.195665-1-lizhi.hou@amd.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Overall Series Review Subject: accel/amdxdna: Check for device hang on job timeout Author: Lizhi Hou Patches: 2 Reviewed: 2026-04-12T10:41:19.407531 --- This is a single patch that adds Timeout Detection and Recovery (TDR) forwa= rd-progress checking to the amdxdna accelerator driver. The core idea is so= und: instead of treating every scheduler timeout as a device hang, the driv= er checks whether the device has made forward progress (submitted or comple= ted any job) since the last timeout check. If it has, the timeout handler r= eturns `DRM_GPU_SCHED_STAT_NO_HANG` rather than resetting the context. However, the implementation has several significant issues that should be a= ddressed before merging: 1. **Global-level tracking for per-context schedulers** =E2=80=94 the `tdr_= status` field lives on the global `amdxdna_dev_hdl`, but each hardware cont= ext has its own `drm_gpu_scheduler`. A job completing in context A will pre= vent context B from ever detecting a hang. 2. **Module parameter type mismatch** =E2=80=94 declared `uint` but registe= red as `int`. 3. **Non-static global variable** pollutes the module namespace. 4. **Missing initialization** of `tdr_status`. 5. **The timeout change from 60s to 2s is smuggled into this patch** with m= inimal justification. --- Generated by Claude Code Patch Reviewer