From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: accel/amdxdna: add AIE coredump support Date: Wed, 27 May 2026 14:28:07 +1000 Message-ID: In-Reply-To: <20260526172943.1776017-1-lizhi.hou@amd.com> References: <20260526172943.1776017-1-lizhi.hou@amd.com> <20260526172943.1776017-1-lizhi.hou@amd.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review **1. Should be split into two patches** This patch combines a mechanical API refactor (buffer handle abstraction) w= ith new functionality (coredump ioctl). The buffer handle refactoring touch= es `aie2_error.c`, `aie2_message.c`, `aie4_pci.c`, and converts every exist= ing caller of `amdxdna_alloc_msg_buffer` / `amdxdna_free_msg_buffer`. The c= oredump feature adds the new ioctl parameter, message types, and the coredu= mp logic. Splitting these would make bisection and review much easier. **2. Zero-buffer edge case in `amdxdna_get_coredump_cb`** In `aie.c`, the number of buffers is computed as: ```c num_bufs =3D (hwctx->num_col - hwctx->num_unused_col) * aie->metadata.rows; ``` If `num_col =3D=3D num_unused_col`, `num_bufs` is 0, leading to `total_size= =3D 0`. The `element_size < total_size` check passes trivially, then we al= locate a 0-element `buf_list` and 0-element `data_hdls` array, and send a f= irmware message with `num_bufs =3D 0`. This is likely not the intended beha= vior. A guard for `num_bufs =3D=3D 0` should be added, returning an appropr= iate error or short-circuiting. **3. Large kernel memory allocation without limits** Each data buffer is 1MB: ```c size_t data_buf_size =3D SZ_1M; ``` These are allocated in a loop: ```c for (i =3D 0; i < num_bufs; i++) { data_hdls[i] =3D amdxdna_alloc_msg_buffer(xdna, data_buf_size); ``` `num_bufs` is `(num_col - num_unused_col) * metadata.rows`. While these val= ues are hardware-determined, there is no explicit cap on total allocation. = If a context has many active columns and rows, this could result in a very = large kernel memory allocation. Consider adding a sanity check or upper bou= nd on `num_bufs` or `total_size`, or documenting why the hardware-imposed l= imits are sufficient. **4. Permission check uses euid comparison, not pid** The permission check in `amdxdna_get_coredump_cb`: ```c if (!capable(CAP_SYS_ADMIN) && !uid_eq(current_euid(), hwctx->client->filp->filp->f_cred->euid)) { XDNA_ERR(xdna, "Permission denied for context %u", wa->ctx_id); return -EPERM; } ``` This checks the effective UID of the caller against the UID of the DRM file= that created the context. Note that the context is already filtered by `pi= d` and `ctx_id` earlier in the same function: ```c if (hwctx->client->pid !=3D wa->pid || hwctx->id !=3D wa->ctx_id) return 0; ``` The PID match is based on user-supplied `config.pid`, which the user can se= t to any value. The real access control relies on the euid comparison, whic= h is appropriate. However, this means any process running under the same UI= D (not just the owning process) can coredump the context, which may be inte= ntional but worth documenting explicitly. **5. UAPI: buffer serves dual purpose (input and output) without clear docu= mentation** The `DRM_AMDXDNA_AIE_COREDUMP` documentation says: ``` Input: num_element must be 1. buffer points to a user buffer whose size is element_size bytes. The first sizeof(struct amdxdna_drm_aie_coredump) bytes of buffer carry the request (pid + context_id). On success the driver writes rows * cols * 1 MB of tile dump data into buffer. ``` So `buffer` is used as both input (first 16 bytes for the config struct) an= d output (overwritten with coredump data). The `element_size` field has dua= l meaning: on input it must be large enough for the config *and* large enou= gh for the output; on -ENOSPC the driver writes back the required output si= ze, which will be much larger than `sizeof(amdxdna_drm_aie_coredump)`. This= pattern works but is unusual. The documentation should clarify more explic= itly that the config is consumed first and the buffer is then fully overwri= tten. **6. `amdxdna_coredump_buf_entry` uses UAPI types in a kernel-internal head= er** ```c struct amdxdna_coredump_buf_entry { __u64 buf_addr; __u32 buf_size; __u32 reserved; } __packed; ``` This struct in `aie.h` is kernel-internal (used for firmware communication)= , but uses `__u64`/`__u32` (UAPI types) instead of kernel `u64`/`u32`. Mino= r style issue =E2=80=94 kernel-internal structs should use the kernel types. **7. Error semantics change in `amdxdna_alloc_msg_buffer`** The old code returned `ERR_PTR(-EINVAL)` when `order > MAX_PAGE_ORDER`; the= new code returns `ERR_PTR(-ENOMEM)`: ```c if (order > MAX_PAGE_ORDER) goto free_hdl; ... free_hdl: kfree(hdl); return ERR_PTR(-ENOMEM); ``` ENOMEM is arguably more appropriate, but this is a behavioral change for ex= isting callers. If any caller distinguishes EINVAL from ENOMEM, this could = cause issues. In practice, all callers just check `IS_ERR()`, so this is fi= ne but worth noting. **8. `to_buf_size` format specifier mismatch** In `aie2_error.c`, the log format uses `%x` with `to_buf_size()`: ```c XDNA_DBG(xdna, "Async event count %d, buf total size 0x%x", events->event_cnt, to_buf_size(events->hdl)); ``` `to_buf_size()` returns `hdl->size` which is `u32`, so `%x` is correct. No = issue here. **9. `aie4_get_array` only handles coredump** The new `aie4_get_array` function only handles `DRM_AMDXDNA_AIE_COREDUMP`: ```c switch (args->param) { case DRM_AMDXDNA_AIE_COREDUMP: ret =3D amdxdna_get_coredump(&ndev->aie, args); break; default: ret =3D -EOPNOTSUPP; break; } ``` This is being added to `aie4_vf_ops`. Previously VFs had no `get_array` at = all (would return `-EOPNOTSUPP` from the ioctl handler). Now VFs only suppo= rt coredump. This seems intentional but if other `get_array` parameters sho= uld be supported on VFs in the future, they'll need to be added here too. **10. `amdxdna_drm_aie_coredump` struct: `pid` is `__u64` but kernel `pid_t= ` is 32-bit** ```c struct amdxdna_drm_aie_coredump { __u64 pid; __u32 context_id; __u32 pad; }; ``` `pid` is `__u64` in the UAPI struct, but `amdxdna_client.pid` is `pid_t` (3= 2-bit `int`). The comparison `hwctx->client->pid !=3D wa->pid` promotes the= `pid_t` to `u64` correctly, so there's no functional bug. However, using `= __u64` for a PID is unusual =E2=80=94 `__u32` would be more conventional an= d match the kernel type width. Since this is a new UAPI, it should be corre= ct from the start. Consider using `__u32` for `pid` and adding a second `__= u32 pad2` or restructuring. Alternatively, the `__u64` pid could be intenti= onal for alignment or future-proofing, but it should be documented. **11. Minor: `#include ` may not be needed** In `aie.c`, `#include ` is added for `current_euid()` and `ui= d_eq()`. These are typically available through other headers that are likel= y already included transitively. Not a correctness issue, but explicit incl= udes for what you use is good practice. **12. Feature table entry in `npu3_regs.c`** ```c static const struct amdxdna_fw_feature_tbl npu3_fw_feature_table[] =3D { { .major =3D 5, .min_minor =3D 10 }, { .features =3D BIT_U64(AIE4_GET_COREDUMP), .major =3D 5, .min_minor = =3D 24 }, { 0 } }; ``` This correctly gates the coredump feature behind firmware version 5.24+ for= NPU3. The sentinel `{ 0 }` terminates the table. This looks correct. --- **Summary**: The patch is functionally sound for the main path but should (= a) be split into refactoring and feature patches, (b) handle the `num_bufs = =3D=3D 0` edge case, and (c) consider whether `__u64` for `pid` in the UAPI= is the right choice before the ABI is frozen. The permission model and loc= king are correct. --- Generated by Claude Code Patch Reviewer