From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors Date: Tue, 03 Mar 2026 14:32:50 +1000 Message-ID: In-Reply-To: <20260228080858.3063532-11-riana.tauro@intel.com> References: <20260228080858.3063532-7-riana.tauro@intel.com> <20260228080858.3063532-11-riana.tauro@intel.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review **Platform guard broadened without full coverage:** ```c - if (xe->info.platform !=3D XE_BATTLEMAGE) + if (!IS_DGFX(xe)) return; ``` This now enters `hw_error_source_handler` for all discrete GPUs, but `hw_er= ror_info_init` only initializes RAS for PVC. For Battlemage (and future DGF= X), `info` will be NULL. The `if (!info) goto clear_reg` guard prevents a c= rash, but it also means **Battlemage silently stops processing error bits o= ther than CSC** =E2=80=94 a regression from the current code that does proc= ess them (even if just clearing the register). Consider whether this behavi= oral change is intentional. **`xe_hw_error_map` array size issue with `break` vs `continue`:** ```c + for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) { + /* Check error bit is within bounds */ + if (err_bit >=3D ARRAY_SIZE(xe_hw_error_map)) + break; ``` In this patch, `xe_hw_error_map` has only index `[0]`, so `ARRAY_SIZE` is 1= . The `break` exits the entire loop when any bit > 0 is set. This means if = bit 0 (GT) and bit 17 (CSC) are both set in the same status register read, = CSC would be silently skipped. Using `continue` instead of `break` would be= safer, though CSC is handled separately above. In patch 5 the array grows = to size 17, but bits 1-15 would still exit the loop prematurely due to `bre= ak`. **Potential double-counting in subslice error path:** ```c + case ERR_STAT_GT_VECTOR0: + case ERR_STAT_GT_VECTOR1: { + val =3D hweight32(vector); + atomic_add(val, &info[error_id].counter); + ... + err_stat =3D xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err)); + for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) { + if (PVC_ERROR_MASK_SET(hw_err, errbit)) + atomic_inc(&info[error_id].counter); + } ``` The vector register bits and the error status register bits both increment = the same counter. If these represent the same underlying errors reported th= rough two different registers, this double-counts. Please clarify in commen= ts whether these are truly independent error events. **`PVC_ERROR_MASK_SET` macro missing parentheses around `hw_err`:** ```c +#define PVC_ERROR_MASK_SET(hw_err, err_bit) ((hw_err =3D=3D HARDWARE_ERROR= _CORRECTABLE) ? \ ``` Should be `((hw_err) =3D=3D HARDWARE_ERROR_CORRECTABLE)` to prevent operato= r-precedence issues if a complex expression is passed. Same issue in `PVC_GT_VECTOR_LEN`, `ERR_STAT_GT_VECTOR_REG`, and other macr= os using unparenthesized parameters. --- --- Generated by Claude Code Patch Reviewer