From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors Date: Thu, 05 Mar 2026 13:47:41 +1000 Message-ID: In-Reply-To: <20260304074412.464435-11-riana.tauro@intel.com> References: <20260304074412.464435-7-riana.tauro@intel.com> <20260304074412.464435-11-riana.tauro@intel.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review **Bug: `err_src` type changed but original check broken** ```c unsigned long flags, err_src; ... err_src =3D xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err)); if (!err_src) { ``` `err_src` is changed from `u32` to `unsigned long` (for `for_each_set_bit`)= . This is fine on 64-bit, but the `xe_mmio_read32` return is u32 =E2=80=94 = on 64-bit the upper 32 bits will be zero, so no issue. But this is worth no= ting for clarity. **Bug: `xe_hw_error_map` is too small for the full register width** ```c static const unsigned long xe_hw_error_map[] =3D { [XE_GT_ERROR] =3D DRM_XE_RAS_ERR_COMP_CORE_COMPUTE, // index 0 }; ``` This array has only 1 entry (after patch 4) or 17 entries (after patch 5 ad= ds `[XE_SOC_ERROR] =3D ...` at index 16). But the `for_each_set_bit` loop i= terates up to `XE_RAS_REG_SIZE` (32 bits): ```c for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) { if (err_bit >=3D ARRAY_SIZE(xe_hw_error_map)) break; ``` The `break` on out-of-bounds is correct but means any error bits above the = array size will cause the loop to stop entirely, potentially missing lower-= numbered error bits that haven't been processed yet if bits are set in a no= n-sequential order. Wait =E2=80=94 `for_each_set_bit` iterates in ascending= order, so if `ARRAY_SIZE` is 1 (patch 4 only), any bit above 0 will `break= ` the loop. This is actually correct since only bit 0 maps to anything in p= atch 4, but the use of `break` rather than `continue` means if bit 17 (CSC)= is handled before this loop and bit 0 is also set, we'd never reach here d= ue to the `goto clear_reg` after CSC. OK, this works but is fragile. **Concern: Counting methodology may inflate counters** ```c val =3D hweight32(vector); atomic_add(val, &info[error_id].counter); ``` For subslice errors, the code counts the number of set bits in the vector r= egister AND also reads the error status register and counts its set bits, a= ll incrementing the same counter: ```c atomic_add(val, &info[error_id].counter); // vector bits ... for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) { if (PVC_ERROR_MASK_SET(hw_err, errbit)) atomic_inc(&info[error_id].counter); // status bits } ``` This means a single error event could increment the counter by `hweight32(v= ector) + hweight32(err_stat & mask)`. Is this the intended counting behavio= r? It seems like it might double-count or over-count errors. **Minor: Missing `HW_ERR` prefix in some log messages** The new `log_hw_error()` and `log_gt_err()` functions don't use the `HW_ERR= ` prefix: ```c drm_warn(&xe->drm, "%s %s detected\n", name, severity_str); ``` But the original CSC handler and `hw_error_source_handler` do: ```c drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported ..."); ``` This inconsistency makes grep/filtering harder for sysadmins. --- Generated by Claude Code Patch Reviewer