From mboxrd@z Thu Jan  1 00:00:00 1970
From: Claude Code Review Bot <claude-review@example.com>
To: dri-devel-reviews@example.com
Subject: Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
Date: Thu, 05 Mar 2026 13:47:41 +1000
Message-ID: <review-patch4-20260304074412.464435-11-riana.tauro@intel.com>
In-Reply-To: <20260304074412.464435-11-riana.tauro@intel.com>
References: <20260304074412.464435-7-riana.tauro@intel.com> <20260304074412.464435-11-riana.tauro@intel.com>
X-Mailer: Claude Code Patch Reviewer
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Patch Review

**Bug: `err_src` type changed but original check broken**

```c
unsigned long flags, err_src;
...
err_src =3D xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
if (!err_src) {
```

`err_src` is changed from `u32` to `unsigned long` (for `for_each_set_bit`)=
. This is fine on 64-bit, but the `xe_mmio_read32` return is u32 =E2=80=94 =
on 64-bit the upper 32 bits will be zero, so no issue. But this is worth no=
ting for clarity.

**Bug: `xe_hw_error_map` is too small for the full register width**

```c
static const unsigned long xe_hw_error_map[] =3D {
    [XE_GT_ERROR]   =3D DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,  // index 0
};
```

This array has only 1 entry (after patch 4) or 17 entries (after patch 5 ad=
ds `[XE_SOC_ERROR] =3D ...` at index 16). But the `for_each_set_bit` loop i=
terates up to `XE_RAS_REG_SIZE` (32 bits):

```c
for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
    if (err_bit >=3D ARRAY_SIZE(xe_hw_error_map))
        break;
```

The `break` on out-of-bounds is correct but means any error bits above the =
array size will cause the loop to stop entirely, potentially missing lower-=
numbered error bits that haven't been processed yet if bits are set in a no=
n-sequential order. Wait =E2=80=94 `for_each_set_bit` iterates in ascending=
 order, so if `ARRAY_SIZE` is 1 (patch 4 only), any bit above 0 will `break=
` the loop. This is actually correct since only bit 0 maps to anything in p=
atch 4, but the use of `break` rather than `continue` means if bit 17 (CSC)=
 is handled before this loop and bit 0 is also set, we'd never reach here d=
ue to the `goto clear_reg` after CSC. OK, this works but is fragile.

**Concern: Counting methodology may inflate counters**

```c
val =3D hweight32(vector);
atomic_add(val, &info[error_id].counter);
```

For subslice errors, the code counts the number of set bits in the vector r=
egister AND also reads the error status register and counts its set bits, a=
ll incrementing the same counter:

```c
atomic_add(val, &info[error_id].counter);  // vector bits
...
for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
    if (PVC_ERROR_MASK_SET(hw_err, errbit))
        atomic_inc(&info[error_id].counter);  // status bits
}
```

This means a single error event could increment the counter by `hweight32(v=
ector) + hweight32(err_stat & mask)`. Is this the intended counting behavio=
r? It seems like it might double-count or over-count errors.

**Minor: Missing `HW_ERR` prefix in some log messages**

The new `log_hw_error()` and `log_gt_err()` functions don't use the `HW_ERR=
` prefix:

```c
drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
```

But the original CSC handler and `hw_error_source_handler` do:

```c
drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported ...");
```

This inconsistency makes grep/filtering harder for sysadmins.

---
Generated by Claude Code Patch Reviewer