Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors


From: Claude Code Review Bot <claude-review@example.com>
To: dri-devel-reviews@example.com
Subject: Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
Date: Thu, 05 Mar 2026 13:47:41 +1000
Message-ID: <review-patch4-20260304074412.464435-11-riana.tauro@intel.com>
In-Reply-To: <20260304074412.464435-11-riana.tauro@intel.com>

Patch Review

**Note: `err_src` widened to `unsigned long`; the `!err_src` check is unaffected**

```c
unsigned long flags, err_src;
...
err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
if (!err_src) {
```

`err_src` changes from `u32` to `unsigned long` so that its address can be passed to `for_each_set_bit`. `xe_mmio_read32` still returns a `u32`, which is zero-extended on assignment, so on 64-bit builds the upper 32 bits are always zero and the `!err_src` check behaves as before. There is no functional issue here; it is noted only for clarity.
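The widening can be checked in isolation with a small userspace sketch. `fake_mmio_read32` is a hypothetical stand-in for `xe_mmio_read32` that returns a fixed value; it is not the driver function, and the snippet only illustrates the C conversion rule.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for xe_mmio_read32(); returns a fixed 32-bit value. */
static uint32_t fake_mmio_read32(void)
{
	return UINT32_C(0x80000001); /* bits 31 and 0 set */
}

/* Returns 1 if widening to unsigned long left every bit above 31 clear. */
static int upper_bits_clear(void)
{
	unsigned long err_src = fake_mmio_read32(); /* zero-extended on assignment */

	/* unsigned long is at least 32 bits wide in standard C */
	assert(sizeof(err_src) >= sizeof(uint32_t));
#if ULONG_MAX > 0xffffffffUL
	return (err_src >> 32) == 0;
#else
	return 1; /* 32-bit unsigned long: no upper bits exist */
#endif
}
```

Because the source type is unsigned, bit 31 being set does not smear into the upper half the way a signed conversion would.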

**Bug: `xe_hw_error_map` is too small for the full register width**

```c
static const unsigned long xe_hw_error_map[] = {
    [XE_GT_ERROR]   = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,  // index 0
};
```

This array has only 1 entry (after patch 4) or 17 entries (after patch 5 adds `[XE_SOC_ERROR] = ...` at index 16). But the `for_each_set_bit` loop iterates up to `XE_RAS_REG_SIZE` (32 bits):

```c
for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
    if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
        break;
```

`for_each_set_bit` visits set bits in ascending order, so the `break` fires only after every in-range bit has been processed, and no mapped bit is skipped. With `ARRAY_SIZE` equal to 1 (patch 4 only), any set bit above 0 ends the loop, which is correct because only bit 0 has a mapping at this point. Two things make this fragile, though: the `break` depends on the ascending iteration order where `continue` would not, and if bit 17 (CSC) is set it is handled before this loop and followed by `goto clear_reg`, so a bit 0 set at the same time never reaches the loop. This works but is fragile.
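The ordering argument can be illustrated with a userspace sketch. Everything here is an assumption for illustration: `demo_map`, `REG_BITS` and `handled_bits` are hypothetical names, the loop is a plain ascending scan standing in for `for_each_set_bit`, and the map value is a placeholder rather than the real uAPI constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define REG_BITS 32 /* stands in for XE_RAS_REG_SIZE */

/* Hypothetical one-entry map, mirroring the array size after patch 4. */
static const unsigned long demo_map[] = {
	[0] = 1, /* placeholder component id */
};

/*
 * Scan the set bits of err_src in ascending order and return a mask of
 * the bits that had a map entry before the loop stopped.
 */
static uint32_t handled_bits(uint32_t err_src, int use_continue)
{
	uint32_t handled = 0;

	for (unsigned int bit = 0; bit < REG_BITS; bit++) {
		if (!(err_src & (UINT32_C(1) << bit)))
			continue; /* clear bit, nothing to do */
		if (bit >= ARRAY_SIZE(demo_map)) {
			if (use_continue)
				continue;
			break; /* all remaining set bits are out of range too */
		}
		assert(demo_map[bit] != 0); /* in range: entry exists */
		handled |= UINT32_C(1) << bit;
	}
	return handled;
}
```

With bits 0 and 17 both set (`0x20001`), both variants handle exactly bit 0, so `break` is safe only because the scan is ascending.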

**Concern: Counting methodology may inflate counters**

```c
val = hweight32(vector);
atomic_add(val, &info[error_id].counter);
```

For subslice errors, the code counts the number of set bits in the vector register AND also reads the error status register and counts its set bits, all incrementing the same counter:

```c
atomic_add(val, &info[error_id].counter);  // vector bits
...
for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
    if (PVC_ERROR_MASK_SET(hw_err, errbit))
        atomic_inc(&info[error_id].counter);  // status bits
}
```

This means a single error event could increment the counter by `hweight32(vector) + hweight32(err_stat & mask)`. Is this the intended counting behavior? It seems like it might double-count or over-count errors.
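To make the arithmetic concrete, here is a minimal userspace sketch of the increment a single event would produce under that reading. It assumes `PVC_ERROR_MASK_SET` can be modelled as a plain bitmask (`mask`) that already excludes bits beyond `GT_HW_ERROR_MAX_ERR_BITS`; `popcount32` and `counter_delta` are hypothetical names, not driver functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable population count; stands in for the kernel's hweight32(). */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1) /* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * Add to *counter what one event would add under the quoted scheme:
 * set bits in the vector register plus masked set bits in the status
 * register. Returns the amount added.
 */
static unsigned int counter_delta(unsigned int *counter, uint32_t vector,
				  uint32_t err_stat, uint32_t mask)
{
	unsigned int delta = popcount32(vector) + popcount32(err_stat & mask);

	assert(counter != NULL);
	*counter += delta;
	return delta;
}
```

For example, a vector of `0x3` with a status of `0x5` and a mask of `0xf` adds 4 to the counter for what may be a single hardware event.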

**Minor: Missing `HW_ERR` prefix in some log messages**

The new `log_hw_error()` and `log_gt_err()` functions don't use the `HW_ERR` prefix:

```c
drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
```

But the original CSC handler and `hw_error_source_handler` do:

```c
drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported ...");
```

This inconsistency makes grep/filtering harder for sysadmins.
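One possible way to align the messages is to put the prefix into the format string of the new helpers. The sketch below is userspace C, not driver code: the `HW_ERR` string is assumed to match the kernel's definition (worth checking against `include/linux/printk.h`), and `format_hw_error` and `has_hw_err_tag` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed to match the kernel's HW_ERR prefix string. */
#define HW_ERR "[Hardware Error]: "

/* Format the message from the quoted drm_warn() call with the prefix added. */
static int format_hw_error(char *buf, size_t len, const char *name,
			   const char *severity_str)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, HW_ERR "%s %s detected\n", name, severity_str);
}

/* Returns 1 if a log line would be caught by filtering on the prefix. */
static int has_hw_err_tag(const char *line)
{
	return strstr(line, HW_ERR) != NULL;
}
```

A filter that matches on the prefix would then see the new messages as well as the existing CSC ones.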

---
Generated by Claude Code Patch Reviewer

