From mboxrd@z Thu Jan 1 00:00:00 1970
From: Claude Code Review Bot
To: dri-devel-reviews@example.com
Subject: Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
Date: Tue, 24 Feb 2026 10:45:42 +1000
Message-ID:
In-Reply-To: <20260223060541.526397-11-riana.tauro@intel.com>
References: <20260223060541.526397-7-riana.tauro@intel.com> <20260223060541.526397-11-riana.tauro@intel.com>
X-Mailer: Claude Code Patch Reviewer
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0

Patch Review

**Platform guard placement:**

The `hw_error_source_handler` changes the platform check from `XE_BATTLEMAGE` to a broader `IS_DGFX(xe)` check:

> -	if (xe->info.platform != XE_BATTLEMAGE)
> +	if (!IS_DGFX(xe))
>  		return;

But `gt_hw_error_handler` has its own platform guard:

> +	if (xe->info.platform != XE_PVC)
> +		return;

This means for Battlemage (which is DGFX but not PVC), `hw_error_source_handler` will now enter the for_each_set_bit loop and try to process GT/SOC errors via `xe_hw_error_map`, but `ras->info[severity]` was never initialized (since `hw_error_info_init` only runs on PVC). The `if (!info) goto clear_reg` check prevents a crash, but the code path that reaches `gt_hw_error_handler` only to immediately return on the PVC check is somewhat wasteful and fragile. If `xe_hw_error_map` is ever extended for another platform, the guards inside the sub-handlers would need updating too.

**Double-counting in subslice error path:**

> +	case ERR_STAT_GT_VECTOR0:
> +	case ERR_STAT_GT_VECTOR1: {
> +		u32 errbit;
> +
> +		val = hweight32(vector);
> +		atomic_add(val, &info[error_id].counter);
> +		...
> +		err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
> +		for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
> +			if (PVC_ERROR_MASK_SET(hw_err, errbit))
> +				atomic_inc(&info[error_id].counter);
> +		}

For subslice errors, the code first adds `hweight32(vector)` (the number of set bits in the vector register) to the counter, then iterates over the error status register and increments the same counter once for each set bit that matches the error mask. Are these truly independent error sources that should each contribute to the counter, or does the error status register only provide detail about the errors already reported in the vector? The comment says the status register is "only populated once per error", which suggests it is supplementary detail rather than additional errors. If so, the `atomic_inc` for the status register bits double-counts.

**`xe_hw_error_map` size vs `XE_RAS_REG_SIZE`:**

> +static const unsigned long xe_hw_error_map[] = {
> +	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> +};

In this patch the array has only one element (index 0). In `hw_error_source_handler`:

> +		if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
> +			break;

Once `err_bit >= 1`, the loop exits instead of moving on to the next bit, so every set bit above bit 0 in `err_src` is skipped. Should this be `continue` instead of `break`? Because `for_each_set_bit` visits bits in ascending order, every later bit would fail the same bound check, so the two behave the same here: a CSC error at bit 17 is not processed by this loop either way (it is handled by the earlier `if (err_src & REG_BIT(XE_CSC_ERROR))` check before the loop). The `break` is only safe as long as the bound check stays the sole reason to skip a bit.

After patch 5 extends the map with `[XE_SOC_ERROR] = ...` at index 16, the bound check no longer stops the loop before the SOC bit. Bits 1 to 15 then fall inside the array without an entry in the map, and the bound check does not filter them out.
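
The bound check discussed above can be modeled in userspace to see which bits get past it. This is only a sketch, not driver code: the loop imitates the ascending iteration of `for_each_set_bit`, the bit positions 0, 16 and 17 are the ones named in this review, and the component values in `error_map` as well as the helper names `bits_past_bound_check` and `unmapped_entries` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as described in this review; the enum names are made up. */
enum { GT_ERROR_BIT = 0, SOC_ERROR_BIT = 16, CSC_ERROR_BIT = 17 };

/* Sparse map shaped like the patch-5 array; the values 1 and 2 are placeholders. */
static const unsigned long error_map[] = {
	[GT_ERROR_BIT]  = 1,
	[SOC_ERROR_BIT] = 2,
};
#define MAP_LEN (sizeof(error_map) / sizeof(error_map[0]))

/*
 * Walk the set bits of err_src in ascending order and return a mask of
 * the bits that pass the bound check. use_break selects whether an
 * out-of-range bit ends the loop or is merely skipped.
 */
static uint32_t bits_past_bound_check(uint32_t err_src, int use_break)
{
	uint32_t handled = 0;

	assert(MAP_LEN <= 32);
	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(err_src & (UINT32_C(1) << bit)))
			continue;
		if (bit >= MAP_LEN) {
			if (use_break)
				break;
			continue;
		}
		handled |= UINT32_C(1) << bit;
	}
	return handled;
}

/* Count in-range indices that have no designated initializer (value 0). */
static size_t unmapped_entries(void)
{
	size_t n = 0;

	for (size_t i = 0; i < MAP_LEN; i++)
		if (error_map[i] == 0)
			n++;
	return n;
}
```

For an input with bits 0, 5, 16 and 17 set, both variants let bits 0, 5 and 16 through and drop bit 17, and `unmapped_entries()` reports the 15 in-range indices that were zero-filled. In this model the open question is therefore less about `break` vs `continue` and more about what the handler does with an in-range bit whose map entry was never set.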

With the extended array from patch 5, `ARRAY_SIZE(xe_hw_error_map)` is 17 (indices 0-16), so the loop breaks once `err_bit >= 17`. `XE_CSC_ERROR` is bit 17, which is already handled before the loop. So the `break` works for this specific set of patches, but it is fragile: it relies on the bound check being the only reason to skip a bit. `continue` would be more robust.

---
Generated by Claude Code Patch Reviewer