From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: drm/xe/xe_hw_error: Add support for PVC SoC errors Date: Tue, 24 Feb 2026 10:45:42 +1000 Message-ID: In-Reply-To: <20260223060541.526397-12-riana.tauro@intel.com> References: <20260223060541.526397-7-riana.tauro@intel.com> <20260223060541.526397-12-riana.tauro@intel.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit MIME-Version: 1.0 Patch Review **Correctable SoC errors not counted:** > + if (hw_err == HARDWARE_ERROR_CORRECTABLE) { > + xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0)); > + xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0)); > + xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0)); > + xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0)); > + goto unmask_gsysevtctl; > + } For correctable SoC errors, the status registers are cleared (written with all-ones) without reading them or incrementing any counter. This means the `soc-internal` correctable error counter will always remain 0, regardless of how many correctable SoC errors occur. Is this intentional? It seems like a gap -- for GT errors in `gt_hw_error_handler`, correctable errors DO get counted, but SoC correctable errors are silently discarded. **Magic value for GSYSEVTCTL unmask:** > +unmask_gsysevtctl: > + for (i = 0; i < XE_SOC_NUM_IEH; i++) > + xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i), > + (HARDWARE_ERROR_MAX << 1) + 1); `(HARDWARE_ERROR_MAX << 1) + 1` evaluates to `(3 << 1) + 1 = 7 = 0b111`. This presumably enables all three error severity levels. The expression relies on `HARDWARE_ERROR_MAX` being 3 and on the register bit layout matching. A named constant or comment explaining the bit layout would reduce the chance of breakage if `HARDWARE_ERROR_MAX` is ever changed. **Missing nonfatal local error handling in slave IEH:** > + if (hw_err == HARDWARE_ERROR_FATAL) { > + for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE) > + log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity, > + regbit, error_id); > + } `soc_slave_ieh_handler` is called for both NONFATAL and FATAL hw_err (correctable was filtered out earlier). For NONFATAL, the slave local error register is read and cleared, but its bits are not processed or logged. There's no `pvc_slave_local_nonfatal_err_reg` array. If the hardware can report nonfatal slave local errors, they would be silently dropped without counting or logging. **`strcmp` for filtering "Undefined" errors:** > + name = reg_info[err_bit]; > + > + if (strcmp(name, "Undefined")) { Using `strcmp` against a string literal to filter out placeholder entries works but is fragile. A more robust approach would be to use NULL for undefined entries and check `if (name)`. The current approach also means every bit that IS set in the register and mapped to "Undefined" triggers a `strcmp` call, which is minor overhead but unnecessary. --- Generated by Claude Code Patch Reviewer