From mboxrd@z Thu Jan 1 00:00:00 1970 From: Claude Code Review Bot To: dri-devel-reviews@example.com Subject: Claude review: Introduce Xe Uncorrectable Error Handling Date: Sun, 12 Apr 2026 14:32:33 +1000 Message-ID: In-Reply-To: <20260406142325.157035-7-mallesh.koujalagi@intel.com> References: <20260406142325.157035-6-mallesh.koujalagi@intel.com> <20260406142325.157035-7-mallesh.koujalagi@intel.com> X-Mailer: Claude Code Patch Reviewer Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Patch Review *Marked "DO NOT REVIEW. COMPILATION ONLY" =E2=80=94 reviewed only for issue= s that affect patches 2-4.* **xe_pci_error.c =E2=80=94 `xe_pci_error_slot_reset()`:** ```c pdev->driver->remove(pdev); devres_release_group(&pdev->dev, xe->devres_group_id); if (!pdev->driver->probe(pdev, ent)) return PCI_ERS_RESULT_RECOVERED; ``` Calling `pdev->driver->remove()` followed by `pdev->driver->probe()` direct= ly from a PCI error handler callback is dangerous. After `remove()` runs, t= he `xe` pointer obtained earlier via `pdev_to_xe_device()` may be freed, ma= king the `devres_release_group()` call use-after-free on `xe->devres_group_= id`. The devres group release should happen before or as part of the remove= , not after. **xe_pci_error.c =E2=80=94 `xe_pci_error_handling()`:** ```c /* Wedge the device to prevent userspace access but don't send the event ye= t */ atomic_set(&xe->wedged.flag, 1); ``` This directly sets the wedged flag bypassing `xe_device_declare_wedged()`, = which means no GT wedging, no PM runtime get, no uevent. The comment says "= don't send the event yet" but there's no clear place where the event is eve= ntually sent for this path (io_frozen case). If the slot_reset then removes= and reprobes, the wedge state is lost with the old device. **xe_pci_error.c =E2=80=94 `extern` declaration:** ```c extern const struct pci_error_handlers xe_pci_error_handlers; ``` This `extern` is in `xe_pci.c` instead of a header. It should be declared i= n a proper header (e.g., `xe_pci_error.h` or similar) that both files inclu= de, to get compiler type checking. **xe_pci_error.c =E2=80=94 `ras_recovery_action_to_pci_result[]`:** ```c static const pci_ers_result_t ras_recovery_action_to_pci_result[] =3D { [XE_RAS_RECOVERY_ACTION_RECOVERED] =3D PCI_ERS_RESULT_RECOVERED, [XE_RAS_RECOVERY_ACTION_RESET] =3D PCI_ERS_RESULT_NEED_RESET, [XE_RAS_RECOVERY_ACTION_DISCONNECT] =3D PCI_ERS_RESULT_DISCONNECT, }; ``` No bounds check on the index used to access this array. If `xe_ras_process_= errors()` ever returns `XE_RAS_RECOVERY_ACTION_MAX` or an out-of-range valu= e, this is an out-of-bounds read. **xe_ras.c =E2=80=94 `severity_to_str()` / `comp_to_str()`:** ```c static inline const char *severity_to_str(struct xe_device *xe, u32 severit= y) { xe_assert(xe, severity < XE_RAS_SEVERITY_MAX); return severity < XE_RAS_SEVERITY_MAX ? xe_ras_severities[severity] : "Unk= nown"; } ``` The `xe_assert()` will crash in debug builds for out-of-range values, makin= g the subsequent bounds-check fallback to `"Unknown"` unreachable. Either t= rust the assert (remove the ternary) or handle gracefully (remove the asser= t). Having both is inconsistent. **xe_ras.c =E2=80=94 `handle_soc_internal_errors()` type punning:** ```c struct xe_ras_csc_error *csc_error =3D (struct xe_ras_csc_error *)error_info->additional_details; ``` and: ```c struct xe_ras_ieh_error *ieh_error =3D (struct xe_ras_ieh_error *)error_info->additional_details; ``` Both CSC and IEH errors cast from `additional_details[15]` (a `u32[]`) to d= ifferent struct types. If both `source.csc` and `source.soc` are set simult= aneously, both branches execute and interpret the same `additional_details`= buffer as two different struct types. The code should use `else if` or cla= rify that these are mutually exclusive. **xe_device.h =E2=80=94 `xe_device_clear_in_recovery()`:** ```c static inline void xe_device_clear_in_recovery(struct xe_device *xe) { atomic_set(&xe->in_recovery, 0); } ``` Minor: extra leading space before `atomic_set`. **xe_ras.c =E2=80=94 `xe_ras_process_errors()` loop:** ```c for (int i =3D 0; i < response.num_errors && i < XE_RAS_NUM_ERROR_ARR; i++= ) { ``` Minor: double space before `i < XE_RAS_NUM_ERROR_ARR`. **xe_survivability_mode.c change:** ```c - if (check_boot_failure(xe)) { + /* Survivability info is not required if enabled via configfs */ + if (!xe_configfs_get_survivability_mode(pdev)) { ``` This changes the condition for creating survivability sysfs info from "boot= failure detected" to "not enabled via configfs". This inverts the logic an= d broadens the scope =E2=80=94 previously the info group was only created o= n boot failure, now it's created whenever configfs doesn't have survivabili= ty mode. This seems like a behavioral change that deserves its own commit a= nd proper explanation. --- Generated by Claude Code Patch Reviewer