* [PATCH v6 0/5] Introduce cold reset recovery method
@ 2026-05-20 11:33 Mallesh Koujalagi
2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw)
To: intel-xe, dri-devel, rodrigo.vivi
Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard,
maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar,
riana.tauro, karthik.poosa, sk.anirban, raag.jadav,
Mallesh Koujalagi
This series builds on top of Introduce Xe Uncorrectable Error Handling[1]
and adds support for handling errors that require a complete
device power cycle (cold reset) to recover.
Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.
To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a cold
reset is required. This allows userspace to take appropriate action to
power-cycle the device.
Example uevent received:
SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0
Detailed description in commit message.
[1] https://patchwork.freedesktop.org/series/160482/
This patch series introduces a call to punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors detected.
v2:
- Add use case: Handling errors from power management unit,
which requires a complete power cycle to
recover. (Christian)
- Add several instead of number to avoid update. (Jani)
v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wordingi.
- Remove log. (Raag)
v4:
- Rename cold reset to power cyclce. (Raag)
- Update doc. (Raag/Riana)
- Change commit message. (Raag)
- Make function static. (Raag)
v5:
- Make it consistent with consumer expectations. (Raag)
- Update commit message.
- Remove unbind.
- Simplify cold-reset script.
- Remove kdoc for static function.
- Remove xe_ prefix for static function.
v6:
- Drop "last resort" wording. (Riana)
- Look up the hotplug slot in DEVPATH instead of scanning
every PCI slot on the system. (Raag)
- Drop arbitrary sleep values from the example script.
- Expand commit message to explain why SUR_DN is masked. (Raag/Riana)
- Check Slot Implemented bit before reading Slot Capabilities, per
PCIe spec. (Riana)
- Add debug log.
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Mallesh Koujalagi (4):
drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
drm/xe: Handle PUNIT errors by requesting cold-reset recovery
drm/xe: Suppress Surprise Link Down on non-hotplug device
Riana Tauro (1):
Introduce Xe Uncorrectable Error Handling
Documentation/gpu/drm-uapi.rst | 60 +-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/xe_device.c | 19 +-
drivers/gpu/drm/xe/xe_device.h | 15 +
drivers/gpu/drm/xe/xe_device_types.h | 6 +
drivers/gpu/drm/xe/xe_gt.c | 14 +-
drivers/gpu/drm/xe/xe_guc_submit.c | 9 +-
drivers/gpu/drm/xe/xe_pci.c | 10 +
drivers/gpu/drm/xe/xe_pci_error.c | 138 +++++
drivers/gpu/drm/xe/xe_ras.c | 567 ++++++++++++++++++
drivers/gpu/drm/xe/xe_ras.h | 5 +-
drivers/gpu/drm/xe/xe_ras_types.h | 215 +++++++
drivers/gpu/drm/xe/xe_survivability_mode.c | 13 +-
drivers/gpu/drm/xe/xe_sysctrl_event.c | 2 +-
drivers/gpu/drm/xe/xe_sysctrl_event_types.h | 2 +-
drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 11 +
include/drm/drm_device.h | 1 +
18 files changed, 1069 insertions(+), 21 deletions(-)
create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
--
2.34.1
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi @ 2026-05-20 11:33 ` Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi ` (4 subsequent siblings) 5 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi From: Riana Tauro <riana.tauro@intel.com> DO NOT REVIEW. COMPILATION ONLY This patch is from https://patchwork.freedesktop.org/series/160482/ Added only for Compilation. Signed-off-by: Riana Tauro <riana.tauro@intel.com> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> --- drivers/gpu/drm/xe/Makefile | 1 + drivers/gpu/drm/xe/xe_device.c | 19 +- drivers/gpu/drm/xe/xe_device.h | 15 + drivers/gpu/drm/xe/xe_device_types.h | 6 + drivers/gpu/drm/xe/xe_gt.c | 14 +- drivers/gpu/drm/xe/xe_guc_submit.c | 9 +- drivers/gpu/drm/xe/xe_pci.c | 10 + drivers/gpu/drm/xe/xe_pci_error.c | 138 +++++ drivers/gpu/drm/xe/xe_ras.c | 493 ++++++++++++++++++ drivers/gpu/drm/xe/xe_ras.h | 5 +- drivers/gpu/drm/xe/xe_ras_types.h | 215 ++++++++ drivers/gpu/drm/xe/xe_survivability_mode.c | 13 +- drivers/gpu/drm/xe/xe_sysctrl_event.c | 2 +- drivers/gpu/drm/xe/xe_sysctrl_event_types.h | 2 +- drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 11 + 15 files changed, 933 insertions(+), 20 deletions(-) create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile index 09661f079d03..091872771e98 100644 --- a/drivers/gpu/drm/xe/Makefile +++ b/drivers/gpu/drm/xe/Makefile @@ -101,6 +101,7 @@ xe-y += xe_bb.o \ xe_page_reclaim.o \ xe_pat.o \ xe_pci.o \ + xe_pci_error.o \ xe_pci_rebar.o \ xe_pcode.o \ xe_pm.o \ diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index 576095cf0952..5a852e353258 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -61,6 +61,7 @@ #include "xe_psmi.h" #include "xe_pxp.h" #include "xe_query.h" +#include "xe_ras.h" #include "xe_shrinker.h" #include "xe_soc_remapper.h" #include "xe_survivability_mode.h" @@ -963,6 +964,16 @@ int xe_device_probe(struct xe_device *xe) if (err) return err; + err = xe_soc_remapper_init(xe); + if (err) + return err; + + err = xe_sysctrl_init(xe); + if (err) + return err; + + xe_ras_init(xe); + /* * Now that GT is initialized (TTM in particular), * we can try to init display, and inherit the initial fb. @@ -1003,10 +1014,6 @@ int xe_device_probe(struct xe_device *xe) xe_nvm_init(xe); - err = xe_soc_remapper_init(xe); - if (err) - return err; - err = xe_heci_gsc_init(xe); if (err) return err; @@ -1045,10 +1052,6 @@ int xe_device_probe(struct xe_device *xe) if (err) goto err_unregister_display; - err = xe_sysctrl_init(xe); - if (err) - goto err_unregister_display; - err = xe_device_sysfs_init(xe); if (err) goto err_unregister_display; diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h index 355d69dc8f54..765e90f4220f 100644 --- a/drivers/gpu/drm/xe/xe_device.h +++ b/drivers/gpu/drm/xe/xe_device.h @@ -181,6 +181,21 @@ static inline bool xe_device_has_mert(const struct xe_device *xe) return xe->info.has_mert; } +static inline bool xe_device_is_in_reset(struct xe_device *xe) +{ + return atomic_read(&xe->in_reset); +} + +static inline void xe_device_set_in_reset(struct xe_device *xe) +{ + atomic_set(&xe->in_reset, 1); +} + +static inline void xe_device_clear_in_reset(struct xe_device *xe) +{ + atomic_set(&xe->in_reset, 0); +} + u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size); void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p); diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h index 32dd2ffbc796..f64e1a149cee 100644 --- a/drivers/gpu/drm/xe/xe_device_types.h +++ b/drivers/gpu/drm/xe/xe_device_types.h @@ -483,6 +483,9 @@ struct xe_device { /** @needs_flr_on_fini: requests function-reset on fini */ bool needs_flr_on_fini; + /** @in_reset: Indicates if device is in reset */ + atomic_t in_reset; + /** @wedged: Struct to control Wedged States and mode */ struct { /** @wedged.flag: Xe device faced a critical error and is now blocked. */ @@ -495,6 +498,9 @@ struct xe_device { bool inconsistent_reset; } wedged; + /** @devres_group_id: id for devres group */ + void *devres_group_id; + /** @bo_device: Struct to control async free of BOs */ struct xe_bo_dev { /** @bo_device.async_free: Free worker */ diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c index 783eb6d631b5..c117f6cf5e3e 100644 --- a/drivers/gpu/drm/xe/xe_gt.c +++ b/drivers/gpu/drm/xe/xe_gt.c @@ -917,6 +917,9 @@ static void gt_reset_worker(struct work_struct *w) if (xe_device_wedged(gt_to_xe(gt))) goto err_pm_put; + if (xe_device_is_in_reset(gt_to_xe(gt))) + goto err_pm_put; + /* We only support GT resets with GuC submission */ if (!xe_device_uc_enabled(gt_to_xe(gt))) goto err_pm_put; @@ -977,18 +980,23 @@ static void gt_reset_worker(struct work_struct *w) void xe_gt_reset_async(struct xe_gt *gt) { - xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0)); + struct xe_device *xe = gt_to_xe(gt); + + if (xe_device_is_in_reset(xe)) + return; /* Don't do a reset while one is already in flight */ if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(>->uc)) return; + xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0)); + xe_gt_info(gt, "reset queued\n"); /* Pair with put in gt_reset_worker() if work is enqueued */ - xe_pm_runtime_get_noresume(gt_to_xe(gt)); + xe_pm_runtime_get_noresume(xe); if (!queue_work(gt->ordered_wq, >->reset.worker)) - xe_pm_runtime_put(gt_to_xe(gt)); + xe_pm_runtime_put(xe); } void xe_gt_suspend_prepare(struct xe_gt *gt) diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c index afd8cc7bd231..f21a1f709c4a 100644 --- a/drivers/gpu/drm/xe/xe_guc_submit.c +++ b/drivers/gpu/drm/xe/xe_guc_submit.c @@ -1514,7 +1514,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job) * If devcoredump not captured and GuC capture for the job is not ready * do manual capture first and decide later if we need to use it */ - if (!exec_queue_killed(q) && !xe->devcoredump.captured && + if (!xe_device_is_in_reset(xe) && !exec_queue_killed(q) && !xe->devcoredump.captured && !xe_guc_capture_get_matching_and_lock(q)) { /* take force wake before engine register manual capture */ CLASS(xe_force_wake, fw_ref)(gt_to_fw(q->gt), XE_FORCEWAKE_ALL); @@ -1536,8 +1536,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job) set_exec_queue_banned(q); /* Kick job / queue off hardware */ - if (!wedged && (exec_queue_enabled(primary) || - exec_queue_pending_disable(primary))) { + if (!xe_device_is_in_reset(xe) && !wedged && + (exec_queue_enabled(primary) || exec_queue_pending_disable(primary))) { int ret; if (exec_queue_reset(primary)) @@ -1605,7 +1605,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job) trace_xe_sched_job_timedout(job); - if (!exec_queue_killed(q)) + /* Do not access device if in reset */ + if (!xe_device_is_in_reset(xe) && !exec_queue_killed(q)) xe_devcoredump(q, job, "Timedout job - seqno=%u, lrc_seqno=%u, guc_id=%d, flags=0x%lx", xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job), diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index e2a1232c4fbf..b0b5f2fe7ba8 100644 --- a/drivers/gpu/drm/xe/xe_pci.c +++ b/drivers/gpu/drm/xe/xe_pci.c @@ -1063,6 +1063,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) const struct xe_device_desc *desc = (const void *)ent->driver_data; const struct xe_subplatform_desc *subplatform_desc; struct xe_device *xe; + void *devres_id; int err; xe_configfs_check_device(pdev); @@ -1088,6 +1089,10 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) if (xe_display_driver_probe_defer(pdev)) return -EPROBE_DEFER; + devres_id = devres_open_group(&pdev->dev, NULL, GFP_KERNEL); + if (!devres_id) + return -ENOMEM; + err = pcim_enable_device(pdev); if (err) return err; @@ -1096,6 +1101,8 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) if (IS_ERR(xe)) return PTR_ERR(xe); + xe->devres_group_id = devres_id; + pci_set_drvdata(pdev, &xe->drm); xe_pm_assert_unbounded_bridge(xe); @@ -1331,6 +1338,8 @@ static const struct dev_pm_ops xe_pm_ops = { }; #endif +extern const struct pci_error_handlers xe_pci_error_handlers; + static struct pci_driver xe_pci_driver = { .name = DRIVER_NAME, .id_table = pciidlist, @@ -1338,6 +1347,7 @@ static struct pci_driver xe_pci_driver = { .remove = xe_pci_remove, .shutdown = xe_pci_shutdown, .sriov_configure = xe_pci_sriov_configure, + .err_handler = &xe_pci_error_handlers, #ifdef CONFIG_PM_SLEEP .driver.pm = &xe_pm_ops, #endif diff --git a/drivers/gpu/drm/xe/xe_pci_error.c b/drivers/gpu/drm/xe/xe_pci_error.c new file mode 100644 index 000000000000..8d62bcbcbbb6 --- /dev/null +++ b/drivers/gpu/drm/xe/xe_pci_error.c @@ -0,0 +1,138 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright © 2026 Intel Corporation + */ +#include <linux/pci.h> + +#include <drm/drm_drv.h> + +#include "xe_device.h" +#include "xe_gt.h" +#include "xe_pci.h" +#include "xe_ras.h" +#include "xe_survivability_mode.h" +#include "xe_uc.h" + +static pci_ers_result_t ras_action_to_pci_result(enum xe_ras_recovery_action action) +{ + switch (action) { + case XE_RAS_RECOVERY_ACTION_RECOVERED: + return PCI_ERS_RESULT_RECOVERED; + case XE_RAS_RECOVERY_ACTION_RESET: + return PCI_ERS_RESULT_NEED_RESET; + case XE_RAS_RECOVERY_ACTION_DISCONNECT: + return PCI_ERS_RESULT_DISCONNECT; + default: + return PCI_ERS_RESULT_DISCONNECT; + } +} + +static void xe_pci_error_handling(struct pci_dev *pdev) +{ + struct xe_device *xe = pdev_to_xe_device(pdev); + struct xe_gt *gt; + u8 id; + + /* + * Wedge the device to prevent userspace access but don't send the event yet. + * Runtime PM ref is taken by PCI core for the duration of error handling. + */ + xe_device_set_in_reset(xe); + atomic_set(&xe->wedged.flag, 1); + + for_each_gt(gt, xe, id) + xe_gt_declare_wedged(gt); + + pci_disable_device(pdev); +} + +static pci_ers_result_t xe_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state) +{ + struct xe_device *xe = pdev_to_xe_device(pdev); + + dev_err(&pdev->dev, "Xe Pci error recovery: error detected state %d\n", state); + + if (state == pci_channel_io_perm_failure) + return PCI_ERS_RESULT_DISCONNECT; + + /* If the device is already wedged or in survivability mode, do not attempt recovery */ + if (xe_survivability_mode_is_boot_enabled(xe) || xe_device_wedged(xe)) + return PCI_ERS_RESULT_DISCONNECT; + + switch (state) { + case pci_channel_io_normal: + return PCI_ERS_RESULT_CAN_RECOVER; + case pci_channel_io_frozen: + xe_pci_error_handling(pdev); + return PCI_ERS_RESULT_NEED_RESET; + default: + dev_err(&pdev->dev, "Unknown state %d\n", state); + return PCI_ERS_RESULT_NEED_RESET; + } +} + +static pci_ers_result_t xe_pci_error_mmio_enabled(struct pci_dev *pdev) +{ + struct xe_device *xe = pdev_to_xe_device(pdev); + enum xe_ras_recovery_action action; + + dev_err(&pdev->dev, "Xe Pci error recovery: MMIO enabled\n"); + + action = xe_ras_process_errors(xe); + + return ras_action_to_pci_result(action); +} + +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev) +{ + const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, pdev); + struct xe_device *xe = pdev_to_xe_device(pdev); + + dev_err(&pdev->dev, "Xe Pci error recovery: Slot reset\n"); + + pci_restore_state(pdev); + + if (pci_enable_device(pdev)) { + dev_err(&pdev->dev, + "Cannot re-enable PCI device after reset\n"); + return PCI_ERS_RESULT_DISCONNECT; + } + + /* + * Secondary Bus Reset causes all VRAM state to be lost along with + * hardware state. As an initial step, re-probe the device to + * re-initialize the driver and hardware. + * TODO: optimize by re-initializing only the hardware state and re-creating + * kernel BOs. + */ + xe_device_clear_in_reset(xe); + pdev->driver->remove(pdev); + devres_release_group(&pdev->dev, xe->devres_group_id); + + if (pdev->driver->probe(pdev, ent)) + return PCI_ERS_RESULT_DISCONNECT; + + xe = pdev_to_xe_device(pdev); + + /* Wedge the device to prevent I/O operations till the resume callback */ + atomic_set(&xe->wedged.flag, 1); + + return PCI_ERS_RESULT_RECOVERED; +} + +static void xe_pci_error_resume(struct pci_dev *pdev) +{ + struct xe_device *xe = pdev_to_xe_device(pdev); + + dev_info(&pdev->dev, "Xe Pci error recovery: Recovered\n"); + + /* Resume I/O operations */ + atomic_set(&xe->wedged.flag, 0); +} + +const struct pci_error_handlers xe_pci_error_handlers = { + .error_detected = xe_pci_error_detected, + .mmio_enabled = xe_pci_error_mmio_enabled, + .slot_reset = xe_pci_error_slot_reset, + .resume = xe_pci_error_resume, +}; diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c index 4cb16b419b0c..d79f8a6589ac 100644 --- a/drivers/gpu/drm/xe/xe_ras.c +++ b/drivers/gpu/drm/xe/xe_ras.c @@ -3,12 +3,18 @@ * Copyright © 2026 Intel Corporation */ +#include "xe_bo.h" #include "xe_device.h" #include "xe_printk.h" #include "xe_ras.h" #include "xe_ras_types.h" #include "xe_sysctrl.h" #include "xe_sysctrl_event_types.h" +#include "xe_sysctrl_mailbox.h" +#include "xe_sysctrl_mailbox_types.h" + +#define CORE_COMPUTE_UNCORR_TYPE GENMASK(26, 25) +#define GLOBAL_UNCORR_ERROR 2 /* Severity of detected errors */ enum xe_ras_severity { @@ -31,12 +37,25 @@ enum xe_ras_component { XE_RAS_COMP_MAX }; +static const int ras_status_to_errno_map[] = { + [XE_RAS_STATUS_SUCCESS] = 0, + [XE_RAS_STATUS_INVALID_PARAM] = -EINVAL, + [XE_RAS_STATUS_OP_NOT_SUPPORTED] = -EOPNOTSUPP, + [XE_RAS_STATUS_TIMEOUT] = -ETIMEDOUT, + [XE_RAS_STATUS_HARDWARE_FAILURE] = -EIO, + [XE_RAS_STATUS_INSUFFICIENT_RESOURCES] = -ENAVAIL, + [XE_RAS_STATUS_UNKNOWN_ERROR] = -ENODATA +}; + +static_assert(ARRAY_SIZE(ras_status_to_errno_map) == XE_RAS_STATUS_UNKNOWN_ERROR + 1); + static const char *const xe_ras_severities[] = { [XE_RAS_SEV_NOT_SUPPORTED] = "Not Supported", [XE_RAS_SEV_CORRECTABLE] = "Correctable Error", [XE_RAS_SEV_UNCORRECTABLE] = "Uncorrectable Error", [XE_RAS_SEV_INFORMATIONAL] = "Informational Error", }; + static_assert(ARRAY_SIZE(xe_ras_severities) == XE_RAS_SEV_MAX); static const char *const xe_ras_components[] = { @@ -48,6 +67,7 @@ static const char *const xe_ras_components[] = { [XE_RAS_COMP_FABRIC] = "Fabric", [XE_RAS_COMP_SOC_INTERNAL] = "SoC Internal", }; + static_assert(ARRAY_SIZE(xe_ras_components) == XE_RAS_COMP_MAX); static inline const char *sev_to_str(u8 severity) @@ -66,6 +86,296 @@ static inline const char *comp_to_str(u8 component) return xe_ras_components[component]; } +static int ras_status_to_errno(enum xe_ras_response_status status) +{ + if (status > XE_RAS_STATUS_UNKNOWN_ERROR) + status = XE_RAS_STATUS_UNKNOWN_ERROR; + + return ras_status_to_errno_map[status]; +} + +static void prepare_ras_command(struct xe_sysctrl_mailbox_command *command, + u32 cmd_mask, void *request, size_t request_len, + void *response, size_t response_len) +{ + struct xe_sysctrl_app_msg_hdr hdr = {0}; + + hdr.data = FIELD_PREP(APP_HDR_GROUP_ID_MASK, XE_SYSCTRL_GROUP_GFSP) | + FIELD_PREP(APP_HDR_COMMAND_MASK, cmd_mask); + + command->header = hdr; + command->data_in = request; + command->data_in_len = request_len; + command->data_out = response; + command->data_out_len = response_len; +} + +static int send_page_offline(struct xe_device *xe, enum xe_ras_page_action action, u64 page_address) +{ + struct xe_sysctrl_mailbox_command command = {0}; + struct xe_ras_page_offline_request request = {0}; + struct xe_ras_page_offline_response response = {0}; + size_t rlen; + int ret; + + if (!xe->info.has_sysctrl) + return 0; + + if (action >= XE_RAS_PAGE_ACTION_MAX) { + xe_err(xe, "[RAS]: Invalid page offline action %d\n", action); + return -EINVAL; + } + + request.page_address = page_address; + request.action = action; + + prepare_ras_command(&command, XE_SYSCTRL_CMD_PAGE_OFFLINE, &request, + sizeof(request), &response, sizeof(response)); + + ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen); + if (ret) { + xe_err(xe, "sysctrl: failed to send page offline command %d\n", ret); + return ret; + } + + if (rlen != sizeof(response)) { + xe_err(xe, "sysctrl: unexpected page offline response length %zu (expected %zu)\n", + rlen, sizeof(response)); + return -EINVAL; + } + + ret = ras_status_to_errno(response.status); + if (ret) + xe_err(xe, "sysctrl: page offline command failed with status %d\n", + response.status); + + return ret; +} + +static int handle_page_offline(struct xe_device *xe, u64 page_address, bool send_offline_cmd) +{ + enum xe_ras_page_action action; + int ret; + + if (!IS_ALIGNED(page_address, XE_PAGE_SIZE)) { + xe_err(xe, "sysctrl: Unaligned page address: 0x%llx\n", page_address); + return -EINVAL; + } + + /* + * TODO: Call function to handle address fault + * ret = xe_ttm_vram_handle_addr_fault(xe, page_address); + */ + + /* + * Handle return code from address fault handling function: + * 0: Address is valid and can be offlined + * -EIO: Address belongs to a critical BO that cannot be offlined + * -ENXIO: Invalid address + * -EOPNOTSUPP: Address is valid and can be offlined but user policy is not to offline + * + * For any other non-zero error code, skip offlining. + */ + + switch (ret) { + case 0: + action = XE_RAS_PAGE_ACTION_OFFLINE; + break; + /* User policy set to decline page offlining */ + case -EOPNOTSUPP: + action = XE_RAS_PAGE_ACTION_DECLINE; + break; + case -EIO: + xe_err(xe, "[RAS]: Page address belongs to critical BO: 0x%llx\n", + page_address); + return ret; + default: + xe_err(xe, "[RAS]: Failed to handle address fault 0x%llx: %d\n", + page_address, ret); + return 0; + } + + if (send_offline_cmd) { + ret = send_page_offline(xe, action, page_address); + if (ret) + xe_err(xe, "sysctrl: Failed to offline page for address 0x%llx: %d\n", + page_address, ret); + return ret; + } + + return 0; +} + +static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device *xe, + struct xe_ras_error_array *arr) +{ + struct xe_ras_compute_error *error_info = (struct xe_ras_compute_error *)arr->error_details; + u8 uncorr_type; + + uncorr_type = FIELD_GET(CORE_COMPUTE_UNCORR_TYPE, error_info->error_log_header); + + /* Request a reset if error is global */ + if (uncorr_type == GLOBAL_UNCORR_ERROR) + return XE_RAS_RECOVERY_ACTION_RESET; + + /* Local errors are recovered using an engine reset by GuC */ + return XE_RAS_RECOVERY_ACTION_RECOVERED; +} + +static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe, + struct xe_ras_error_array *arr) +{ + struct xe_ras_soc_error *error_info = (struct xe_ras_soc_error *)arr->error_details; + struct xe_ras_soc_error_source *source = &error_info->error_source; + struct xe_ras_error_class *error_class = &arr->error_class; + u8 tile_id = error_class->product.unit.tile; + struct xe_tile *tile; + + if (tile_id >= xe->info.tile_count) { + xe_err(xe, "sysctrl: SOC internal error reported from invalid tile %u\n", tile_id); + return XE_RAS_RECOVERY_ACTION_RESET; + } + + tile = &xe->tiles[tile_id]; + + if (source->csc) { + struct xe_ras_csc_error *csc_error = + (struct xe_ras_csc_error *)error_info->additional_details; + + /* + * CSC uncorrectable errors are classified as hardware errors and firmware errors. + * CSC firmware errors are critical errors that can be recovered only by firmware + * update via SPI driver. On a CSC firmware error, PCODE enables FDO mode and sets + * the bit in the capability register. On receiving this error, the driver enables + * runtime survivability mode which notifies userspace that a firmware update + * is required. + */ + if (csc_error->hec_uncorr_fw_err_dw0) { + xe_err(xe, "[RAS]: CSC %s detected: 0x%x\n", + sev_to_str(error_class->common.severity), + csc_error->hec_uncorr_fw_err_dw0); + schedule_work(&tile->csc_hw_error_work); + return XE_RAS_RECOVERY_ACTION_DISCONNECT; + } + } else if (source->ieh) { + struct xe_ras_ieh_error *ieh_error = + (struct xe_ras_ieh_error *)error_info->additional_details; + + if (ieh_error->global_error_status & XE_RAS_SOC_IEH_PUNIT) { + xe_err(xe, "[RAS]: PUNIT %s detected: 0x%x\n", + sev_to_str(error_class->common.severity), + ieh_error->global_error_status); + /* TODO: Add PUNIT error handling */ + return XE_RAS_RECOVERY_ACTION_DISCONNECT; + } + } + + /* For other SOC internal errors, request a reset as recovery mechanism */ + return XE_RAS_RECOVERY_ACTION_RESET; +} + +static enum xe_ras_recovery_action handle_device_memory_errors(struct xe_device *xe, + struct xe_ras_error_array *arr) +{ + struct xe_ras_memory_error *error_info = (struct xe_ras_memory_error *)arr->error_details; + int ret; + + if (error_info->category & XE_RAS_MEMORY_ECC) { + xe_err(xe, "[RAS]: double-bit ECC error detected at sw address 0x%llx\n", + error_info->sw_address); + ret = handle_page_offline(xe, error_info->sw_address, true); + if (!ret) + return XE_RAS_RECOVERY_ACTION_RECOVERED; + } + + /* Request a reset for other device memory errors and if page offlining failed */ + return XE_RAS_RECOVERY_ACTION_RESET; +} + +static void get_queued_pages(struct xe_device *xe) +{ + struct xe_sysctrl_mailbox_command command = {0}; + struct xe_ras_page_offline_queue response = {0}; + u32 count = 0; + size_t rlen; + int ret, i; + + /* Supported only on platforms with system controller */ + if (!xe->info.has_sysctrl) + return; + + prepare_ras_command(&command, XE_SYSCTRL_CMD_GET_OFFLINE_QUEUE, NULL, 0, + &response, sizeof(response)); + + do { + memset(&response, 0, sizeof(response)); + + ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen); + if (ret) { + xe_err(xe, "sysctrl: failed to get page offline queue %d\n", ret); + return; + } + + if (rlen != sizeof(response)) { + xe_err(xe, "sysctrl: unexpected page offline queue response length %zu (expected %zu)\n", + rlen, sizeof(response)); + return; + } + + for (i = 0; i < response.pages_returned && i < XE_RAS_NUM_PAGES; i++) + handle_page_offline(xe, response.page_addresses[i], true); + + count += response.pages_returned; + if (count > response.total_pages) { + xe_err(xe, "sysctrl: Pages returned from queue exceed total pages %u, returned %u\n", + response.total_pages, count); + return; + } + } while (response.additional_data); +} + +static void get_offlined_list(struct xe_device *xe) +{ + struct xe_sysctrl_mailbox_command command = {0}; + struct xe_ras_page_offline_list response = {0}; + u32 count = 0; + size_t rlen; + int ret, i; + + /* Supported only on platforms with system controller */ + if (!xe->info.has_sysctrl) + return; + + prepare_ras_command(&command, XE_SYSCTRL_CMD_GET_OFFLINE_LIST, NULL, 0, + &response, sizeof(response)); + + do { + memset(&response, 0, sizeof(response)); + + ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen); + if (ret) { + xe_err(xe, "sysctrl: failed to get page offline list %d\n", ret); + return; + } + + if (rlen != sizeof(response)) { + xe_err(xe, "sysctrl: unexpected page offline list response length %zu (expected %zu)\n", + rlen, sizeof(response)); + return; + } + + for (i = 0; i < response.pages_returned && i < XE_RAS_NUM_PAGES; i++) + handle_page_offline(xe, response.page_addresses[i], false); + + count += response.pages_returned; + if (count > response.total_pages) { + xe_err(xe, "sysctrl: Pages returned from list exceed total pages %u, returned %u\n", + response.total_pages, count); + return; + } + } while (response.additional_data); +} + void xe_ras_counter_threshold_crossed(struct xe_device *xe, struct xe_sysctrl_event_response *response) { @@ -91,3 +401,186 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe, comp_to_str(component), sev_to_str(severity)); } } + +/** + * xe_ras_process_errors() - Process and contain hardware errors + * @xe: xe device instance + * + * Get error details from system controller and return recovery + * method. Called only from PCI error handling. + * + * Returns: recovery action to be taken + */ +enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe) +{ + struct xe_sysctrl_mailbox_command command = {0}; + struct xe_ras_get_soc_error response; + enum xe_ras_recovery_action final_action; + u32 count = XE_SYSCTRL_FLOOD; + size_t rlen; + int ret; + + if (!xe->info.has_sysctrl) + return XE_RAS_RECOVERY_ACTION_RESET; + + /* Default action */ + final_action = XE_RAS_RECOVERY_ACTION_RECOVERED; + + prepare_ras_command(&command, XE_SYSCTRL_CMD_GET_SOC_ERROR, NULL, 0, + &response, sizeof(response)); + + do { + memset(&response, 0, sizeof(response)); + + ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen); + if (ret) { + xe_err(xe, "sysctrl: failed to get soc error %d\n", ret); + goto err; + } + + if (rlen != sizeof(response)) { + xe_err(xe, "sysctrl: unexpected get soc error response length %zu (expected %zu)\n", + rlen, sizeof(response)); + goto err; + } + + /* Report if number of errors exceeds the maximum errors supported */ + if (response.num_errors > XE_RAS_NUM_ERROR_ARR) + xe_err(xe, "sysctrl: number of errors received %d out of bound (%d)\n", + response.num_errors, XE_RAS_NUM_ERROR_ARR); + + for (int i = 0; i < response.num_errors && i < XE_RAS_NUM_ERROR_ARR; i++) { + struct xe_ras_error_array *arr = &response.error_arr[i]; + enum xe_ras_recovery_action action; + struct xe_ras_error_class error_class; + u8 component, severity; + + error_class = arr->error_class; + component = error_class.common.component; + severity = error_class.common.severity; + + xe_err(xe, "[RAS]: %s %s detected\n", comp_to_str(component), + sev_to_str(severity)); + + switch (component) { + case XE_RAS_COMP_CORE_COMPUTE: + action = handle_core_compute_errors(xe, arr); + break; + case XE_RAS_COMP_SOC_INTERNAL: + action = handle_soc_internal_errors(xe, arr); + break; + case XE_RAS_COMP_DEVICE_MEMORY: + action = handle_device_memory_errors(xe, arr); + break; + default: + /* For any other component, reset */ + action = XE_RAS_RECOVERY_ACTION_RESET; + break; + } + + /* Process and log all errors and then trigger highest recovery action */ + if (action > final_action) + final_action = action; + } + + /* Treat flooding as an system controller error */ + if (!--count) { + xe_err(xe, "[RAS]: sysctrl: get soc error response flooding\n"); + return XE_RAS_RECOVERY_ACTION_RESET; + } + + } while (response.additional_errors); + + return final_action; + +err: + return XE_RAS_RECOVERY_ACTION_RESET; +} + +#ifdef CONFIG_PCIEAER +static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe) +{ + struct pci_dev *pdev = to_pci_dev(xe->drm.dev); + struct pci_dev *vsp, *usp; + u32 aer_uncorr_mask, aer_uncorr_sev, aer_uncorr_status; + u16 aer_cap; + + /* + * Device Hierarchy: + * + * Upstream Switch Port (USP)--> Virtual Switch Port (VSP)--> SGunit (GPU endpoint) + */ + vsp = pci_upstream_bridge(pdev); + if (!vsp) + return; + + usp = pci_upstream_bridge(vsp); + if (!usp) + return; + + aer_cap = usp->aer_cap; + + if (!aer_cap) { + dev_info(&usp->dev, "USP doesn't support AER capability\n"); + return; + } + + /* + * Clear any stale Uncorrectable Internal Error Status event in Uncorrectable Error + * Status Register. + */ + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, &aer_uncorr_status); + if (aer_uncorr_status & PCI_ERR_UNC_INTN) + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN); + + /* + * All errors are steered to USP which is a PCIe AER Compliant device. + * Downgrade all the errors to non-fatal to prevent PCIe bus driver + * from triggering a Secondary Bus Reset (SBR). This allows error + * detection, containment and recovery in the driver. + * + * The Uncorrectable Error Severity Register has the 'Uncorrectable + * Internal Error Severity' set to fatal by default. Set this to + * non-fatal and unmask the error. + */ + + /* Initialize Uncorrectable Error Severity Register */ + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev); + aer_uncorr_sev &= ~PCI_ERR_UNC_INTN; + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev); + + /* Initialize Uncorrectable Error Mask Register */ + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask); + aer_uncorr_mask &= ~PCI_ERR_UNC_INTN; + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask); + + pci_save_state(usp); + dev_dbg(&usp->dev, "Uncorrectable Internal Errors downgraded and unmasked\n"); +} +#endif + +/** + * xe_ras_init - Initialize Xe RAS + * @xe: xe device instance + * + * Initialize Xe RAS + */ +void xe_ras_init(struct xe_device *xe) +{ + if (!xe->info.has_sysctrl || IS_SRIOV_VF(xe)) + return; + +#ifdef CONFIG_PCIEAER + aer_unmask_and_downgrade_internal_error(xe); +#endif + + get_queued_pages(xe); + get_offlined_list(xe); + + /* + * On init, process and log any errors detected by firmware before driver load. + * Critical errors such as Punit, CSC are reported through PCode init failure, + * causing the driver to enter survivability mode. + */ + xe_ras_process_errors(xe); +} diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h index ea90593b62dc..cdaad3114dae 100644 --- a/drivers/gpu/drm/xe/xe_ras.h +++ b/drivers/gpu/drm/xe/xe_ras.h @@ -6,10 +6,13 @@ #ifndef _XE_RAS_H_ #define _XE_RAS_H_ +#include "xe_ras_types.h" + struct xe_device; struct xe_sysctrl_event_response; void xe_ras_counter_threshold_crossed(struct xe_device *xe, struct xe_sysctrl_event_response *response); - +void xe_ras_init(struct xe_device *xe); +enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe); #endif diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h index 4e63c67f806a..3ec64b4f5a17 100644 --- a/drivers/gpu/drm/xe/xe_ras_types.h +++ b/drivers/gpu/drm/xe/xe_ras_types.h @@ -8,7 +8,63 @@ #include <linux/types.h> +#define XE_RAS_NUM_ERROR_ARR 3 #define XE_RAS_NUM_COUNTERS 16 +#define XE_RAS_SOC_IEH_PUNIT BIT(1) +#define XE_RAS_MEMORY_ECC BIT(1) +#define XE_RAS_NUM_PAGES 25 + +/** + * enum xe_ras_recovery_action - RAS recovery actions + * + * @XE_RAS_RECOVERY_ACTION_RECOVERED: Error recovered + * @XE_RAS_RECOVERY_ACTION_RESET: Requires reset + * @XE_RAS_RECOVERY_ACTION_DISCONNECT: Requires disconnect + * @XE_RAS_RECOVERY_ACTION_MAX: Max action value + * + * This enum defines the possible recovery actions that can be taken in response + * to RAS errors. + */ +enum xe_ras_recovery_action { + XE_RAS_RECOVERY_ACTION_RECOVERED = 0, + XE_RAS_RECOVERY_ACTION_RESET, + XE_RAS_RECOVERY_ACTION_DISCONNECT, + XE_RAS_RECOVERY_ACTION_MAX +}; + +/** + * enum xe_ras_page_action - Page offline actions for page offline request + * + * @XE_RAS_PAGE_ACTION_OFFLINE: Instruct firmware to remove page from queue + * @XE_RAS_PAGE_ACTION_DECLINE: Instruct firmware to mark page as not offline + * @XE_RAS_PAGE_ACTION_MAX: Max value for validation + */ +enum xe_ras_page_action { + XE_RAS_PAGE_ACTION_OFFLINE, + XE_RAS_PAGE_ACTION_DECLINE, + XE_RAS_PAGE_ACTION_MAX +}; + +/** + * enum xe_ras_response_status - RAS response status codes + * + * @XE_RAS_STATUS_SUCCESS: Operation successful + * @XE_RAS_STATUS_INVALID_PARAM: Invalid parameter + * @XE_RAS_STATUS_OP_NOT_SUPPORTED: Operation not supported + * @XE_RAS_STATUS_TIMEOUT: Operation timed out + * @XE_RAS_STATUS_HARDWARE_FAILURE: Hardware failure + * @XE_RAS_STATUS_INSUFFICIENT_RESOURCES: Insufficient resources + * @XE_RAS_STATUS_UNKNOWN_ERROR: Unknown error + */ +enum xe_ras_response_status { + XE_RAS_STATUS_SUCCESS = 0, + XE_RAS_STATUS_INVALID_PARAM, + XE_RAS_STATUS_OP_NOT_SUPPORTED, + XE_RAS_STATUS_TIMEOUT, + XE_RAS_STATUS_HARDWARE_FAILURE, + XE_RAS_STATUS_INSUFFICIENT_RESOURCES, + XE_RAS_STATUS_UNKNOWN_ERROR +}; /** * struct xe_ras_error_common - Error fields that are common across all products @@ -70,4 +126,163 @@ struct xe_ras_threshold_crossed { struct xe_ras_error_class counters[XE_RAS_NUM_COUNTERS]; } __packed; +/** + * struct xe_ras_error_array - Details of the error types + */ +struct xe_ras_error_array { + /** @counter_value: Counter value of the returned error */ + u32 counter_value; + /** @error_class: Error class */ + struct xe_ras_error_class error_class; + /** @timestamp: Timestamp */ + u64 timestamp; + /** @error_details: Error details specific to the class */ + u32 error_details[XE_RAS_NUM_COUNTERS]; +} __packed; + +/** + * struct xe_ras_get_soc_error - Response from get soc error command + */ +struct xe_ras_get_soc_error { + /** @num_errors: Number of errors reported in this response */ + u8 num_errors; + /** @additional_errors: Indicates if the errors are pending */ + u8 additional_errors; + /** @error_arr: Array of up to 3 errors */ + struct xe_ras_error_array error_arr[XE_RAS_NUM_ERROR_ARR]; +} __packed; + +/** + * struct xe_ras_compute_error - Error details of Core Compute error + */ +struct xe_ras_compute_error { + /** @error_log_header: Error Source and type */ + u32 error_log_header; + /** @reserved: Reserved */ + u32 reserved[15]; +} __packed; + +/** + * struct xe_ras_soc_error_source - Source of SoC error + */ +struct xe_ras_soc_error_source { + /** @csc: CSC */ + u32 csc:1; + /** @ieh: IEH (Integrated Error Handler) */ + u32 ieh:1; + /** @reserved: Reserved for future use */ + u32 reserved:30; +} __packed; + +/** + * struct xe_ras_soc_error - Error details of SoC internal error + */ +struct xe_ras_soc_error { + /** @error_source: Error source */ + struct xe_ras_soc_error_source error_source; + /** @additional_details: Additional details */ + u32 additional_details[15]; +} __packed; + +/** + * struct xe_ras_csc_error - CSC error details + */ +struct xe_ras_csc_error { + /** @hec_uncorr_err_status: CSC hardware error status */ + u32 hec_uncorr_err_status; + /** @hec_uncorr_fw_err_dw0: CSC firmware error */ + u32 hec_uncorr_fw_err_dw0; +} __packed; + +/** + * struct xe_ras_ieh_error - SoC IEH (Integrated Error Handler) error details + */ +struct xe_ras_ieh_error { + /** @ieh_instance: IEH instance */ + u32 ieh_instance:2; + /** @reserved: Reserved for future use */ + u32 reserved:30; + /** @global_error_status: Global error status */ + u32 global_error_status; + /** @local_error_status: Local error status */ + u32 local_error_status; + /** @gerr_mask: Global error mask */ + u32 gerr_mask; + /** @additional_info: Additional information */ + u32 additional_info[10]; +} __packed; + +/** + * struct xe_ras_memory_error - Device memory error details + */ +struct xe_ras_memory_error { + /** @category: Device memory error category */ + u8 category; + /** @reserved: Reserved for future use */ + u8 reserved[7]; + /** @hardware_address: Hardware physical address details */ + u64 hardware_address; + /** @sw_address: Software address where error occurred */ + u64 sw_address; + /** @reserved2: Reserved for future use */ + u32 reserved2[10]; +} __packed; + +/** + * struct xe_ras_page_offline_list - Response from get offline list command + */ +struct xe_ras_page_offline_list { + /** @max_entries: Total no of pages that can be stored in flash */ + u32 max_entries; + /** @total_pages: Total number of permanently offlined pages */ + u32 total_pages; + /** @pages_returned: Number of pages returned in this response */ + u32 pages_returned; + /** @page_addresses: Array of permanently offlined page addresses (4KB aligned) */ + u64 page_addresses[XE_RAS_NUM_PAGES]; + /** @additional_data: Indicates if more data is available */ + u8 additional_data; + /** @reserved: Reserved for future use */ + u8 reserved[3]; +} __packed; + +/** + * struct xe_ras_page_offline_queue - Response from get offline queue command + */ +struct xe_ras_page_offline_queue { + /** @total_pages: Total number of queued pages */ + u32 total_pages; + /** @pages_returned: Number of pages returned in this response */ + u32 pages_returned; + /** @page_addresses: Array of page addresses (4KB aligned) */ + u64 page_addresses[XE_RAS_NUM_PAGES]; + /** @additional_data: Indicates if more data is available */ + u8 additional_data; + /** @reserved: Reserved for future use */ + u8 reserved[3]; +} __packed; + +/** + * struct xe_ras_page_offline_request - Request for page offline command + * + * This structure provides the request format to offline/decline a page + */ +struct xe_ras_page_offline_request { + /** @page_address: Page address (4KB aligned) */ + u64 page_address; + /** @action: Action to be performed, see &enum xe_ras_page_action */ + u32 action; + /** @reserved: Reserved for future use */ + u32 reserved; +} __packed; + +/** + * struct xe_ras_page_offline_response - Response from page offline command + */ +struct xe_ras_page_offline_response { + /** @status: Status of the page offline request, see &enum xe_ras_response_status */ + u32 status; + /** @reserved: Reserved for future use */ + u32 reserved; +} __packed; #endif diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c index 427afd144f3a..4c506027fa94 100644 --- a/drivers/gpu/drm/xe/xe_survivability_mode.c +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c @@ -54,7 +54,6 @@ * # cat /sys/bus/pci/devices/<device>/survivability_mode * Boot * - * * Any additional debug information if present will be visible under the directory * ``survivability_info``:: * @@ -98,6 +97,15 @@ * # cat /sys/bus/pci/devices/<device>/survivability_mode * Runtime * + * On some CSC firmware errors, PCODE sets FDO mode and the only recovery possible is through + * firmware flash using SPI driver. Userspace can check if FDO mode is set by checking the below + * sysfs entry. + * + * .. code-block:: shell + * + * # cat /sys/bus/pci/devices/<device>/survivability_info/fdo_mode + * enabled + * * When such errors occur, userspace is notified with the drm device wedged uevent and runtime * survivability mode. User can then initiate a firmware flash using userspace tools like fwupd * to restore device to normal operation. @@ -296,7 +304,8 @@ static int create_survivability_sysfs(struct pci_dev *pdev) if (ret) return ret; - if (check_boot_failure(xe)) { + /* Survivability info is not required if enabled via configfs */ + if (!xe_configfs_get_survivability_mode(pdev)) { ret = devm_device_add_group(dev, &survivability_info_group); if (ret) return ret; diff --git a/drivers/gpu/drm/xe/xe_sysctrl_event.c b/drivers/gpu/drm/xe/xe_sysctrl_event.c index b4d17329af6c..faf6ba89ce98 100644 --- a/drivers/gpu/drm/xe/xe_sysctrl_event.c +++ b/drivers/gpu/drm/xe/xe_sysctrl_event.c @@ -16,7 +16,7 @@ static void get_pending_event(struct xe_sysctrl *sc, struct xe_sysctrl_mailbox_c { struct xe_sysctrl_event_response *response = command->data_out; struct xe_device *xe = sc_to_xe(sc); - u32 count = XE_SYSCTRL_EVENT_FLOOD; + u32 count = XE_SYSCTRL_FLOOD; size_t len; int ret; diff --git a/drivers/gpu/drm/xe/xe_sysctrl_event_types.h b/drivers/gpu/drm/xe/xe_sysctrl_event_types.h index c16c66b9fa7f..d236e22fe9dd 100644 --- a/drivers/gpu/drm/xe/xe_sysctrl_event_types.h +++ b/drivers/gpu/drm/xe/xe_sysctrl_event_types.h @@ -11,7 +11,7 @@ #define XE_SYSCTRL_EVENT_DATA_LEN 59 /* Modify as needed */ -#define XE_SYSCTRL_EVENT_FLOOD 16 +#define XE_SYSCTRL_FLOOD 16 /** * enum xe_sysctrl_event - Events reported by System Controller diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h index 84d7c647e743..12ffd011ee8e 100644 --- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h +++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h @@ -22,10 +22,18 @@ enum xe_sysctrl_group { /** * enum xe_sysctrl_gfsp_cmd - Commands supported by GFSP group * + * @XE_SYSCTRL_CMD_GET_SOC_ERROR: Retrieve basic error information * @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event + * @XE_SYSCTRL_CMD_PAGE_OFFLINE: Instruct firmware to offline/decline a page + * @XE_SYSCTRL_CMD_GET_OFFLINE_LIST: Retrieve list of all offlined pages from flash + * @XE_SYSCTRL_CMD_GET_OFFLINE_QUEUE: Retrieve list of offlined queued pages from firmware */ enum xe_sysctrl_gfsp_cmd { + XE_SYSCTRL_CMD_GET_SOC_ERROR = 0x01, XE_SYSCTRL_CMD_GET_PENDING_EVENT = 0x07, + XE_SYSCTRL_CMD_PAGE_OFFLINE = 0x08, + XE_SYSCTRL_CMD_GET_OFFLINE_LIST = 0x09, + XE_SYSCTRL_CMD_GET_OFFLINE_QUEUE = 0x0A, }; /** @@ -48,6 +56,9 @@ struct xe_sysctrl_mailbox_command { size_t data_out_len; }; +/* Modify as needed */ +#define XE_SYSCTRL_FLOOD 16 + #define XE_SYSCTRL_MB_FRAME_SIZE 16 #define XE_SYSCTRL_MB_MAX_FRAMES 64 #define XE_SYSCTRL_MB_MAX_MESSAGE_SIZE \ -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: Introduce Xe Uncorrectable Error Handling 2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Patch Review *Marked "DO NOT REVIEW. COMPILATION ONLY" — flagging issues for completeness since this code does compile and execute.* **BUG: Uninitialized variable in `handle_page_offline()`** ```c static int handle_page_offline(struct xe_device *xe, u64 page_address, bool send_offline_cmd) { enum xe_ras_page_action action; int ret; ... /* * TODO: Call function to handle address fault * ret = xe_ttm_vram_handle_addr_fault(xe, page_address); */ switch (ret) { ``` `ret` is never assigned before the `switch`. The TODO function call is commented out, but the code that consumes its return value is live. This is undefined behavior — whatever happens to be on the stack determines the control flow. The compiler should warn about this. **Wedge API bypass in `xe_pci_error_handling()`** ```c static void xe_pci_error_handling(struct pci_dev *pdev) { ... xe_device_set_in_reset(xe); atomic_set(&xe->wedged.flag, 1); for_each_gt(gt, xe, id) xe_gt_declare_wedged(gt); ``` This bypasses `xe_device_declare_wedged()` and directly sets the flag. `xe_gt_declare_wedged()` asserts `xe_gt_assert(gt, gt_to_xe(gt)->wedged.mode)` at `xe_gt.c:1190`. If `wedged.mode` is `XE_WEDGED_MODE_NEVER` (0), this assertion fires. The normal `xe_device_declare_wedged()` path checks for this and returns early. This error-handling path should respect that check or explicitly set `wedged.mode` before calling `xe_gt_declare_wedged()`. **Duplicate macro definition** `XE_SYSCTRL_FLOOD` is defined as `16` in both `xe_sysctrl_event_types.h` and `xe_sysctrl_mailbox_types.h`. It should be in one place only — likely `xe_sysctrl_mailbox_types.h` since that's where the mailbox infrastructure lives, and `xe_sysctrl_event.c` can include that header. **Missing header declaration** ```c extern const struct pci_error_handlers xe_pci_error_handlers; ``` This `extern` in `xe_pci.c` should be in a header (e.g., `xe_pci.h` or a new `xe_pci_error.h`) so the definition in `xe_pci_error.c` and the usage in `xe_pci.c` share one declaration. A bare `extern` in a `.c` file is fragile — if the type ever changes, the compiler can't catch the mismatch. **`xe_pci_error_slot_reset()` — unchecked `pci_match_id()` return** ```c const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, pdev); ``` `pci_match_id()` can return NULL. If it does, `pdev->driver->probe(pdev, ent)` passes NULL as the `ent` argument, which will likely crash. While this shouldn't happen during normal error recovery (the device already matched), defensive code should check. --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi 2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi @ 2026-05-20 11:33 ` Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 3/5] drm/doc: Document " Mallesh Koujalagi ` (3 subsequent siblings) 5 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi Introduce DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) recovery method to handle scenarios requiring device power cycle. This method addresses cases where other recovery mechanisms (driver reload, PCIe reset, etc.) are insufficient to restore device functionality. When set, it indicates to userspace that only device power cycle can recover the device from its current error state. Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> --- v3: - Update any scenario that requires cold-reset. (Riana) v4: - Rename cold reset to power cyclce. (Raag) v5: - Make it consistent with consumer expectations. (Raag) v6: - Drop "last resort" wording. (Riana) --- drivers/gpu/drm/drm_drv.c | 2 ++ include/drm/drm_device.h | 1 + 2 files changed, 3 insertions(+) diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c index 675675480da4..19c25366635e 100644 --- a/drivers/gpu/drm/drm_drv.c +++ b/drivers/gpu/drm/drm_drv.c @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt) return "bus-reset"; case DRM_WEDGE_RECOVERY_VENDOR: return "vendor-specific"; + case DRM_WEDGE_RECOVERY_COLD_RESET: + return "cold-reset"; default: return NULL; } diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h index 768a8dae83c5..75f030d027ee 100644 --- a/include/drm/drm_device.h +++ b/include/drm/drm_device.h @@ -37,6 +37,7 @@ struct pci_controller; #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */ #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */ #define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific recovery method */ +#define DRM_WEDGE_RECOVERY_COLD_RESET BIT(4) /* remove device + slot power cycle + rescan */ /** * struct drm_wedge_task_info - information about the guilty task of a wedge dev -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method 2026-05-20 11:33 ` [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Patch Review Clean and minimal. The new `BIT(4)` fits the existing pattern and the string mapping in `drm_get_wedge_recovery()` is correct. ```c #define DRM_WEDGE_RECOVERY_COLD_RESET BIT(4) /* remove device + slot power cycle + rescan */ ``` No issues. --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v6 3/5] drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi 2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi 2026-05-20 11:33 ` [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi @ 2026-05-20 11:33 ` Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi ` (2 subsequent siblings) 5 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi When ``WEDGED=cold-reset`` is sent, it indicates that the device has encountered an error condition that cannot be resolved through other recovery methods such as driver rebind or bus reset, and requires a complete device power cycle to restore functionality. Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> --- v2: - Add several instead of number to avoid update. (Jani) v3: - Update document with generic scenario. (Riana) - Consistent with terminology. (Raag) - Remove already covered information. v4: - Update doc. (Raag/Riana) - Change commit message. v5: - Update commit message. (Raag) - Remove unbind. - Simplify cold-reset script. v6: - Look up the hotplug slot in DEVPATH instead of scanning every PCI slot on the system. (Raag) - Drop arbitrary sleep values from the example script. --- Documentation/gpu/drm-uapi.rst | 60 +++++++++++++++++++++++++++++++++- 1 file changed, 59 insertions(+), 1 deletion(-) diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst index 2c2f939322fb..665c66782504 100644 --- a/Documentation/gpu/drm-uapi.rst +++ b/Documentation/gpu/drm-uapi.rst @@ -424,7 +424,7 @@ needed. Recovery -------- -Current implementation defines four recovery methods, out of which, drivers +Current implementation defines several recovery methods, out of which, drivers can use any one, multiple or none. Method(s) of choice will be sent in the uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to more side-effects. See the section `Vendor Specific Recovery`_ @@ -441,6 +441,7 @@ following expectations. rebind unbind + bind driver bus-reset unbind + bus reset/re-enumeration + bind vendor-specific vendor specific recovery method + cold-reset remove device + slot power cycle + rescan unknown consumer policy =============== ======================================== @@ -453,6 +454,17 @@ debug purpose in order to root cause the hang. This is useful because the first hang is usually the most critical one which can result in consequential hangs or complete wedging. +Cold Reset Recovery +------------------- + +When ``WEDGED=cold-reset`` is sent, it indicates that the device has +encountered an error condition that cannot be resolved through other +recovery methods such as driver rebind or bus reset, and requires a complete +device power cycle to restore functionality. + +This method is used by devices that are plugged directly into the PCIe slot +which supports removing the power. + Vendor Specific Recovery ------------------------ @@ -530,6 +542,52 @@ Recovery script:: echo -n $DEVICE > $DRIVER/unbind echo -n $DEVICE > $DRIVER/bind +Example - cold-reset +-------------------- + +Udev rule:: + + SUBSYSTEM=="drm", ENV{WEDGED}=="cold-reset", DEVPATH=="*/drm/card[0-9]", + RUN+="/path/to/cold-reset.sh $env{DEVPATH}" + +Recovery script:: + + #!/bin/sh + + [ -z "$1" ] && echo "Usage: $0 <device-path>" && exit 1 + + # Get device + DEVPATH=$(readlink -f /sys/$1/device 2>/dev/null || readlink -f /sys/$1) + DEVICE=$(basename $DEVPATH) + + echo "Cold reset: $DEVICE" + + # The PCI core exposes a 'slot' symlink on the device that sits in a + # registered hotplug slot. Use it directly instead of scanning every + # slot on the system. + SLOT="" + if [ -L "$DEVPATH/slot" ]; then + SLOT=$(basename "$(readlink -f "$DEVPATH/slot")") + fi + + if [ -n "$SLOT" ]; then + echo "Using slot $SLOT" + + # Remove device + echo 1 > /sys/bus/pci/devices/$DEVICE/remove + + # Power cycle slot. A platform-specific settle delay may be required + # between power off and power on; tune to the hardware as needed. + echo 0 > /sys/bus/pci/slots/$SLOT/power + echo 1 > /sys/bus/pci/slots/$SLOT/power + + # Rescan + echo 1 > /sys/bus/pci/rescan + echo "Done!" + else + echo "No slot found" + fi + Customization ------------- -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method 2026-05-20 11:33 ` [PATCH v6 3/5] drm/doc: Document " Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Patch Review Good documentation. The example script is practical and the v6 improvement (looking up the slot via `DEVPATH/slot` symlink instead of scanning all slots) is a real improvement. Minor nits: - The script variables (`$DEVICE`, `$DEVPATH`) are unquoted in most uses. PCI paths don't contain spaces so this is fine in practice, but the leading `readlink -f /sys/$1/device` on line 1 should be `readlink -f "/sys/$1/device"` for robustness. - The `||` fallback in the script is slightly confusing: ``` DEVPATH=$(readlink -f /sys/$1/device 2>/dev/null || readlink -f /sys/$1) ``` If the first `readlink` fails (not a card subdirectory), it falls back to `readlink -f /sys/$1`. This seems intentional but isn't documented. These are minor — the doc serves its purpose well. --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi ` (2 preceding siblings ...) 2026-05-20 11:33 ` [PATCH v6 3/5] drm/doc: Document " Mallesh Koujalagi @ 2026-05-20 11:33 ` Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: Introduce cold reset recovery method Claude Code Review Bot 5 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi When PUNIT (power management unit) errors are detected that persist across warm resets, mark the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET and notify userspace that a complete device power cycle is required to restore normal operation. Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> --- v3: - Use PUNIT instead of PMU. (Riana) - Use consistent wordingi. - Remove log. (Raag) v4: - Make function static. (Raag) v5: - Remove kdoc for static function. (Raag) - Remove xe_ prefix for static function. --- drivers/gpu/drm/xe/xe_ras.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c index d79f8a6589ac..604470565bf3 100644 --- a/drivers/gpu/drm/xe/xe_ras.c +++ b/drivers/gpu/drm/xe/xe_ras.c @@ -4,6 +4,8 @@ */ #include "xe_bo.h" +#include "xe_assert.h" +#include "xe_device_types.h" #include "xe_device.h" #include "xe_printk.h" #include "xe_ras.h" @@ -222,6 +224,12 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device * return XE_RAS_RECOVERY_ACTION_RECOVERED; } +static void punit_error_handler(struct xe_device *xe) +{ + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); + xe_device_declare_wedged(xe); +} + static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe, struct xe_ras_error_array *arr) { @@ -265,7 +273,7 @@ static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device * xe_err(xe, "[RAS]: PUNIT %s detected: 0x%x\n", sev_to_str(error_class->common.severity), ieh_error->global_error_status); - /* TODO: Add PUNIT error handling */ + punit_error_handler(xe); return XE_RAS_RECOVERY_ACTION_DISCONNECT; } } -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-05-20 11:33 ` [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Patch Review The core change is correct: ```c static void punit_error_handler(struct xe_device *xe) { xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); xe_device_declare_wedged(xe); } ``` This properly uses the wedge API (unlike the bypass in patch 1's PCI error handling path). The TODO comment it replaces is cleanly removed. **Unnecessary includes:** ```c +#include "xe_assert.h" +#include "xe_device_types.h" ``` Neither `xe_assert.h` nor `xe_device_types.h` appear to be needed by the changes in this patch. `xe_device.h` (already included) transitively provides `xe_device_types.h`, and nothing in this diff uses `xe_assert`. These may have been added speculatively for patch 5, but each patch should only add headers it actually needs. If patch 5 needs them, add them there. --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi ` (3 preceding siblings ...) 2026-05-20 11:33 ` [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi @ 2026-05-20 11:33 ` Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-25 11:57 ` Claude review: Introduce cold reset recovery method Claude Code Review Bot 5 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-20 11:33 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi A PUNIT (power management unit) error recovery on GPUs triggers a power cycle (cold reset). On platforms where the upstream port is not hotplug capable, the brief link drop caused by powering the device off and back on is reported by hardware as a Surprise Link Down (SLD), which AER then escalates as an Uncorrectable Fatal Error. That error fires before the device finishes coming back up and defeats the very recovery we are attempting. To keep the expected, recovery-induced link drop from being raised as a fatal AER event, mask the Surprise Link Down bit (PCI_ERR_UNC_SURPDN) in the upstream port's AER Uncorrectable Error Mask register before punit_error_handler() requests the cold reset. The mask is only applied when the slot is not hotplug capable. Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> --- v6: - Expand commit message to explain why SUR_DN is masked. (Raag/Riana) - Check Slot Implemented bit before reading Slot Capabilities, per PCIe spec. (Riana) - Add debug log. --- drivers/gpu/drm/xe/xe_ras.c | 66 +++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c index 604470565bf3..f1a6d7b23c93 100644 --- a/drivers/gpu/drm/xe/xe_ras.c +++ b/drivers/gpu/drm/xe/xe_ras.c @@ -224,8 +224,74 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device * return XE_RAS_RECOVERY_ACTION_RECOVERED; } +#ifdef CONFIG_PCIEAER +static bool pcie_slot_is_hotplug_capable(struct pci_dev *usp) +{ + struct pci_dev *root_port = pci_upstream_bridge(usp); + u32 sltcap; + u16 flags; + + if (!root_port) + return false; + + /* + * Per PCIe spec, the Slot Capabilities register contents are + * undefined unless the Slot Implemented bit in the PCI Express + * Capabilities register is set. Check it before reading SLTCAP. + */ + if (pcie_capability_read_word(root_port, PCI_EXP_FLAGS, &flags)) + return false; + + if (!(flags & PCI_EXP_FLAGS_SLOT)) + return false; + + if (pcie_capability_read_dword(root_port, PCI_EXP_SLTCAP, &sltcap)) + return false; + + return (sltcap & (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP)) == + (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP); +} + +static void pcie_suppress_surprise_link_down(struct pci_dev *usp) +{ + u32 aer_uncorr_mask; + u16 aer_cap; + + aer_cap = usp->aer_cap; + if (!aer_cap) { + dev_dbg(&usp->dev, + "AER capability not present; cannot mask Surprise Link Down for cold reset\n"); + return; + } + + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask); + aer_uncorr_mask |= PCI_ERR_UNC_SURPDN; + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask); + dev_dbg(&usp->dev, "Non-hotplug slot: Surprise Link Down masked for cold reset\n"); +} +#endif /* CONFIG_PCIEAER */ + static void punit_error_handler(struct xe_device *xe) { +#ifdef CONFIG_PCIEAER + struct pci_dev *pdev = to_pci_dev(xe->drm.dev); + struct pci_dev *vsp, *usp; + + /* + * Device Hierarchy: + * + * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit + * + * Cold reset power-cycles the slot, dropping the PCIe link. On a non-hotplug + * slot this triggers a spurious Surprise Link Down AER event on the USP. + * Suppress it if the slot is not hotplug capable. + */ + vsp = pci_upstream_bridge(pdev); + usp = vsp ? pci_upstream_bridge(vsp) : NULL; + + if (usp && !pcie_slot_is_hotplug_capable(usp)) + pcie_suppress_surprise_link_down(usp); +#endif xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); xe_device_declare_wedged(xe); } -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm/xe: Suppress Surprise Link Down on non-hotplug device 2026-05-20 11:33 ` [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Patch Review The rationale is sound: a cold reset drops the PCIe link, and on non-hotplug slots this triggers a spurious Surprise Link Down AER event that defeats the recovery. Masking SLD on the USP before requesting the cold reset prevents this. **Concern: SLD mask is never restored** ```c pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask); aer_uncorr_mask |= PCI_ERR_UNC_SURPDN; pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask); ``` If the cold reset succeeds, hardware presumably reinitializes the AER registers, so the mask is implicitly cleared. But if the cold reset is never performed (userspace ignores the uevent) or fails, the SLD mask remains permanently applied on the USP. This means legitimate surprise link-down events would be silently masked for the remaining uptime of the system. Consider documenting this trade-off explicitly, or adding a comment explaining why restoration isn't needed. **`pcie_slot_is_hotplug_capable()` — correct but worth a comment** ```c static bool pcie_slot_is_hotplug_capable(struct pci_dev *usp) { struct pci_dev *root_port = pci_upstream_bridge(usp); ``` The function name says "slot" but it takes the USP and walks up to the root port. This is correct (the slot the USP is plugged into is owned by the root port), but the naming could confuse readers. A brief comment like `/* The hotplug slot is on usp's parent (root port) */` would help. **Hotplug capability check:** ```c return (sltcap & (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP)) == (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP); ``` Checking both Hot-Plug Capable and Power Controller Present is correct — both must be present for the slot to support power cycling that would make a surprise link-down expected. --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* Claude review: Introduce cold reset recovery method 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi ` (4 preceding siblings ...) 2026-05-20 11:33 ` [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi @ 2026-05-25 11:57 ` Claude Code Review Bot 5 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-25 11:57 UTC (permalink / raw) To: dri-devel-reviews Overall Series Review Subject: Introduce cold reset recovery method Author: Mallesh Koujalagi <mallesh.koujalagi@intel.com> Patches: 6 Reviewed: 2026-05-25T21:57:08.799531 --- This v6 series adds a `DRM_WEDGE_RECOVERY_COLD_RESET` recovery method for Intel Xe GPUs, targeting PUNIT errors that persist across warm resets and require a full slot power cycle. The design is sound: detect the error via RAS infrastructure, wedge the device, emit a uevent, and let userspace perform the power cycle. The DRM core change (patch 2) is clean and minimal. The documentation (patch 3) is well-written with a practical example script. The PUNIT handler (patch 4) correctly uses the wedge API. The AER SLD suppression (patch 5) is architecturally sensible but has a concern about the mask never being restored on failure. Patch 1 is marked "DO NOT REVIEW - COMPILATION ONLY" but has a clear uninitialized variable bug and a wedge API bypass that could trigger an assertion, which I flag below for completeness since it ships as part of this build. **Key concerns:** - Uninitialized variable in `handle_page_offline()` (patch 1) - `xe_pci_error_handling()` bypasses `xe_device_declare_wedged()` and could hit the `xe_gt_assert` if `wedged.mode == XE_WEDGED_MODE_NEVER` (patch 1) - Duplicate `XE_SYSCTRL_FLOOD` macro definition across two headers (patch 1) - SLD mask in patch 5 is never restored if cold reset fails or is never performed --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v5 0/5] Introduce cold reset recovery method @ 2026-05-12 13:26 Mallesh Koujalagi 2026-05-12 13:26 ` [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi 0 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-12 13:26 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi This series builds on top of Introduce Xe Uncorrectable Error Handling[1] and adds support for handling errors that require a complete device power cycle (cold reset) to recover. Certain error conditions leave the device in a persistent hardware error state that cannot be cleared through existing recovery mechanisms such as driver reload or PCIe reset. In these cases, functionality can only be restored by performing a cold reset. To support this, the series introduces a new DRM wedging recovery method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged with this method, the DRM core notifies userspace via a uevent that a cold reset is required. This allows userspace to take appropriate action to power-cycle the device. Example uevent received: SUBSYSTEM=drm WEDGED=cold-reset DEVPATH=/devices/.../drm/card0 Detailed description in commit message. [1] https://patchwork.freedesktop.org/series/160482/ This patch series introduces a call to punit_error_handler() from within handle_soc_internal_errors() when PUNIT errors detected. v2: - Add use case: Handling errors from power management unit, which requires a complete power cycle to recover. (Christian) - Add several instead of number to avoid update. (Jani) v3: - Update any scenario that requires cold-reset. (Riana) - Update document with generic scenario. (Riana) - Consistent with terminology. (Raag) - Remove already covered information. - Use PUNIT instead of PMU. (Riana) - Use consistent wordingi. - Remove log. (Raag) v4: - Rename cold reset to power cyclce. (Raag) - Update doc. (Raag/Riana) - Change commit message. (Raag) - Make function static. (Raag) v5: - Make it consistent with consumer expectations. (Raag) - Update commit message. - Remove unbind. - Simplify cold-reset script. - Remove kdoc for static function. - Remove xe_ prefix for static function. Cc: André Almeida <andrealmeid@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Maxime Ripard <mripard@kernel.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Mallesh Koujalagi (4): drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method drm/xe: Handle PUNIT errors by requesting cold-reset recovery drm/xe: Suppress Surprise Link Down on non-hotplug device Riana Tauro (1): Introduce Xe Uncorrectable Error Handling Documentation/gpu/drm-uapi.rst | 64 +- drivers/gpu/drm/drm_drv.c | 2 + drivers/gpu/drm/xe/Makefile | 1 + drivers/gpu/drm/xe/xe_device.c | 19 +- drivers/gpu/drm/xe/xe_device.h | 15 + drivers/gpu/drm/xe/xe_device_types.h | 6 + drivers/gpu/drm/xe/xe_gt.c | 14 +- drivers/gpu/drm/xe/xe_guc_submit.c | 9 +- drivers/gpu/drm/xe/xe_pci.c | 10 + drivers/gpu/drm/xe/xe_pci_error.c | 138 +++++ drivers/gpu/drm/xe/xe_ras.c | 552 ++++++++++++++++++ drivers/gpu/drm/xe/xe_ras.h | 5 +- drivers/gpu/drm/xe/xe_ras_types.h | 215 +++++++ drivers/gpu/drm/xe/xe_survivability_mode.c | 13 +- drivers/gpu/drm/xe/xe_sysctrl_event.c | 2 +- drivers/gpu/drm/xe/xe_sysctrl_event_types.h | 2 +- drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 11 + include/drm/drm_device.h | 1 + 18 files changed, 1058 insertions(+), 21 deletions(-) create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c -- 2.34.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-05-12 13:26 [PATCH v5 0/5] " Mallesh Koujalagi @ 2026-05-12 13:26 ` Mallesh Koujalagi 2026-05-16 3:17 ` Claude review: " Claude Code Review Bot 0 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-05-12 13:26 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi When PUNIT (power management unit) errors are detected that persist across warm resets, mark the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET and notify userspace that a complete device power cycle is required to restore normal operation. Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> --- v3: - Use PUNIT instead of PMU. (Riana) - Use consistent wordingi. - Remove log. (Raag) v4: - Make function static. (Raag) v5: - Remove kdoc for static function. (Raag) - Remove xe_ prefix for static function. --- drivers/gpu/drm/xe/xe_ras.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c index d79f8a6589ac..604470565bf3 100644 --- a/drivers/gpu/drm/xe/xe_ras.c +++ b/drivers/gpu/drm/xe/xe_ras.c @@ -4,6 +4,8 @@ */ #include "xe_bo.h" +#include "xe_assert.h" +#include "xe_device_types.h" #include "xe_device.h" #include "xe_printk.h" #include "xe_ras.h" @@ -222,6 +224,12 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device * return XE_RAS_RECOVERY_ACTION_RECOVERED; } +static void punit_error_handler(struct xe_device *xe) +{ + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); + xe_device_declare_wedged(xe); +} + static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe, struct xe_ras_error_array *arr) { @@ -265,7 +273,7 @@ static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device * xe_err(xe, "[RAS]: PUNIT %s detected: 0x%x\n", sev_to_str(error_class->common.severity), ieh_error->global_error_status); - /* TODO: Add PUNIT error handling */ + punit_error_handler(xe); return XE_RAS_RECOVERY_ACTION_DISCONNECT; } } -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-05-12 13:26 ` [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi @ 2026-05-16 3:17 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-05-16 3:17 UTC (permalink / raw) To: dri-devel-reviews Patch Review This is the core functional patch and is quite small. ```c +static void punit_error_handler(struct xe_device *xe) +{ + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); + xe_device_declare_wedged(xe); +} ``` Clean and correct — sets the recovery method, then declares the device wedged, which triggers the uevent to userspace. The call site change is also correct: ```c - /* TODO: Add PUNIT error handling */ + punit_error_handler(xe); return XE_RAS_RECOVERY_ACTION_DISCONNECT; ``` The function is properly `static` (as requested in v4 review feedback) and the `xe_` prefix is removed (v5 feedback). **Note:** `xe_assert.h` and `xe_device_types.h` are added as includes but neither appears to be directly needed by the code added in this patch. The `xe_assert.h` include is unused. `xe_device_types.h` would already be pulled in transitively through `xe_device.h`. These may be leftover from a previous version. ```c +#include "xe_assert.h" +#include "xe_device_types.h" ``` --- --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 0/4] Introduce cold reset recovery method @ 2026-04-06 14:23 Mallesh Koujalagi 2026-04-06 14:23 ` [PATCH v3 4/4] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi 0 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-04-06 14:23 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi This series builds on top of Introduce Xe Uncorrectable Error Handling[1] and adds support for handling errors that require a complete device power cycle (cold reset) to recover. Certain error conditions leave the device in a persistent hardware error state that cannot be cleared through existing recovery mechanisms such as driver reload or PCIe reset. In these cases, functionality can only be restored by performing a cold reset. To support this, the series introduces a new DRM wedging recovery method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged with this method, the DRM core notifies userspace via a uevent that a cold reset is required. This allows userspace to take appropriate action to power-cycle the device. Example uevent received: SUBSYSTEM=drm WEDGED=cold-reset DEVPATH=/devices/.../drm/card0 Detailed description in commit message. [1] https://patchwork.freedesktop.org/series/160482/ This patch series introduces a call to xe_punit_error_handler() from within handle_soc_internal_errors() when PUNIT errors detected. v2: - Add use case: Handling errors from power management unit, which requires a complete power cycle to recover. (Christian) - Add several instead of number to avoid update. (Jani) v3: - Update any scenario that requires cold-reset. (Riana) - Update document with generic scenario. (Riana) - Consistent with terminology. (Raag) - Remove already covered information. - Use PUNIT instead of PMU. (Riana) - Use consistent wordingi. - Remove log. (Raag) Cc: André Almeida <andrealmeid@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Maxime Ripard <mripard@kernel.org> Mallesh Koujalagi (3): drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method drm/xe: Handle PUNIT errors by requesting cold-reset recovery Riana Tauro (1): Introduce Xe Uncorrectable Error Handling Documentation/gpu/drm-uapi.rst | 60 +++- drivers/gpu/drm/drm_drv.c | 2 + drivers/gpu/drm/xe/Makefile | 2 + drivers/gpu/drm/xe/xe_device.c | 10 + drivers/gpu/drm/xe/xe_device.h | 15 + drivers/gpu/drm/xe/xe_device_types.h | 6 + drivers/gpu/drm/xe/xe_gt.c | 14 +- drivers/gpu/drm/xe/xe_guc_submit.c | 9 +- drivers/gpu/drm/xe/xe_pci.c | 3 + drivers/gpu/drm/xe/xe_pci_error.c | 118 ++++++ drivers/gpu/drm/xe/xe_ras.c | 337 ++++++++++++++++++ drivers/gpu/drm/xe/xe_ras.h | 17 + drivers/gpu/drm/xe/xe_ras_types.h | 203 +++++++++++ drivers/gpu/drm/xe/xe_survivability_mode.c | 12 +- drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 13 + include/drm/drm_device.h | 1 + 16 files changed, 813 insertions(+), 9 deletions(-) create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c create mode 100644 drivers/gpu/drm/xe/xe_ras.c create mode 100644 drivers/gpu/drm/xe/xe_ras.h create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h -- 2.34.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 4/4] drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-04-06 14:23 [PATCH v3 0/4] Introduce cold reset recovery method Mallesh Koujalagi @ 2026-04-06 14:23 ` Mallesh Koujalagi 2026-04-12 4:32 ` Claude review: " Claude Code Review Bot 0 siblings, 1 reply; 14+ messages in thread From: Mallesh Koujalagi @ 2026-04-06 14:23 UTC (permalink / raw) To: intel-xe, dri-devel, rodrigo.vivi Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard, anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa, sk.anirban, raag.jadav, Mallesh Koujalagi When PUNIT (power management unit) errors are detected that persist across warm resets, mark the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET and notify userspace that a complete device cold reset is required to restore normal operation. v3: - Use PUNIT instead of PMU. (Riana) - Use consistent wordingi. - Remove log. (Raag) Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> --- drivers/gpu/drm/xe/xe_ras.c | 21 ++++++++++++++++++++- drivers/gpu/drm/xe/xe_ras.h | 1 + 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c index 437811845c01..e2e1ab3fb4ce 100644 --- a/drivers/gpu/drm/xe/xe_ras.c +++ b/drivers/gpu/drm/xe/xe_ras.c @@ -5,6 +5,7 @@ #include "xe_assert.h" #include "xe_device_types.h" +#include "xe_device.h" #include "xe_printk.h" #include "xe_ras.h" #include "xe_ras_types.h" @@ -93,6 +94,24 @@ static enum xe_ras_recovery_action handle_compute_errors(struct xe_device *xe, return XE_RAS_RECOVERY_ACTION_RECOVERED; } +/** + * xe_punit_error_handler - Handler for Punit errors requiring cold reset + * @xe: device instance + * + * Handles Punit errors that affect the device and cannot be recovered + * through driver reload, PCIe reset, etc. + * + * Marks the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET method + * and notifies userspace that a device cold reset is required. + */ +void xe_punit_error_handler(struct xe_device *xe) +{ + xe_err(xe, "Recovery: Device cold reset required\n"); + + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); + xe_device_declare_wedged(xe); +} + static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe, struct xe_ras_error_array *arr) { @@ -132,7 +151,7 @@ static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device * xe_err(xe, "[RAS]: PUNIT %s error detected: 0x%x\n", severity_to_str(xe, common.severity), ieh_error->global_error_status); - /** TODO: Add PUNIT error handling */ + xe_punit_error_handler(xe); return XE_RAS_RECOVERY_ACTION_DISCONNECT; } } diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h index e191ab80080c..ab1fde200625 100644 --- a/drivers/gpu/drm/xe/xe_ras.h +++ b/drivers/gpu/drm/xe/xe_ras.h @@ -11,6 +11,7 @@ struct xe_device; void xe_ras_init(struct xe_device *xe); +void xe_punit_error_handler(struct xe_device *xe); enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe); #endif -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Claude review: drm/xe: Handle PUNIT errors by requesting cold-reset recovery 2026-04-06 14:23 ` [PATCH v3 4/4] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi @ 2026-04-12 4:32 ` Claude Code Review Bot 0 siblings, 0 replies; 14+ messages in thread From: Claude Code Review Bot @ 2026-04-12 4:32 UTC (permalink / raw) To: dri-devel-reviews Patch Review **`xe_punit_error_handler()` double-wedging concern:** ```c void xe_punit_error_handler(struct xe_device *xe) { xe_err(xe, "Recovery: Device cold reset required\n"); xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); xe_device_declare_wedged(xe); } ``` This is called from `handle_soc_internal_errors()`, which is called from `xe_ras_process_errors()`, which is called from `xe_pci_error_mmio_enabled()`. But looking at `xe_pci_error_detected()` for the `pci_channel_io_frozen` case, `xe_pci_error_handling()` already sets `atomic_set(&xe->wedged.flag, 1)` and calls `xe_gt_declare_wedged()`. So by the time `xe_punit_error_handler()` runs and calls `xe_device_declare_wedged()`, the device is already partially wedged. `xe_device_declare_wedged()` uses `atomic_xchg(&xe->wedged.flag, 1)` which will see it's already set and skip the PM runtime get and error message, but it will still call `xe_gt_declare_wedged()` again for all GTs (double wedge). The uevent will still fire (good), but the flow is fragile and depends on ordering assumptions. For the `pci_channel_io_normal` path (which returns `CAN_RECOVER` and leads to `mmio_enabled`), the wedged flag is NOT pre-set, so `xe_punit_error_handler()` works correctly. But this means the behavior differs depending on which PCI channel state triggered the error — this should be documented. **Exported function placement:** `xe_punit_error_handler()` is declared in `xe_ras.h` but it's a fairly high-level operation (wedge + uevent). It might make more sense in `xe_device.c` or at least the name should reflect that it's specific to the RAS/PCI error recovery path, not a general-purpose API. **Header include ordering in xe_ras.c:** ```c +#include "xe_device.h" ``` Added between `xe_device_types.h` and `xe_printk.h` — the includes should be in alphabetical order. `xe_device.h` should come before `xe_device_types.h`. --- Generated by Claude Code Patch Reviewer ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-25 11:57 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi 2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 3/5] drm/doc: Document " Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-20 11:33 ` [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi 2026-05-25 11:57 ` Claude review: " Claude Code Review Bot 2026-05-25 11:57 ` Claude review: Introduce cold reset recovery method Claude Code Review Bot -- strict thread matches above, loose matches on Subject: below -- 2026-05-12 13:26 [PATCH v5 0/5] " Mallesh Koujalagi 2026-05-12 13:26 ` [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi 2026-05-16 3:17 ` Claude review: " Claude Code Review Bot 2026-04-06 14:23 [PATCH v3 0/4] Introduce cold reset recovery method Mallesh Koujalagi 2026-04-06 14:23 ` [PATCH v3 4/4] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi 2026-04-12 4:32 ` Claude review: " Claude Code Review Bot
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox