From: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
rodrigo.vivi@intel.com
Cc: andrealmeid@igalia.com, christian.koenig@amd.com,
airlied@gmail.com, simona.vetter@ffwll.ch, mripard@kernel.org,
anshuman.gupta@intel.com, badal.nilawar@intel.com,
riana.tauro@intel.com, karthik.poosa@intel.com,
sk.anirban@intel.com, raag.jadav@intel.com,
Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Subject: [PATCH v3 0/4] Introduce cold reset recovery method
Date: Mon, 6 Apr 2026 19:53:26 +0530
Message-ID: <20260406142325.157035-6-mallesh.koujalagi@intel.com>
This series builds on top of "Introduce Xe Uncorrectable Error Handling" [1]
and adds support for handling errors that require a complete
device power cycle (cold reset) to recover.
Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.
To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a cold
reset is required. This allows userspace to take appropriate action to
power-cycle the device.
Example uevent received:
SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0
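For illustration only, here is a minimal userspace sketch of how a consumer might recognise the uevent above. It is not part of this series. It assumes the payload arrives as one KEY=VALUE pair per line and that WEDGED may carry a comma-separated list of recovery methods; the helper names parse_uevent() and needs_cold_reset() are hypothetical.

```python
def parse_uevent(payload: str) -> dict[str, str]:
    """Split a KEY=VALUE-per-line uevent payload into a dict."""
    fields = {}
    for line in payload.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # ignore lines that carry no '='
            fields[key] = value
    return fields


def needs_cold_reset(fields: dict[str, str]) -> bool:
    """Return True if a DRM wedged event lists cold-reset as a recovery method."""
    if fields.get("SUBSYSTEM") != "drm":
        return False
    # Assumption: WEDGED holds one or more comma-separated method names.
    methods = fields.get("WEDGED", "").split(",")
    return "cold-reset" in methods


# The example uevent from this cover letter, with the elided path kept as-is.
example = """SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0"""

fields = parse_uevent(example)
if needs_cold_reset(fields):
    print(f"{fields['DEVPATH']}: cold reset required")
```

How the device is actually power-cycled after such an event is platform specific and is left to the userspace consumer.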
Detailed description in commit message.
[1] https://patchwork.freedesktop.org/series/160482/
This patch series introduces a call to xe_punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors are detected.
v2:
- Add use case: Handling errors from power management unit,
which requires a complete power cycle to
recover. (Christian)
- Use "several" instead of a specific number to avoid future updates. (Jani)
v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>
Mallesh Koujalagi (3):
drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
drm/xe: Handle PUNIT errors by requesting cold-reset recovery
Riana Tauro (1):
Introduce Xe Uncorrectable Error Handling
Documentation/gpu/drm-uapi.rst | 60 +++-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/Makefile | 2 +
drivers/gpu/drm/xe/xe_device.c | 10 +
drivers/gpu/drm/xe/xe_device.h | 15 +
drivers/gpu/drm/xe/xe_device_types.h | 6 +
drivers/gpu/drm/xe/xe_gt.c | 14 +-
drivers/gpu/drm/xe/xe_guc_submit.c | 9 +-
drivers/gpu/drm/xe/xe_pci.c | 3 +
drivers/gpu/drm/xe/xe_pci_error.c | 118 ++++++
drivers/gpu/drm/xe/xe_ras.c | 337 ++++++++++++++++++
drivers/gpu/drm/xe/xe_ras.h | 17 +
drivers/gpu/drm/xe/xe_ras_types.h | 203 +++++++++++
drivers/gpu/drm/xe/xe_survivability_mode.c | 12 +-
drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 13 +
include/drm/drm_device.h | 1 +
16 files changed, 813 insertions(+), 9 deletions(-)
create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
create mode 100644 drivers/gpu/drm/xe/xe_ras.c
create mode 100644 drivers/gpu/drm/xe/xe_ras.h
create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h
--
2.34.1