public inbox for drm-ai-reviews@public-inbox.freedesktop.org
From: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	rodrigo.vivi@intel.com
Cc: andrealmeid@igalia.com, christian.koenig@amd.com,
	airlied@gmail.com, simona.vetter@ffwll.ch, mripard@kernel.org,
	maarten.lankhorst@linux.intel.com, tzimmermann@suse.de,
	anshuman.gupta@intel.com, badal.nilawar@intel.com,
	riana.tauro@intel.com, karthik.poosa@intel.com,
	sk.anirban@intel.com, raag.jadav@intel.com,
	Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Subject: [PATCH v6 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device
Date: Wed, 20 May 2026 17:03:57 +0530
Message-ID: <20260520113351.171119-12-mallesh.koujalagi@intel.com>
In-Reply-To: <20260520113351.171119-7-mallesh.koujalagi@intel.com>

Recovery from a PUNIT (power management unit) error on the GPU
triggers a power cycle (cold reset). On platforms where
the upstream port is not hotplug capable, the brief link drop
caused by powering the device off and back on is reported
by hardware as a Surprise Link Down (SLD), which AER then
escalates to an Uncorrectable Fatal Error. That error fires
before the device finishes coming back up and defeats the
very recovery we are attempting.

To keep the expected, recovery-induced link drop from being raised as
a fatal AER event, mask the Surprise Link Down bit
(PCI_ERR_UNC_SURPDN) in the upstream port's AER Uncorrectable Error
Mask register before punit_error_handler() requests the cold reset.
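The mask is only applied when the slot is not hotplug capable, i.e.
when the root port does not report both Hot-Plug Capable and Power
Controller Present in its Slot Capabilities.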

Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
---
v6:
- Expand commit message to explain why Surprise Link Down (SURPDN) is
  masked. (Raag/Riana)
- Check Slot Implemented bit before reading Slot Capabilities, per
  PCIe spec. (Riana)
- Add debug log.
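
The masking step described in the commit message is a read-modify-write
of the AER Uncorrectable Error Mask register. The userspace sketch
below illustrates only that bit manipulation and is not the driver
code: struct fake_cfg, cfg_read32(), cfg_write32() and
mask_surprise_link_down() are hypothetical names standing in for the
kernel's config-space accessors, and the buffer is host-endian rather
than real little-endian config space. The two register constants carry
the values from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNCOR_MASK	0x08		/* Uncorrectable Error Mask */
#define PCI_ERR_UNC_SURPDN	0x00000020	/* Surprise Down */

/* Hypothetical stand-in for one device's 4 KiB config space. */
struct fake_cfg {
	uint8_t bytes[4096];
};

static uint32_t cfg_read32(const struct fake_cfg *cfg, uint16_t off)
{
	uint32_t val;

	/* Only aligned, in-bounds dword accesses are allowed. */
	assert(off % 4 == 0 && off + sizeof(val) <= sizeof(cfg->bytes));
	memcpy(&val, &cfg->bytes[off], sizeof(val));
	return val;
}

static void cfg_write32(struct fake_cfg *cfg, uint16_t off, uint32_t val)
{
	assert(off % 4 == 0 && off + sizeof(val) <= sizeof(cfg->bytes));
	memcpy(&cfg->bytes[off], &val, sizeof(val));
}

/*
 * Set the Surprise Down bit in the Uncorrectable Error Mask register,
 * leaving every other mask bit untouched. Nothing is written when the
 * AER capability is absent (aer_cap == 0) or the slot is hotplug
 * capable. Returns true only if the mask was written.
 */
static bool mask_surprise_link_down(struct fake_cfg *cfg, uint16_t aer_cap,
				    bool hotplug_capable)
{
	uint16_t reg = (uint16_t)(aer_cap + PCI_ERR_UNCOR_MASK);
	uint32_t mask;

	if (!aer_cap || hotplug_capable)
		return false;

	mask = cfg_read32(cfg, reg);
	mask |= PCI_ERR_UNC_SURPDN;
	cfg_write32(cfg, reg, mask);
	return true;
}
```

With an AER capability assumed at offset 0x100 and a mask register that
initially holds 0x00400000, a call with hotplug_capable == false leaves
0x00400020 in the register; the pre-existing mask bit is preserved.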
---
 drivers/gpu/drm/xe/xe_ras.c | 66 +++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
index 604470565bf3..f1a6d7b23c93 100644
--- a/drivers/gpu/drm/xe/xe_ras.c
+++ b/drivers/gpu/drm/xe/xe_ras.c
@@ -224,8 +224,74 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device *
 	return XE_RAS_RECOVERY_ACTION_RECOVERED;
 }
 
+#ifdef CONFIG_PCIEAER
+static bool pcie_slot_is_hotplug_capable(struct pci_dev *usp)
+{
+	struct pci_dev *root_port = pci_upstream_bridge(usp);
+	u32 sltcap;
+	u16 flags;
+
+	if (!root_port)
+		return false;
+
+	/*
+	 * Per PCIe spec, the Slot Capabilities register contents are
+	 * undefined unless the Slot Implemented bit in the PCI Express
+	 * Capabilities register is set. Check it before reading SLTCAP.
+	 */
+	if (pcie_capability_read_word(root_port, PCI_EXP_FLAGS, &flags))
+		return false;
+
+	if (!(flags & PCI_EXP_FLAGS_SLOT))
+		return false;
+
+	if (pcie_capability_read_dword(root_port, PCI_EXP_SLTCAP, &sltcap))
+		return false;
+
+	return (sltcap & (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP)) ==
+		(PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP);
+}
+
+static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
+{
+	u32 aer_uncorr_mask;
+	u16 aer_cap;
+
+	aer_cap = usp->aer_cap;
+	if (!aer_cap) {
+		dev_dbg(&usp->dev,
+			"AER capability not present; cannot mask Surprise Link Down for cold reset\n");
+		return;
+	}
+
+	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
+	aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
+	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
+	dev_dbg(&usp->dev, "Non-hotplug slot: Surprise Link Down masked for cold reset\n");
+}
+#endif /* CONFIG_PCIEAER */
+
 static void punit_error_handler(struct xe_device *xe)
 {
+#ifdef CONFIG_PCIEAER
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	struct pci_dev *vsp, *usp;
+
+	/*
+	 * Device Hierarchy:
+	 *
+	 * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit
+	 *
+	 * Cold reset power-cycles the slot, dropping the PCIe link. On a non-hotplug
+	 * slot this triggers a spurious Surprise Link Down AER event on the USP.
+	 * Suppress it if the slot is not hotplug capable.
+	 */
+	vsp = pci_upstream_bridge(pdev);
+	usp = vsp ? pci_upstream_bridge(vsp) : NULL;
+
+	if (usp && !pcie_slot_is_hotplug_capable(usp))
+		pcie_suppress_surprise_link_down(usp);
+#endif
 	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
 	xe_device_declare_wedged(xe);
 }
-- 
2.34.1
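
For reference, the decision made by pcie_slot_is_hotplug_capable() in
the patch above can be expressed as a pure function of two register
values. This is a sketch under stated assumptions, not the driver code:
slot_regs_hotplug_capable() is a hypothetical name, and the caller is
assumed to have already read the PCI Express Capabilities and Slot
Capabilities registers successfully. The constants carry the values
from include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_FLAGS_SLOT	0x0100		/* Slot Implemented */
#define PCI_EXP_SLTCAP_PCP	0x00000002	/* Power Controller Present */
#define PCI_EXP_SLTCAP_HPC	0x00000040	/* Hot-Plug Capable */

/*
 * pcie_flags: value of the PCI Express Capabilities register.
 * sltcap:     value of the Slot Capabilities register.
 *
 * Slot Capabilities is only meaningful when Slot Implemented is set,
 * so that bit is checked first. The slot counts as hotplug capable
 * only if both Hot-Plug Capable and Power Controller Present are set.
 */
static bool slot_regs_hotplug_capable(uint16_t pcie_flags, uint32_t sltcap)
{
	const uint32_t need = PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP;

	if (!(pcie_flags & PCI_EXP_FLAGS_SLOT))
		return false;

	return (sltcap & need) == need;
}
```

Under this predicate a slot that sets Hot-Plug Capable without Power
Controller Present is treated as non-hotplug, so Surprise Link Down
would be masked for it as well.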


Thread overview: 12+ messages
2026-05-20 11:33 [PATCH v6 0/5] Introduce cold reset recovery method Mallesh Koujalagi
2026-05-20 11:33 ` [PATCH v6 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-05-25 11:57   ` Claude review: " Claude Code Review Bot
2026-05-20 11:33 ` [PATCH v6 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-05-25 11:57   ` Claude review: " Claude Code Review Bot
2026-05-20 11:33 ` [PATCH v6 3/5] drm/doc: Document " Mallesh Koujalagi
2026-05-25 11:57   ` Claude review: " Claude Code Review Bot
2026-05-20 11:33 ` [PATCH v6 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-05-25 11:57   ` Claude review: " Claude Code Review Bot
2026-05-20 11:33 ` Mallesh Koujalagi [this message]
2026-05-25 11:57   ` Claude review: drm/xe: Suppress Surprise Link Down on non-hotplug device Claude Code Review Bot
2026-05-25 11:57 ` Claude review: Introduce cold reset recovery method Claude Code Review Bot
