* Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
2026-02-23 6:05 ` [PATCH v8 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-02-24 0:45 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-02-24 0:45 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Platform guard placement:**
The `hw_error_source_handler` changes the platform check from `XE_BATTLEMAGE` to a broader `IS_DGFX(xe)` check:
> - if (xe->info.platform != XE_BATTLEMAGE)
> + if (!IS_DGFX(xe))
> return;
But `gt_hw_error_handler` has its own platform guard:
> + if (xe->info.platform != XE_PVC)
> + return;
This means for Battlemage (which is DGFX but not PVC), `hw_error_source_handler` will now enter the for_each_set_bit loop and try to process GT/SOC errors via `xe_hw_error_map`, but `ras->info[severity]` was never initialized (since `hw_error_info_init` only runs on PVC). The `if (!info) goto clear_reg` check prevents a crash, but the code path that reaches `gt_hw_error_handler` only to immediately return on the PVC check is somewhat wasteful and fragile. If `xe_hw_error_map` is ever extended for another platform, the guards inside the sub-handlers would need updating too.
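One way to make the guard less fragile is to express the capability once and test it before entering the loop. The following is a minimal userspace sketch, not the driver's code: `enum platform`, `struct device_info`, `has_gt_soc_error_support()` and `dispatch_errors()` are hypothetical stand-ins for `xe->info.platform`, `IS_DGFX(xe)` and the handler entry point.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's platform enum and device info. */
enum platform { PLATFORM_BATTLEMAGE, PLATFORM_PVC, PLATFORM_OTHER };

struct device_info {
	enum platform platform;
	bool is_dgfx;
};

/* Single place that states which devices expose GT/SOC error counters. */
static bool has_gt_soc_error_support(const struct device_info *info)
{
	return info->is_dgfx && info->platform == PLATFORM_PVC;
}

/*
 * Returns how many of the pending error bits would be dispatched to the
 * sub-handlers. Unsupported platforms never enter the dispatch path.
 */
static int dispatch_errors(const struct device_info *info,
			   unsigned int nr_set_bits)
{
	if (!has_gt_soc_error_support(info))
		return 0;

	return (int)nr_set_bits;
}
```

With a helper of this shape, extending the map to another platform would mean touching one predicate rather than each sub-handler.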
**Double-counting in subslice error path:**
> + case ERR_STAT_GT_VECTOR0:
> + case ERR_STAT_GT_VECTOR1: {
> + u32 errbit;
> +
> + val = hweight32(vector);
> + atomic_add(val, &info[error_id].counter);
> + ...
> + err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
> + for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
> + if (PVC_ERROR_MASK_SET(hw_err, errbit))
> + atomic_inc(&info[error_id].counter);
> + }
For subslice errors, the code first counts by `hweight32(vector)` (number of set bits in the vector register), then also iterates the error status register and increments the counter for each set bit matching the error mask. Are these truly independent error sources that should each contribute to the counter? Or is the error status register providing detail about the same errors reported in the vector? If it's the same errors, this is double-counting. The comment says the status register is "only populated once per error", which suggests it's supplementary detail, not additional errors. If that's the case, the `atomic_inc` for the status register bits is double-counting.
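To make the concern concrete, here is a small userspace sketch of the two counting schemes. It assumes the status register only adds detail about errors already flagged in the vector, which the patch does not confirm. `popcount32()` stands in for the kernel's `hweight32()`, and the function names and the plain `mask` argument (in place of `PVC_ERROR_MASK_SET`) are hypothetical.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's hweight32(). */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	/* Clear the lowest set bit until none remain. */
	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Scheme in the patch: vector bits plus every masked status bit. */
static unsigned int count_vector_and_status(uint32_t vector,
					    uint32_t err_stat, uint32_t mask)
{
	return popcount32(vector) + popcount32(err_stat & mask);
}

/* Alternative: count vector bits only; status bits are used for logging. */
static unsigned int count_vector_only(uint32_t vector)
{
	return popcount32(vector);
}
```

For a vector with two bits set and a status register with two masked bits set, the first scheme adds 4 to the counter and the second adds 2. Which one is right depends on whether the two registers report independent events.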
**`xe_hw_error_map` size vs `XE_RAS_REG_SIZE`:**
> +static const unsigned long xe_hw_error_map[] = {
> + [XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> +};
This array has only 1 element (index 0). In `hw_error_source_handler`:
> + if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
> + break;
When `err_bit >= 1`, the loop breaks entirely rather than continuing to the next bit. Because `for_each_set_bit` visits bits in ascending order, every set bit above bit 0 in `err_src` is skipped. Should this be `continue` instead of `break`? A CSC error at bit 17 is never reached via this loop, though it is handled by the earlier `if (err_src & REG_BIT(XE_CSC_ERROR))` check before the loop.
Patch 5 extends `xe_hw_error_map` with `[XE_SOC_ERROR] = ...` at index 16, so `ARRAY_SIZE(xe_hw_error_map)` becomes 17 (indices 0-16) and the loop only breaks when `err_bit >= 17`. That solves the problem for SOC errors, and `XE_CSC_ERROR` (bit 17) is already handled before the loop. So the `break` works for this specific set of patches, but it is fragile: bits 1 through 15 have no map entry, and `continue` would be more robust.
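A minimal userspace sketch of the more defensive loop follows. It is not the driver's code: the enum values, `error_map` and `count_handled()` are hypothetical, and it assumes a map value of 0 means "no component for this bit".

```c
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-ins for the driver's bit positions and component IDs. */
enum { GT_ERROR_BIT = 0, SOC_ERROR_BIT = 16 };
enum { COMP_NONE = 0, COMP_CORE_COMPUTE = 1, COMP_SOC_INTERNAL = 2 };

/* Sparse map: indices 1..15 are implicitly COMP_NONE. */
static const unsigned long error_map[] = {
	[GT_ERROR_BIT] = COMP_CORE_COMPUTE,
	[SOC_ERROR_BIT] = COMP_SOC_INTERNAL,
};

/* Count the set bits in err_src that map to a known component. */
static int count_handled(unsigned long err_src, unsigned int nbits)
{
	int handled = 0;

	for (unsigned int bit = 0; bit < nbits; bit++) {
		if (!(err_src & (1UL << bit)))
			continue;
		/* Skip, rather than stop at, bits without a map entry. */
		if (bit >= ARRAY_SIZE(error_map) || error_map[bit] == COMP_NONE)
			continue;
		handled++;
	}
	return handled;
}
```

With bits 0, 5 and 16 set, this handles bits 0 and 16 and ignores bit 5, independent of how the map is laid out or extended later.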
---
Generated by Claude Code Patch Reviewer
* [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS
@ 2026-02-28 8:08 Riana Tauro
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Riana Tauro
This work is a continuation of the great work started by Aravind ([1] and [2])
in order to fulfill the RAS requirements and proposal as previously discussed
and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
During the past review round, Lukas pointed out that netlink had evolved
in parallel during these years and that now, any new usage of netlink families
would require the usage of the YAML description and scripts.
With this new requirement in place, the family name is hardcoded in the yaml file,
so we are forced to have a single family name for the entire drm, and in turn we
are now forced to have a registration.
So, while doing the registration, we created the concept of drm-ras-node.
For now the only node type supported is the agreed error-counter. But that could
be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
driver.
In this first version, only querying counters is supported. But this can also be
extended in the future with multicast notifications and clearing of the counters.
This design with multiple nodes per device is already flexible enough for a driver
to decide if it wants to handle errors per device, per IP block, or per error
category. I believe this fully addresses the AMD feedback requested in the earlier
reviews.
So, my proposal is to start simple with this case as is, and then iterate
with drm-ras in tree so we evolve together according to the various drivers' RAS
needs.
I have provided documentation and the first Xe implementation of the counter
as reference.
Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that fully
exercises this new API, hence I hope this can be the reference code for the uAPI
usage, while we continue with the plan of introducing IGT tests and tools for this
and adjusting the internal vendor tools to align with open source developments and
support these flows.
Example:
1) List Nodes:
$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
'node-id': 0,
'node-name': 'correctable-errors',
'node-type': 'error-counter'},
{'device-name': '0000:03:00.0',
'node-id': 1,
'node-name': 'uncorrectable-errors',
'node-type': 'error-counter'}]
2) Get Error counters:
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
3) Get specific Error counter:
$ sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3
Rev2: Fix review comments
Add support for GT and SOC errors
Rev3: Add uAPI for errors and nodes
Update documentation
Rev4: Use only correctable and uncorrectable error nodes
use REG_BIT
remove redundant error strings
Rev5: Split patch 2
use atomic_t
fix memory leaks
fix logs
fix hook failure
change component and severity UAPI
Rev6: fix alignment
fix comparison in CSC error
add severity string to csc error
rename soc error handler base register variables
deallocate info if drm ras registration fails
rename init function to xe_drm_ras_init()
fix htmldocs errors
Add 'depends on NET' for drm ras netlink
Rev7: add macro for gt vector length and master local registers
print errors on failure
Rev8: use single command for both do/dump operations
fix yamllint errors
regenerate files
fix kernel-doc
Rev9: fix ynl-regen.sh reported issue
move drm_ras_nl.h from include/ to drm/
Riana Tauro (4):
drm/xe/xe_drm_ras: Add support for XE DRM RAS
drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
drm/xe/xe_hw_error: Add support for Core-Compute errors
drm/xe/xe_hw_error: Add support for PVC SoC errors
Rodrigo Vivi (1):
drm/ras: Introduce the DRM RAS infrastructure over generic netlink
Documentation/gpu/drm-ras.rst | 103 +++++
Documentation/gpu/index.rst | 1 +
Documentation/netlink/specs/drm_ras.yaml | 115 ++++++
drivers/gpu/drm/Kconfig | 10 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/drm_drv.c | 6 +
drivers/gpu/drm/drm_ras.c | 354 ++++++++++++++++
drivers/gpu/drm/drm_ras_genl_family.c | 42 ++
drivers/gpu/drm/drm_ras_nl.c | 56 +++
drivers/gpu/drm/drm_ras_nl.h | 24 ++
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 86 +++-
drivers/gpu/drm/xe/xe_device_types.h | 4 +
drivers/gpu/drm/xe/xe_drm_ras.c | 186 +++++++++
drivers/gpu/drm/xe/xe_drm_ras.h | 15 +
drivers/gpu/drm/xe/xe_drm_ras_types.h | 48 +++
drivers/gpu/drm/xe/xe_hw_error.c | 451 +++++++++++++++++++--
include/drm/drm_ras.h | 75 ++++
include/drm/drm_ras_genl_family.h | 17 +
include/uapi/drm/drm_ras.h | 49 +++
include/uapi/drm/xe_drm.h | 79 ++++
21 files changed, 1681 insertions(+), 42 deletions(-)
create mode 100644 Documentation/gpu/drm-ras.rst
create mode 100644 Documentation/netlink/specs/drm_ras.yaml
create mode 100644 drivers/gpu/drm/drm_ras.c
create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
create mode 100644 drivers/gpu/drm/drm_ras_nl.c
create mode 100644 drivers/gpu/drm/drm_ras_nl.h
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
create mode 100644 include/drm/drm_ras.h
create mode 100644 include/drm/drm_ras_genl_family.h
create mode 100644 include/uapi/drm/drm_ras.h
--
2.47.1
* [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
@ 2026-02-28 8:08 ` Riana Tauro
2026-02-28 16:04 ` Jakub Kicinski
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
` (4 subsequent siblings)
5 siblings, 2 replies; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Zack McKevitt, Lijo Lazar,
Hawking Zhang, Jakub Kicinski, David S. Miller, Paolo Abeni,
Eric Dumazet, netdev, Riana Tauro
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Introduces the DRM RAS infrastructure over generic netlink.
The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.
Currently it only supports error-counter nodes. But it can be
extended later.
The registration is also not tied to any drm node, so it can be
used by accel devices as well.
It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
But multiple endpoints are supported within the single family.
Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:
$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
-o include/uapi/drm/drm_ras.h
$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
Documentation/netlink/specs/drm_ras.yaml --mode kernel \
--header -o drivers/gpu/drm/drm_ras_nl.h
$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
Documentation/netlink/specs/drm_ras.yaml \
--mode kernel --source -o drivers/gpu/drm/drm_ras_nl.c
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
---
v2: fix doc and memory leak
use xa_for_each_start
use standard genlmsg_iput (Jakub Kicinski)
v3: add documentation to index
modify documentation to mention uAPI requirements (Rodrigo)
v4: fix typo (Zack)
v5: fix kernel-robot reported htmldocs error
fix kernel-robot reported build error on csky arch
v6: regenerate files
fix yamllint warnings
use get-error-counter for both do/dump function calls
fix kernel-doc (Jakub)
v7: Fix ynl-regen.sh reported issue
move drm_ras_nl.h from include/ to drm/
use -o to generate files and fix commit message (Jakub)
---
Documentation/gpu/drm-ras.rst | 103 +++++++
Documentation/gpu/index.rst | 1 +
Documentation/netlink/specs/drm_ras.yaml | 115 ++++++++
drivers/gpu/drm/Kconfig | 10 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/drm_drv.c | 6 +
drivers/gpu/drm/drm_ras.c | 354 +++++++++++++++++++++++
drivers/gpu/drm/drm_ras_genl_family.c | 42 +++
drivers/gpu/drm/drm_ras_nl.c | 56 ++++
drivers/gpu/drm/drm_ras_nl.h | 24 ++
include/drm/drm_ras.h | 75 +++++
include/drm/drm_ras_genl_family.h | 17 ++
include/uapi/drm/drm_ras.h | 49 ++++
13 files changed, 853 insertions(+)
create mode 100644 Documentation/gpu/drm-ras.rst
create mode 100644 Documentation/netlink/specs/drm_ras.yaml
create mode 100644 drivers/gpu/drm/drm_ras.c
create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
create mode 100644 drivers/gpu/drm/drm_ras_nl.c
create mode 100644 drivers/gpu/drm/drm_ras_nl.h
create mode 100644 include/drm/drm_ras.h
create mode 100644 include/drm/drm_ras_genl_family.h
create mode 100644 include/uapi/drm/drm_ras.h
diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
new file mode 100644
index 000000000000..70b246a78fc8
--- /dev/null
+++ b/Documentation/gpu/drm-ras.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================
+DRM RAS over Generic Netlink
+============================
+
+The DRM RAS (Reliability, Availability, Serviceability) interface provides a
+standardized way for GPU/accelerator drivers to expose error counters and
+other reliability nodes to user space via Generic Netlink. This allows
+diagnostic tools, monitoring daemons, or test infrastructure to query hardware
+health in a uniform way across different DRM drivers.
+
+Key Goals:
+
+* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
+ data center monitoring and reliability operations.
+* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
+ specifications and centralize all RAS-related communication in one namespace.
+* Support a basic error counter interface, addressing the immediate, essential
+ monitoring needs.
+* Offer a flexible, future-proof interface that can be extended to support
+ additional types of RAS data in the future.
+* Allow multiple nodes per driver, enabling drivers to register separate
+ nodes for different IP blocks, sub-blocks, or other logical subdivisions
+ as applicable.
+
+Nodes
+=====
+
+Nodes are logical abstractions representing an error type or error source within
+the device. Currently, only error counter nodes are supported.
+
+Drivers are responsible for registering and unregistering nodes via the
+`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
+
+Node Management
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+ :doc: DRM RAS Node Management
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+ :internal:
+
+Generic Netlink Usage
+=====================
+
+The interface is implemented as a Generic Netlink family named ``drm-ras``.
+User space tools can:
+
+* List registered nodes with the ``list-nodes`` command.
+* List all error counters in a node with the ``get-error-counter`` command with ``node-id``
+ as a parameter.
+* Query specific error counter values with the ``get-error-counter`` command, using both
+ ``node-id`` and ``error-id`` as parameters.
+
+YAML-based Interface
+--------------------
+
+The interface is described in a YAML specification ``Documentation/netlink/specs/drm_ras.yaml``.
+
+This YAML is used to auto-generate user space bindings via
+``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
+attributes and operations.
+
+Usage Notes
+-----------
+
+* User space must first enumerate nodes to obtain their IDs.
+* Node IDs or Node names can be used for all further queries, such as error counters.
+* Error counters can be queried by either the Error ID or Error name.
+* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
+* The interface supports future extension by adding new node types and
+ additional attributes.
+
+Example: List nodes using ynl
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --dump list-nodes
+ [{'device-name': '0000:03:00.0',
+ 'node-id': 0,
+ 'node-name': 'correctable-errors',
+ 'node-type': 'error-counter'},
+ {'device-name': '0000:03:00.0',
+ 'node-id': 1,
+ 'node-name': 'uncorrectable-errors',
+ 'node-type': 'error-counter'}]
+
+Example: List all error counters using ynl
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
+ [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
+ {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]
+
+Example: Query an error counter for a given node
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
+ {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
+
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 2fafa1f35ef3..5d708a106b3f 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
drm-mm
drm-kms
drm-kms-helpers
+ drm-ras
drm-uapi
drm-usage-stats
driver-uapi
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..79af25dac3c5
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,115 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+ DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+ Provides a standardized mechanism for DRM drivers to register "nodes"
+ representing hardware/software components capable of reporting error counters.
+ Userspace tools can query the list of nodes or individual error counters
+ via the Generic Netlink interface.
+
+definitions:
+ -
+ type: enum
+ name: node-type
+ value-start: 1
+ entries: [error-counter]
+ doc: >-
+ Type of the node. Currently, only error-counter nodes are
+ supported, which expose reliability counters for a hardware/software
+ component.
+
+attribute-sets:
+ -
+ name: node-attrs
+ attributes:
+ -
+ name: node-id
+ type: u32
+ doc: >-
+ Unique identifier for the node.
+ Assigned dynamically by the DRM RAS core upon registration.
+ -
+ name: device-name
+ type: string
+ doc: >-
+ Device name chosen by the driver at registration.
+ Can be a PCI BDF, UUID, or module name if unique.
+ -
+ name: node-name
+ type: string
+ doc: >-
+ Node name chosen by the driver at registration.
+ Can be an IP block name, or any name that identifies the
+ RAS node inside the device.
+ -
+ name: node-type
+ type: u32
+ doc: Type of this node, identifying its function.
+ enum: node-type
+ -
+ name: error-counter-attrs
+ attributes:
+ -
+ name: node-id
+ type: u32
+ doc: Node ID targeted by this error counter operation.
+ -
+ name: error-id
+ type: u32
+ doc: Unique identifier for a specific error counter within a node.
+ -
+ name: error-name
+ type: string
+ doc: Name of the error.
+ -
+ name: error-value
+ type: u32
+ doc: Current value of the requested error counter.
+
+operations:
+ list:
+ -
+ name: list-nodes
+ doc: >-
+ Retrieve the full list of currently registered DRM RAS nodes.
+ Each node includes its dynamically assigned ID, name, and type.
+ **Important:** User space must call this operation first to obtain
+ the node IDs. These IDs are required for all subsequent
+ operations on nodes, such as querying error counters.
+ attribute-set: node-attrs
+ flags: [admin-perm]
+ dump:
+ reply:
+ attributes:
+ - node-id
+ - device-name
+ - node-name
+ - node-type
+ -
+ name: get-error-counter
+ doc: >-
+ Retrieve error counter for a given node.
+ The response includes the id, the name, and even the current
+ value of each counter.
+ attribute-set: error-counter-attrs
+ flags: [admin-perm]
+ do:
+ request:
+ attributes:
+ - node-id
+ - error-id
+ reply:
+ attributes: &errorinfo
+ - error-id
+ - error-name
+ - error-value
+ dump:
+ request:
+ attributes:
+ - node-id
+ reply:
+ attributes: *errorinfo
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 0d0657dd1b41..5386248e75b6 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -130,6 +130,16 @@ config DRM_PANIC_SCREEN_QR_VERSION
Smaller QR code are easier to read, but will contain less debugging
data. Default is 40.
+config DRM_RAS
+ bool "DRM RAS support"
+ depends on DRM
+ depends on NET
+ help
+ Enables the DRM RAS (Reliability, Availability and Serviceability)
+ support for DRM drivers. This provides a Generic Netlink interface
+ for error reporting and queries.
+ If in doubt, say "N".
+
config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
bool "Enable refcount backtrace history in the DP MST helpers"
depends on STACKTRACE_SUPPORT
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index aba4bf542a35..e97faabcd783 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -93,6 +93,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
drm-$(CONFIG_DRM_PANIC) += drm_panic.o
drm-$(CONFIG_DRM_DRAW) += drm_draw.o
drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
+drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
obj-$(CONFIG_DRM) += drm.o
obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 2915118436ce..6b965c3d3307 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -53,6 +53,7 @@
#include <drm/drm_panic.h>
#include <drm/drm_print.h>
#include <drm/drm_privacy_screen_machine.h>
+#include <drm/drm_ras_genl_family.h>
#include "drm_crtc_internal.h"
#include "drm_internal.h"
@@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
static void drm_core_exit(void)
{
+ drm_ras_genl_family_unregister();
drm_privacy_screen_lookup_exit();
drm_panic_exit();
accel_core_exit();
@@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
drm_privacy_screen_lookup_init();
+ ret = drm_ras_genl_family_register();
+ if (ret < 0)
+ goto error;
+
drm_core_init_complete = true;
DRM_DEBUG("Initialized\n");
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
new file mode 100644
index 000000000000..23d2c8159e09
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras.c
@@ -0,0 +1,354 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <net/genetlink.h>
+
+#include <drm/drm_ras.h>
+
+#include "drm_ras_nl.h"
+
+/**
+ * DOC: DRM RAS Node Management
+ *
+ * This module provides the infrastructure to manage RAS (Reliability,
+ * Availability, and Serviceability) nodes for DRM drivers. Each
+ * DRM driver may register one or more RAS nodes, which represent
+ * logical components capable of reporting error counters and other
+ * reliability metrics.
+ *
+ * The nodes are stored in a global xarray `drm_ras_xa` to allow
+ * efficient lookup by ID. Nodes can be registered or unregistered
+ * dynamically at runtime.
+ *
+ * A Generic Netlink family `drm-ras` exposes two main operations to
+ * userspace:
+ *
+ * 1. LIST_NODES: Dump all currently registered RAS nodes.
+ * The user receives an array of node IDs, names, and types.
+ *
+ * 2. GET_ERROR_COUNTER: Get error counters of a given node.
+ * Userspace must provide the Node ID; the Error ID is optional.
+ * Returns all counters of a node if only the Node ID is provided, or the
+ * specific error counter if the Error ID is also provided.
+ *
+ * Node registration:
+ *
+ * - drm_ras_node_register(): Registers a new node and assigns
+ * it a unique ID in the xarray.
+ * - drm_ras_node_unregister(): Removes a previously registered
+ * node from the xarray.
+ *
+ * Node type:
+ *
+ * - ERROR_COUNTER:
+ * + Currently, only error counters are supported.
+ * + The driver must implement the query_error_counter() callback to provide
+ * the name and the value of the error counter.
+ * + The driver must provide an error_counter_range.last value informing the
+ * last valid error ID.
+ * + The driver can provide an error_counter_range.first value informing the
+ * first valid error ID.
+ * + The error counters in the driver don't need to be contiguous, but the
+ * driver must return -ENOENT to the query_error_counter as an indication
+ * that the ID should be skipped and not listed in the netlink API.
+ *
+ * Netlink handlers:
+ *
+ * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
+ * operation, iterating over the xarray.
+ * - drm_ras_nl_get_error_counter_dumpit(): Implements the GET_ERROR_COUNTER dumpit
+ * operation, fetching all counters from a specific node.
+ * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
+ * operation, fetching a counter value from a specific node.
+ */
+
+static DEFINE_XARRAY_ALLOC(drm_ras_xa);
+
+/*
+ * The netlink callback context carries dump state across multiple dumpit calls
+ */
+struct drm_ras_ctx {
+ /* Which xarray id to restart the dump from */
+ unsigned long restart;
+};
+
+/**
+ * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all registered RAS nodes in the global xarray and appends
+ * their attributes (ID, name, type) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last node on the next invocation.
+ *
+ * Return: 0 if all nodes fit in @skb, -EMSGSIZE if the buffer filled up
+ * (the dump resumes from the saved context on the next invocation), or
+ * another negative error code on failure.
+ */
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ const struct genl_info *info = genl_info_dump(cb);
+ struct drm_ras_ctx *ctx = (void *)cb->ctx;
+ struct drm_ras_node *node;
+ struct nlattr *hdr;
+ unsigned long id;
+ int ret = 0;
+
+ xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
+ hdr = genlmsg_iput(skb, info);
+ if (!hdr) {
+ ret = -EMSGSIZE;
+ break;
+ }
+
+ ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
+ if (ret) {
+ genlmsg_cancel(skb, hdr);
+ break;
+ }
+
+ ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+ node->device_name);
+ if (ret) {
+ genlmsg_cancel(skb, hdr);
+ break;
+ }
+
+ ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+ node->node_name);
+ if (ret) {
+ genlmsg_cancel(skb, hdr);
+ break;
+ }
+
+ ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+ node->type);
+ if (ret) {
+ genlmsg_cancel(skb, hdr);
+ break;
+ }
+
+ genlmsg_end(skb, hdr);
+ }
+
+ if (ret == -EMSGSIZE)
+ ctx->restart = id;
+
+ return ret;
+}
+
+static int get_node_error_counter(u32 node_id, u32 error_id,
+ const char **name, u32 *value)
+{
+ struct drm_ras_node *node;
+
+ node = xa_load(&drm_ras_xa, node_id);
+ if (!node || !node->query_error_counter)
+ return -ENOENT;
+
+ if (error_id < node->error_counter_range.first ||
+ error_id > node->error_counter_range.last)
+ return -EINVAL;
+
+ return node->query_error_counter(node, error_id, name, value);
+}
+
+static int msg_reply_value(struct sk_buff *msg, u32 error_id,
+ const char *error_name, u32 value)
+{
+ int ret;
+
+ ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+ if (ret)
+ return ret;
+
+ ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+ error_name);
+ if (ret)
+ return ret;
+
+ return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+ value);
+}
+
+static int doit_reply_value(struct genl_info *info, u32 node_id,
+ u32 error_id)
+{
+ struct sk_buff *msg;
+ struct nlattr *hdr;
+ const char *error_name;
+ u32 value;
+ int ret;
+
+ ret = get_node_error_counter(node_id, error_id,
+ &error_name, &value);
+ if (ret)
+ return ret;
+
+ msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+ if (!msg)
+ return -ENOMEM;
+
+ hdr = genlmsg_iput(msg, info);
+ if (!hdr) {
+ nlmsg_free(msg);
+ return -EMSGSIZE;
+ }
+
+ ret = msg_reply_value(msg, error_id, error_name, value);
+ if (ret) {
+ genlmsg_cancel(msg, hdr);
+ nlmsg_free(msg);
+ return ret;
+ }
+
+ genlmsg_end(msg, hdr);
+
+ return genlmsg_reply(msg, info);
+}
+
+/**
+ * drm_ras_nl_get_error_counter_dumpit() - Dump all Error Counters
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all error counters in a given Node and appends
+ * their attributes (ID, name, value) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last error counter on the next invocation.
+ *
+ * Return: 0 if all errors fit in @skb, -EMSGSIZE if the buffer filled up
+ * (the dump resumes from the saved context on the next invocation), or
+ * another negative error code on failure.
+ */
+int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ const struct genl_info *info = genl_info_dump(cb);
+ struct drm_ras_ctx *ctx = (void *)cb->ctx;
+ struct drm_ras_node *node;
+ struct nlattr *hdr;
+ const char *error_name;
+ u32 node_id, error_id, value;
+ int ret = 0;
+
+ if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
+ return -EINVAL;
+
+ node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+
+ node = xa_load(&drm_ras_xa, node_id);
+ if (!node)
+ return -ENOENT;
+
+ for (error_id = max(node->error_counter_range.first, ctx->restart);
+ error_id <= node->error_counter_range.last;
+ error_id++) {
+ ret = get_node_error_counter(node_id, error_id,
+ &error_name, &value);
+ /*
+ * For a non-contiguous range, the driver returns -ENOENT as an
+ * indication to skip this ID when listing all errors.
+ */
+ if (ret == -ENOENT)
+ continue;
+ if (ret)
+ return ret;
+
+ hdr = genlmsg_iput(skb, info);
+
+ if (!hdr) {
+ ret = -EMSGSIZE;
+ break;
+ }
+
+ ret = msg_reply_value(skb, error_id, error_name, value);
+ if (ret) {
+ genlmsg_cancel(skb, hdr);
+ break;
+ }
+
+ genlmsg_end(skb, hdr);
+ }
+
+ if (ret == -EMSGSIZE)
+ ctx->restart = error_id;
+
+ return ret;
+}
+
+/**
+ * drm_ras_nl_get_error_counter_doit() - Query an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * retrieves the current value of the corresponding error counter. Sends the
+ * result back to the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
+ struct genl_info *info)
+{
+ u32 node_id, error_id;
+
+ if (!info->attrs ||
+ !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
+ !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
+ return -EINVAL;
+
+ node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+ error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+ return doit_reply_value(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_node_register() - Register a new RAS node
+ * @node: Node structure to register
+ *
+ * Adds the given RAS node to the global node xarray and assigns it a unique
+ * ID. @node->device_name, @node->node_name and @node->type must be valid.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_node_register(struct drm_ras_node *node)
+{
+ if (!node->device_name || !node->node_name)
+ return -EINVAL;
+
+ /* Currently, only Error Counter Endpoints are supported */
+ if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
+ return -EINVAL;
+
+ /* Mandatory entries for Error Counter Node */
+ if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
+ (!node->error_counter_range.last || !node->query_error_counter))
+ return -EINVAL;
+
+ return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
+}
+EXPORT_SYMBOL(drm_ras_node_register);
+
+/**
+ * drm_ras_node_unregister() - Unregister a previously registered node
+ * @node: Node structure to unregister
+ *
+ * Removes the given node from the global node xarray using its ID.
+ */
+void drm_ras_node_unregister(struct drm_ras_node *node)
+{
+ xa_erase(&drm_ras_xa, node->id);
+}
+EXPORT_SYMBOL(drm_ras_node_unregister);
diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
new file mode 100644
index 000000000000..6f406d3d48c5
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_genl_family.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#include <drm/drm_ras_genl_family.h>
+#include "drm_ras_nl.h"
+
+/* Track family registration so the drm_exit can be called at any time */
+static bool registered;
+
+/**
+ * drm_ras_genl_family_register() - Register drm-ras genl family
+ *
+ * Only to be called once at drm_drv_init()
+ */
+int drm_ras_genl_family_register(void)
+{
+ int ret;
+
+ registered = false;
+
+ ret = genl_register_family(&drm_ras_nl_family);
+ if (ret)
+ return ret;
+
+ registered = true;
+ return 0;
+}
+
+/**
+ * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
+ *
+ * To be called at drm_drv_exit() at any moment, but only once.
+ */
+void drm_ras_genl_family_unregister(void)
+{
+ if (registered) {
+ genl_unregister_family(&drm_ras_nl_family);
+ registered = false;
+ }
+}
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
new file mode 100644
index 000000000000..16803d0c4a44
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel source */
+/* To regenerate run: tools/net/ynl/ynl-regen.sh */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include "drm_ras_nl.h"
+
+#include <uapi/drm/drm_ras.h>
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_get_error_counter_do_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTER - dump */
+static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+};
+
+/* Ops table for drm_ras */
+static const struct genl_split_ops drm_ras_nl_ops[] = {
+ {
+ .cmd = DRM_RAS_CMD_LIST_NODES,
+ .dumpit = drm_ras_nl_list_nodes_dumpit,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+ },
+ {
+ .cmd = DRM_RAS_CMD_GET_ERROR_COUNTER,
+ .doit = drm_ras_nl_get_error_counter_doit,
+ .policy = drm_ras_get_error_counter_do_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
+ {
+ .cmd = DRM_RAS_CMD_GET_ERROR_COUNTER,
+ .dumpit = drm_ras_nl_get_error_counter_dumpit,
+ .policy = drm_ras_get_error_counter_dump_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+ },
+};
+
+struct genl_family drm_ras_nl_family __ro_after_init = {
+ .name = DRM_RAS_FAMILY_NAME,
+ .version = DRM_RAS_FAMILY_VERSION,
+ .netnsok = true,
+ .parallel_ops = true,
+ .module = THIS_MODULE,
+ .split_ops = drm_ras_nl_ops,
+ .n_split_ops = ARRAY_SIZE(drm_ras_nl_ops),
+};
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
new file mode 100644
index 000000000000..06ccd9342773
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel header */
+/* To regenerate run: tools/net/ynl/ynl-regen.sh */
+
+#ifndef _LINUX_DRM_RAS_GEN_H
+#define _LINUX_DRM_RAS_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
+int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
+ struct genl_info *info);
+int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
+
+extern struct genl_family drm_ras_nl_family;
+
+#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
new file mode 100644
index 000000000000..5d50209e51db
--- /dev/null
+++ b/include/drm/drm_ras.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_H__
+#define __DRM_RAS_H__
+
+#include <uapi/drm/drm_ras.h>
+
+/**
+ * struct drm_ras_node - A DRM RAS Node
+ */
+struct drm_ras_node {
+ /** @id: Unique identifier for the node. Dynamically assigned. */
+ u32 id;
+ /**
+ * @device_name: Human-readable name of the device. Given by the driver.
+ */
+ const char *device_name;
+ /** @node_name: Human-readable name of the node. Given by the driver. */
+ const char *node_name;
+ /** @type: Type of the node (enum drm_ras_node_type). */
+ enum drm_ras_node_type type;
+
+ /* Error-Counter Related Callback and Variables */
+
+ /** @error_counter_range: Range of valid Error IDs for this node. */
+ struct {
+ /** @first: First valid Error ID. */
+ u32 first;
+ /** @last: Last valid Error ID. Mandatory entry. */
+ u32 last;
+ } error_counter_range;
+
+ /**
+ * @query_error_counter:
+ *
+ * This callback is used by drm-ras to query a specific error counter.
+ * Used for input check and to iterate all error counters in a node.
+ *
+ * Driver should expect query_error_counter() to be called with
+ * error_id from `error_counter_range.first` to
+ * `error_counter_range.last`.
+ *
+ * The @query_error_counter is a mandatory callback for
+ * error_counter_node.
+ *
+ * Returns: 0 on success,
+ * -ENOENT when error_id is not supported as an indication that
+ * drm_ras should silently skip this entry. Used for
+ * supporting non-contiguous error ranges.
+ * Driver is responsible for maintaining the list of
+ * supported error IDs in the range of first to last.
+ * Other negative values on errors that should terminate the
+ * netlink query.
+ */
+ int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
+ const char **name, u32 *val);
+
+ /** @priv: Driver private data */
+ void *priv;
+};
+
+struct drm_device;
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_node_register(struct drm_ras_node *node);
+void drm_ras_node_unregister(struct drm_ras_node *node);
+#else
+static inline int drm_ras_node_register(struct drm_ras_node *node) { return 0; }
+static inline void drm_ras_node_unregister(struct drm_ras_node *node) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
new file mode 100644
index 000000000000..910fb3943a75
--- /dev/null
+++ b/include/drm/drm_ras_genl_family.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_GENL_FAMILY_H__
+#define __DRM_RAS_GENL_FAMILY_H__
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_genl_family_register(void);
+void drm_ras_genl_family_unregister(void);
+#else
+static inline int drm_ras_genl_family_register(void) { return 0; }
+static inline void drm_ras_genl_family_unregister(void) { }
+#endif
+
+#endif
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
new file mode 100644
index 000000000000..5f40fa5b869d
--- /dev/null
+++ b/include/uapi/drm/drm_ras.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN uapi header */
+/* To regenerate run: tools/net/ynl/ynl-regen.sh */
+
+#ifndef _UAPI_LINUX_DRM_RAS_H
+#define _UAPI_LINUX_DRM_RAS_H
+
+#define DRM_RAS_FAMILY_NAME "drm-ras"
+#define DRM_RAS_FAMILY_VERSION 1
+
+/*
+ * Type of the node. Currently, only error-counter nodes are supported, which
+ * expose reliability counters for a hardware/software component.
+ */
+enum drm_ras_node_type {
+ DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
+};
+
+enum {
+ DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
+ DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+ DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+ DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+
+ __DRM_RAS_A_NODE_ATTRS_MAX,
+ DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
+};
+
+enum {
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+
+ __DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
+};
+
+enum {
+ DRM_RAS_CMD_LIST_NODES = 1,
+ DRM_RAS_CMD_GET_ERROR_COUNTER,
+
+ __DRM_RAS_CMD_MAX,
+ DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_DRM_RAS_H */
--
2.47.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
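The multi-part dump in drm_ras_nl_get_error_counter_dumpit() above combines two ideas: IDs for which the driver returns -ENOENT are skipped silently, and on overflow the first ID that did not fit is stored so the next call resumes there. The following is a minimal userspace sketch of that control flow only, not the kernel implementation. All demo_* names, the counter table and the "room" parameter (standing in for the space left in the netlink buffer) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sparse counter table: IDs 1 and 3 exist, ID 2 is a hole. */
struct demo_counter {
	const char *name;
	uint32_t value;
};

static const struct demo_counter demo_counters[] = {
	[1] = { "core-compute", 5 },
	[3] = { "soc-internal", 7 },
};

#define DEMO_FIRST 1u
#define DEMO_LAST  3u

/* Stand-in for a driver's query_error_counter() callback. */
static int demo_query(uint32_t id, const char **name, uint32_t *val)
{
	assert(name && val);

	if (id < DEMO_FIRST || id > DEMO_LAST || !demo_counters[id].name)
		return -ENOENT;	/* unsupported ID: caller should skip it */

	*name = demo_counters[id].name;
	*val = demo_counters[id].value;
	return 0;
}

/*
 * Emit at most @room IDs into @out_ids, starting at *@restart.
 * Returns 0 when the whole range was walked, or -EMSGSIZE after
 * saving the first ID that did not fit in *@restart.
 */
static int demo_dump(uint32_t *restart, uint32_t *out_ids, int room,
		     int *emitted)
{
	uint32_t id = *restart > DEMO_FIRST ? *restart : DEMO_FIRST;
	int n = 0;

	for (; id <= DEMO_LAST; id++) {
		const char *name;
		uint32_t val;
		int ret = demo_query(id, &name, &val);

		if (ret == -ENOENT)
			continue;	/* hole in the ID range */
		if (ret) {
			*emitted = n;
			return ret;
		}
		if (n == room) {
			/* "Buffer" is full: remember where to resume. */
			*restart = id;
			*emitted = n;
			return -EMSGSIZE;
		}
		out_ids[n++] = id;
	}

	*emitted = n;
	return 0;
}
```

With room for a single entry per call, the first call emits ID 1 and stops at ID 3; the second call resumes at ID 3 and completes, never emitting the hole at ID 2.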
* [PATCH v9 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-02-28 8:08 ` Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
` (3 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Riana Tauro
Allocate correctable and uncorrectable nodes for every xe device. Each node
contains the error components, their counters and the respective query functions.
Add basic functionality to create and register drm nodes.
The following operations can be performed using the generic netlink DRM RAS interface:
1) List Nodes:
$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
'node-id': 0,
'node-name': 'correctable-errors',
'node-type': 'error-counter'},
{'device-name': '0000:03:00.0',
'node-id': 1,
'node-name': 'uncorrectable-errors',
'node-type': 'error-counter'}]
2) Get Error counters:
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
3) Get specific Error counter:
$ sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)
Add documentation
Modify commit message
v3: remove 'error' from counters
use drmm_kcalloc
add a for_each for severity
differentiate error classes and severity in uapi (Raag)
Use correctable and uncorrectable in uapi (Pratik / Aravind)
v4: change UAPI enums
split patches
make drm_kcalloc robust
fix function names and memory leak (Raag)
add null pointer check for device_name
start loop counter from 1
v5: reorder linux includes
rename to error_components
use drmm_action for de-registration
rename xe_drm_ras_allocate_nodes to xe_drm_ras_init
free info if DRM RAS registration fails (Raag)
v6: cleanup node parameters in LIFO order
print errors for failures (Raag)
---
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/xe_device_types.h | 4 +
drivers/gpu/drm/xe/xe_drm_ras.c | 186 ++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_drm_ras.h | 15 +++
drivers/gpu/drm/xe/xe_drm_ras_types.h | 40 ++++++
include/uapi/drm/xe_drm.h | 79 +++++++++++
6 files changed, 325 insertions(+)
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index ff778fb2d4ff..f63fdf80055a 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -41,6 +41,7 @@ xe-y += xe_bb.o \
xe_device_sysfs.o \
xe_dma_buf.o \
xe_drm_client.o \
+ xe_drm_ras.o \
xe_eu_stall.o \
xe_exec.o \
xe_exec_queue.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 8f3ef836541e..666ca2374f70 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -13,6 +13,7 @@
#include <drm/ttm/ttm_device.h>
#include "xe_devcoredump_types.h"
+#include "xe_drm_ras_types.h"
#include "xe_heci_gsc.h"
#include "xe_late_bind_fw_types.h"
#include "xe_oa_types.h"
@@ -507,6 +508,9 @@ struct xe_device {
/** @pmu: performance monitoring unit */
struct xe_pmu pmu;
+ /** @ras: RAS structure for device */
+ struct xe_drm_ras ras;
+
/** @i2c: I2C host controller */
struct xe_i2c *i2c;
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
new file mode 100644
index 000000000000..e07dc23a155e
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#include <linux/bitmap.h>
+
+#include <drm/drm_managed.h>
+#include <drm/drm_print.h>
+#include <drm/drm_ras.h>
+
+#include "xe_device_types.h"
+#include "xe_drm_ras.h"
+
+static const char * const error_components[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+
+static int hw_query_error_counter(struct xe_drm_ras_counter *info,
+ u32 error_id, const char **name, u32 *val)
+{
+ if (!info || !info[error_id].name)
+ return -ENOENT;
+
+ *name = info[error_id].name;
+ *val = atomic_read(&info[error_id].counter);
+
+ return 0;
+}
+
+static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
+ const char **name, u32 *val)
+{
+ struct xe_device *xe = ep->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+ return hw_query_error_counter(info, error_id, name, val);
+}
+
+static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
+ const char **name, u32 *val)
+{
+ struct xe_device *xe = ep->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+ return hw_query_error_counter(info, error_id, name, val);
+}
+
+static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
+{
+ struct xe_drm_ras_counter *counter;
+ int i;
+
+ counter = kcalloc(DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
+ if (!counter)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE; i < DRM_XE_RAS_ERR_COMP_MAX; i++) {
+ if (!error_components[i])
+ continue;
+
+ counter[i].name = error_components[i];
+ atomic_set(&counter[i].counter, 0);
+ }
+
+ return counter;
+}
+
+static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
+ const enum drm_xe_ras_error_severity severity)
+{
+ struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+ struct xe_drm_ras *ras = &xe->ras;
+ const char *device_name;
+
+ device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
+ pci_domain_nr(pdev->bus), pdev->bus->number,
+ PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+ if (!device_name)
+ return -ENOMEM;
+
+ node->device_name = device_name;
+ node->node_name = error_severity[severity];
+ node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
+ node->error_counter_range.first = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE;
+ node->error_counter_range.last = DRM_XE_RAS_ERR_COMP_MAX - 1;
+ node->priv = xe;
+
+ ras->info[severity] = allocate_and_copy_counters(xe);
+ if (IS_ERR(ras->info[severity]))
+ return PTR_ERR(ras->info[severity]);
+
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ node->query_error_counter = query_correctable_error_counter;
+ else
+ node->query_error_counter = query_uncorrectable_error_counter;
+
+ return 0;
+}
+
+static void cleanup_node_param(struct xe_drm_ras *ras, const enum drm_xe_ras_error_severity severity)
+{
+ struct drm_ras_node *node = &ras->node[severity];
+
+ kfree(ras->info[severity]);
+ ras->info[severity] = NULL;
+
+ kfree(node->device_name);
+ node->device_name = NULL;
+}
+
+static int register_nodes(struct xe_device *xe)
+{
+ struct xe_drm_ras *ras = &xe->ras;
+ int i;
+
+ for_each_error_severity(i) {
+ struct drm_ras_node *node = &ras->node[i];
+ int ret;
+
+ ret = assign_node_params(xe, node, i);
+ if (ret) {
+ cleanup_node_param(ras, i);
+ return ret;
+ }
+
+ ret = drm_ras_node_register(node);
+ if (ret) {
+ cleanup_node_param(ras, i);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
+{
+ struct xe_device *xe = arg;
+ struct xe_drm_ras *ras = &xe->ras;
+ int i;
+
+ for_each_error_severity(i) {
+ struct drm_ras_node *node = &ras->node[i];
+
+ drm_ras_node_unregister(node);
+ cleanup_node_param(ras, i);
+ }
+}
+
+/**
+ * xe_drm_ras_init() - Initialize DRM RAS
+ * @xe: xe device instance
+ *
+ * Allocate and register DRM RAS nodes per device
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int xe_drm_ras_init(struct xe_device *xe)
+{
+ struct xe_drm_ras *ras = &xe->ras;
+ struct drm_ras_node *node;
+ int err;
+
+ node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_SEV_MAX, sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ ras->node = node;
+
+ err = register_nodes(xe);
+ if (err) {
+ drm_err(&xe->drm, "Failed to register DRM RAS nodes (%pe)\n", ERR_PTR(err));
+ return err;
+ }
+
+ err = drmm_add_action_or_reset(&xe->drm, xe_drm_ras_unregister_nodes, xe);
+ if (err) {
+ drm_err(&xe->drm, "Failed to add action for Xe DRM RAS (%pe)\n", ERR_PTR(err));
+ return err;
+ }
+
+ return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
new file mode 100644
index 000000000000..5cc8f0124411
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+#ifndef XE_DRM_RAS_H_
+#define XE_DRM_RAS_H_
+
+struct xe_device;
+
+#define for_each_error_severity(i) \
+ for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
+
+int xe_drm_ras_init(struct xe_device *xe);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
new file mode 100644
index 000000000000..7acc5e7377b2
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _XE_DRM_RAS_TYPES_H_
+#define _XE_DRM_RAS_TYPES_H_
+
+#include <linux/atomic.h>
+#include <drm/xe_drm.h>
+
+struct drm_ras_node;
+
+/**
+ * struct xe_drm_ras_counter - XE RAS counter
+ *
+ * This structure contains error component and counter information
+ */
+struct xe_drm_ras_counter {
+ /** @name: error component name */
+ const char *name;
+
+ /** @counter: count of error */
+ atomic_t counter;
+};
+
+/**
+ * struct xe_drm_ras - XE DRM RAS structure
+ *
+ * This structure has details of error counters
+ */
+struct xe_drm_ras {
+ /** @node: DRM RAS node */
+ struct drm_ras_node *node;
+
+ /** @info: info array for all types of errors */
+ struct xe_drm_ras_counter *info[DRM_XE_RAS_ERR_SEV_MAX];
+};
+
+#endif
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index ef2565048bdf..b0264c32ceb2 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -2357,6 +2357,85 @@ struct drm_xe_exec_queue_set_property {
__u64 reserved[2];
};
+/**
+ * DOC: Xe DRM RAS
+ *
+ * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
+ * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
+ *
+ * Node Registration
+ * =================
+ *
+ * The driver registers DRM RAS nodes for each error severity level.
+ * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
+ * node-id to node-name.
+ *
+ * Error Classification
+ * ====================
+ *
+ * Each node contains a list of error counters. Each error is identified by an error-id and
+ * an error-name. enum drm_xe_ras_error_component defines the error-id, while
+ * DRM_XE_RAS_ERROR_COMPONENT_NAMES maps error-id to error-name.
+ *
+ * User Interface
+ * ==============
+ *
+ * To retrieve the value of an error counter, userspace applications should
+ * follow the steps below:
+ *
+ * 1. Use command LIST_NODES to enumerate all available nodes
+ * 2. Select node by node-id or node-name
+ * 3. Use command GET_ERROR_COUNTER to list the errors of a specific node
+ * 4. Query specific error values using either error-id or error-name
+ *
+ * .. code-block:: C
+ *
+ * // Lookup tables for ID-to-name resolution
+ * static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+ * static const char *errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
+ *
+ */
+
+/**
+ * enum drm_xe_ras_error_severity - DRM RAS error severity.
+ */
+enum drm_xe_ras_error_severity {
+ /** @DRM_XE_RAS_ERR_SEV_CORRECTABLE: Correctable Error */
+ DRM_XE_RAS_ERR_SEV_CORRECTABLE = 0,
+ /** @DRM_XE_RAS_ERR_SEV_UNCORRECTABLE: Uncorrectable Error */
+ DRM_XE_RAS_ERR_SEV_UNCORRECTABLE,
+ /** @DRM_XE_RAS_ERR_SEV_MAX: Max severity */
+ DRM_XE_RAS_ERR_SEV_MAX /* non-ABI */
+};
+
+/**
+ * enum drm_xe_ras_error_component - DRM RAS error component.
+ */
+enum drm_xe_ras_error_component {
+ /** @DRM_XE_RAS_ERR_COMP_CORE_COMPUTE: Core Compute Error */
+ DRM_XE_RAS_ERR_COMP_CORE_COMPUTE = 1,
+ /** @DRM_XE_RAS_ERR_COMP_SOC_INTERNAL: SoC Internal Error */
+ DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
+ /** @DRM_XE_RAS_ERR_COMP_MAX: Max Error */
+ DRM_XE_RAS_ERR_COMP_MAX /* non-ABI */
+};
+
+/*
+ * Error severity to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_SEVERITY_NAMES { \
+ [DRM_XE_RAS_ERR_SEV_CORRECTABLE] = "correctable-errors", \
+ [DRM_XE_RAS_ERR_SEV_UNCORRECTABLE] = "uncorrectable-errors", \
+}
+
+/*
+ * Error component to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_COMPONENT_NAMES { \
+ [DRM_XE_RAS_ERR_COMP_CORE_COMPUTE] = "core-compute", \
+ [DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = "soc-internal" \
+}
+
#if defined(__cplusplus)
}
#endif
--
2.47.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
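The DOC comment in the xe_drm.h hunk above suggests that userspace build lookup tables from DRM_XE_RAS_ERROR_COMPONENT_NAMES. A rough sketch of what that could look like follows. The enum and macro are local copies of the definitions in the patch so that the snippet builds on its own; demo_error_name() and demo_error_id() are hypothetical helpers and not part of the uAPI. Because the component enum starts at 1, index 0 of the table is a NULL hole that lookups have to tolerate.

```c
#include <stddef.h>
#include <string.h>

/* Local copies of the uAPI definitions from the patch (for a standalone build). */
enum drm_xe_ras_error_component {
	DRM_XE_RAS_ERR_COMP_CORE_COMPUTE = 1,
	DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
	DRM_XE_RAS_ERR_COMP_MAX /* non-ABI */
};

#define DRM_XE_RAS_ERROR_COMPONENT_NAMES {				\
	[DRM_XE_RAS_ERR_COMP_CORE_COMPUTE] = "core-compute",		\
	[DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = "soc-internal"		\
}

static const char * const errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;

#define DEMO_NUM_ERRORS (sizeof(errors) / sizeof(errors[0]))

/* Hypothetical helper: error-id to error-name, NULL if unknown. */
static const char *demo_error_name(unsigned int id)
{
	if (id >= DEMO_NUM_ERRORS)
		return NULL;

	return errors[id];	/* NULL for the unused slot at index 0 */
}

/* Hypothetical helper: error-name to error-id, -1 if unknown. */
static int demo_error_id(const char *name)
{
	size_t i;

	for (i = 0; i < DEMO_NUM_ERRORS; i++) {
		if (errors[i] && !strcmp(errors[i], name))
			return (int)i;
	}

	return -1;
}
```

The resolved ID can then be passed as "error-id" in a get-error-counter request, as in the ynl examples in the commit message.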
* [PATCH v9 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2026-02-28 8:08 ` [PATCH v9 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
@ 2026-02-28 8:08 ` Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Riana Tauro
Initialize DRM RAS in hw error init. Map the UAPI error severities
with the hardware error severities and refactor file.
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Fix hardware error enum
add severity_str in csc handler
simplify hw_error_info_init() function
use drm_err if initialization fails (Raag)
v3: print error on failure (Raag)
v4: use const (Raag)
---
drivers/gpu/drm/xe/xe_drm_ras_types.h | 8 ++++
drivers/gpu/drm/xe/xe_hw_error.c | 62 +++++++++++++++------------
2 files changed, 42 insertions(+), 28 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
index 7acc5e7377b2..8d729ad6a264 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -11,6 +11,14 @@
struct drm_ras_node;
+/* Error categories reported by hardware */
+enum hardware_error {
+ HARDWARE_ERROR_CORRECTABLE = 0,
+ HARDWARE_ERROR_NONFATAL,
+ HARDWARE_ERROR_FATAL,
+ HARDWARE_ERROR_MAX
+};
+
/**
* struct xe_drm_ras_counter - XE RAS counter
*
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..baae050163df 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -10,20 +10,16 @@
#include "regs/xe_irq_regs.h"
#include "xe_device.h"
+#include "xe_drm_ras.h"
#include "xe_hw_error.h"
#include "xe_mmio.h"
#include "xe_survivability_mode.h"
#define HEC_UNCORR_FW_ERR_BITS 4
+
extern struct fault_attr inject_csc_hw_error;
-/* Error categories reported by hardware */
-enum hardware_error {
- HARDWARE_ERROR_CORRECTABLE = 0,
- HARDWARE_ERROR_NONFATAL = 1,
- HARDWARE_ERROR_FATAL = 2,
- HARDWARE_ERROR_MAX,
-};
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
static const char * const hec_uncorrected_fw_errors[] = {
"Fatal",
@@ -32,23 +28,18 @@ static const char * const hec_uncorrected_fw_errors[] = {
"Data Corruption"
};
-static const char *hw_error_to_str(const enum hardware_error hw_err)
+static bool fault_inject_csc_hw_error(void)
{
- switch (hw_err) {
- case HARDWARE_ERROR_CORRECTABLE:
- return "CORRECTABLE";
- case HARDWARE_ERROR_NONFATAL:
- return "NONFATAL";
- case HARDWARE_ERROR_FATAL:
- return "FATAL";
- default:
- return "UNKNOWN";
- }
+ return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
}
-static bool fault_inject_csc_hw_error(void)
+static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_error hw_err)
{
- return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+ return DRM_XE_RAS_ERR_SEV_CORRECTABLE;
+
+ /* Uncorrectable errors comprise both fatal and non-fatal errors */
+ return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
}
static void csc_hw_error_work(struct work_struct *work)
@@ -64,7 +55,8 @@ static void csc_hw_error_work(struct work_struct *work)
static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
{
- const char *hw_err_str = hw_error_to_str(hw_err);
+ const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+ const char *severity_str = error_severity[severity];
struct xe_device *xe = tile_to_xe(tile);
struct xe_mmio *mmio = &tile->mmio;
u32 base, err_bit, err_src;
@@ -77,8 +69,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
lockdep_assert_held(&xe->irq.lock);
err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
if (!err_src) {
- drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
- tile->id, hw_err_str);
+ drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s HEC_ERR_STATUS register blank\n",
+ tile->id, severity_str);
return;
}
@@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
drm_err_ratelimited(&xe->drm, HW_ERR
- "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
- hw_err_str, hec_uncorrected_fw_errors[err_bit],
+ "HEC FW %s %s reported, bit[%d] is set\n",
+ hec_uncorrected_fw_errors[err_bit], severity_str,
err_bit);
schedule_work(&tile->csc_hw_error_work);
@@ -99,7 +91,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
{
- const char *hw_err_str = hw_error_to_str(hw_err);
+ const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+ const char *severity_str = error_severity[severity];
struct xe_device *xe = tile_to_xe(tile);
unsigned long flags;
u32 err_src;
@@ -110,8 +103,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
spin_lock_irqsave(&xe->irq.lock, flags);
err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
if (!err_src) {
- drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
- tile->id, hw_err_str);
+ drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s DEV_ERR_STAT register blank!\n",
+ tile->id, severity_str);
goto unlock;
}
@@ -146,6 +139,14 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
hw_error_source_handler(tile, hw_err);
}
+static int hw_error_info_init(struct xe_device *xe)
+{
+ if (xe->info.platform != XE_PVC)
+ return 0;
+
+ return xe_drm_ras_init(xe);
+}
+
/*
* Process hardware errors during boot
*/
@@ -172,11 +173,16 @@ static void process_hw_errors(struct xe_device *xe)
void xe_hw_error_init(struct xe_device *xe)
{
struct xe_tile *tile = xe_device_get_root_tile(xe);
+ int ret;
if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
return;
INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+ ret = hw_error_info_init(xe);
+ if (ret)
+ drm_err(&xe->drm, "Failed to initialize XE DRM RAS (%pe)\n", ERR_PTR(ret));
+
process_hw_errors(xe);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
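Patch 3/5 folds the three hardware error categories into the two uAPI severities and then uses the severity name in log messages. The standalone sketch below restates that mapping so the effect on the log strings is easy to see. The enums, the name table and hw_err_to_severity() are copied from the patches; demo_severity_str() is a hypothetical wrapper added only for illustration.

```c
/* Local copies of the enums from the patches (for a standalone build). */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
	HARDWARE_ERROR_MAX
};

enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERR_SEV_CORRECTABLE = 0,
	DRM_XE_RAS_ERR_SEV_UNCORRECTABLE,
	DRM_XE_RAS_ERR_SEV_MAX /* non-ABI */
};

static const char * const error_severity[] = {
	[DRM_XE_RAS_ERR_SEV_CORRECTABLE] = "correctable-errors",
	[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE] = "uncorrectable-errors",
};

static enum drm_xe_ras_error_severity
hw_err_to_severity(const enum hardware_error hw_err)
{
	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
		return DRM_XE_RAS_ERR_SEV_CORRECTABLE;

	/* Fatal and non-fatal errors both land in the uncorrectable node. */
	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
}

/* Hypothetical wrapper: the string a handler would print for hw_err. */
static const char *demo_severity_str(const enum hardware_error hw_err)
{
	return error_severity[hw_err_to_severity(hw_err)];
}
```

One consequence worth keeping in mind when reading logs: after this change, fatal and non-fatal errors print the same "uncorrectable-errors" string, so the two can no longer be told apart from the message alone.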
* [PATCH v9 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
` (2 preceding siblings ...)
2026-02-28 8:08 ` [PATCH v9 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
@ 2026-02-28 8:08 ` Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
2026-03-03 4:32 ` Claude review: Introduce DRM_RAS using generic netlink for RAS Claude Code Review Bot
5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Riana Tauro, Himal Prasad Ghimiray
PVC supports GT error reporting via vector registers along with
error status register. Add support to report these errors and
update respective counters. In case of Subslice error reported
by vector register, process the error status register
for applicable bits.
The counter is embedded in the xe drm ras structure and is
exposed to the userspace using the drm_ras generic netlink
interface.
$ sudo ynl --family drm_ras --do get-error-counter \
--json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)
Add documentation
Modify commit message
v3: remove 'error' from counters
use drmm_kcalloc
add a for_each for severity
differentiate error classes and severity in UAPI (Raag)
Use correctable and uncorrectable in uapi (Pratik / Aravind)
v4: modify enums in UAPI
improve comments
add bounds check in handler
add error mask macro (Raag)
use atomic_t
add null pointer checks
v5: fix alignment
remove info check
fix csc bit check (Raag)
v6: add macro for gt vector len (Raag)
---
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 61 ++++++-
drivers/gpu/drm/xe/xe_hw_error.c | 195 +++++++++++++++++++--
2 files changed, 236 insertions(+), 20 deletions(-)
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index c146b9ef44eb..cd17d7d7372c 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,15 +6,58 @@
#ifndef _XE_HW_ERROR_REGS_H_
#define _XE_HW_ERROR_REGS_H_
-#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118)
-#define UNCORR_FW_REPORTED_ERR BIT(6)
+#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118)
+#define UNCORR_FW_REPORTED_ERR REG_BIT(6)
-#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124)
+#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124)
-#define DEV_ERR_STAT_NONFATAL 0x100178
-#define DEV_ERR_STAT_CORRECTABLE 0x10017c
-#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
- DEV_ERR_STAT_CORRECTABLE, \
- DEV_ERR_STAT_NONFATAL))
-#define XE_CSC_ERROR BIT(17)
+#define ERR_STAT_GT_COR 0x100160
+#define EU_GRF_COR_ERR REG_BIT(15)
+#define EU_IC_COR_ERR REG_BIT(14)
+#define SLM_COR_ERR REG_BIT(13)
+#define GUC_COR_ERR REG_BIT(1)
+
+#define ERR_STAT_GT_NONFATAL 0x100164
+#define ERR_STAT_GT_FATAL 0x100168
+#define EU_GRF_FAT_ERR REG_BIT(15)
+#define SLM_FAT_ERR REG_BIT(13)
+#define GUC_FAT_ERR REG_BIT(6)
+#define FPU_FAT_ERR REG_BIT(3)
+
+#define ERR_STAT_GT_REG(x) XE_REG(_PICK_EVEN((x), \
+ ERR_STAT_GT_COR, \
+ ERR_STAT_GT_NONFATAL))
+
+#define PVC_COR_ERR_MASK (GUC_COR_ERR | SLM_COR_ERR | \
+ EU_IC_COR_ERR | EU_GRF_COR_ERR)
+
+#define PVC_FAT_ERR_MASK (FPU_FAT_ERR | GUC_FAT_ERR | \
+ EU_GRF_FAT_ERR | SLM_FAT_ERR)
+
+#define DEV_ERR_STAT_NONFATAL 0x100178
+#define DEV_ERR_STAT_CORRECTABLE 0x10017c
+#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
+ DEV_ERR_STAT_CORRECTABLE, \
+ DEV_ERR_STAT_NONFATAL))
+
+#define XE_CSC_ERROR 17
+#define XE_GT_ERROR 0
+
+#define ERR_STAT_GT_FATAL_VECTOR_0 0x100260
+#define ERR_STAT_GT_FATAL_VECTOR_1 0x100264
+
+#define ERR_STAT_GT_FATAL_VECTOR_REG(x) XE_REG(_PICK_EVEN((x), \
+ ERR_STAT_GT_FATAL_VECTOR_0, \
+ ERR_STAT_GT_FATAL_VECTOR_1))
+
+#define ERR_STAT_GT_COR_VECTOR_0 0x1002a0
+#define ERR_STAT_GT_COR_VECTOR_1 0x1002a4
+
+#define ERR_STAT_GT_COR_VECTOR_REG(x) XE_REG(_PICK_EVEN((x), \
+ ERR_STAT_GT_COR_VECTOR_0, \
+ ERR_STAT_GT_COR_VECTOR_1))
+
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x) (hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+ ERR_STAT_GT_COR_VECTOR_REG(x) : \
+ ERR_STAT_GT_FATAL_VECTOR_REG(x))
#endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index baae050163df..8062a6ff98a4 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
* Copyright © 2025 Intel Corporation
*/
+#include <linux/bitmap.h>
#include <linux/fault-inject.h>
#include "regs/xe_gsc_regs.h"
@@ -15,7 +16,13 @@
#include "xe_mmio.h"
#include "xe_survivability_mode.h"
-#define HEC_UNCORR_FW_ERR_BITS 4
+#define GT_HW_ERROR_MAX_ERR_BITS 16
+#define HEC_UNCORR_FW_ERR_BITS 4
+#define XE_RAS_REG_SIZE 32
+
+#define PVC_ERROR_MASK_SET(hw_err, err_bit) ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
+ (PVC_COR_ERR_MASK & REG_BIT(err_bit)) : \
+ (PVC_FAT_ERR_MASK & REG_BIT(err_bit)))
extern struct fault_attr inject_csc_hw_error;
@@ -28,10 +35,24 @@ static const char * const hec_uncorrected_fw_errors[] = {
"Data Corruption"
};
-static bool fault_inject_csc_hw_error(void)
-{
- return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
-}
+static const unsigned long xe_hw_error_map[] = {
+ [XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
+};
+
+enum gt_vector_regs {
+ ERR_STAT_GT_VECTOR0 = 0,
+ ERR_STAT_GT_VECTOR1,
+ ERR_STAT_GT_VECTOR2,
+ ERR_STAT_GT_VECTOR3,
+ ERR_STAT_GT_VECTOR4,
+ ERR_STAT_GT_VECTOR5,
+ ERR_STAT_GT_VECTOR6,
+ ERR_STAT_GT_VECTOR7,
+ ERR_STAT_GT_VECTOR_MAX
+};
+
+#define PVC_GT_VECTOR_LEN(hw_err) ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
+ ERR_STAT_GT_VECTOR4 : ERR_STAT_GT_VECTOR_MAX)
static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_error hw_err)
{
@@ -42,6 +63,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_err
return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
}
+static bool fault_inject_csc_hw_error(void)
+{
+ return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+}
+
static void csc_hw_error_work(struct work_struct *work)
{
struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
@@ -89,15 +115,119 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
}
+static void log_hw_error(struct xe_tile *tile, const char *name,
+ const enum drm_xe_ras_error_severity severity)
+{
+ const char *severity_str = error_severity[severity];
+ struct xe_device *xe = tile_to_xe(tile);
+
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
+ else
+ drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
+}
+
+static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
+ const enum drm_xe_ras_error_severity severity)
+{
+ const char *severity_str = error_severity[severity];
+ struct xe_device *xe = tile_to_xe(tile);
+
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+ name, severity_str, i, err);
+ else
+ drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+ name, severity_str, i, err);
+}
+
+static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+ u32 error_id)
+{
+ const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+ struct xe_device *xe = tile_to_xe(tile);
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[severity];
+ struct xe_mmio *mmio = &tile->mmio;
+ unsigned long err_stat = 0;
+ int i;
+
+ if (xe->info.platform != XE_PVC)
+ return;
+
+ if (hw_err == HARDWARE_ERROR_NONFATAL) {
+ atomic_inc(&info[error_id].counter);
+ log_hw_error(tile, info[error_id].name, severity);
+ return;
+ }
+
+ for (i = 0; i < PVC_GT_VECTOR_LEN(hw_err); i++) {
+ u32 vector, val;
+
+ vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
+ if (!vector)
+ continue;
+
+ switch (i) {
+ case ERR_STAT_GT_VECTOR0:
+ case ERR_STAT_GT_VECTOR1: {
+ u32 errbit;
+
+ val = hweight32(vector);
+ atomic_add(val, &info[error_id].counter);
+ log_gt_err(tile, "Subslice", i, vector, severity);
+
+ /*
+ * Error status register is only populated once per error.
+ * Read the register and clear once.
+ */
+ if (err_stat)
+ break;
+
+ err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
+ for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
+ if (PVC_ERROR_MASK_SET(hw_err, errbit))
+ atomic_inc(&info[error_id].counter);
+ }
+ if (err_stat)
+ xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
+ break;
+ }
+ case ERR_STAT_GT_VECTOR2:
+ case ERR_STAT_GT_VECTOR3:
+ val = hweight32(vector);
+ atomic_add(val, &info[error_id].counter);
+ log_gt_err(tile, "L3 BANK", i, vector, severity);
+ break;
+ case ERR_STAT_GT_VECTOR6:
+ val = hweight32(vector);
+ atomic_add(val, &info[error_id].counter);
+ log_gt_err(tile, "TLB", i, vector, severity);
+ break;
+ case ERR_STAT_GT_VECTOR7:
+ val = hweight32(vector);
+ atomic_add(val, &info[error_id].counter);
+ log_gt_err(tile, "L3 Fabric", i, vector, severity);
+ break;
+ default:
+ log_gt_err(tile, "Undefined", i, vector, severity);
+ }
+
+ xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
+ }
+}
+
static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
{
const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
const char *severity_str = error_severity[severity];
struct xe_device *xe = tile_to_xe(tile);
- unsigned long flags;
- u32 err_src;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[severity];
+ unsigned long flags, err_src;
+ u32 err_bit;
- if (xe->info.platform != XE_BATTLEMAGE)
+ if (!IS_DGFX(xe))
return;
spin_lock_irqsave(&xe->irq.lock, flags);
@@ -108,11 +238,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
goto unlock;
}
- if (err_src & XE_CSC_ERROR)
+ /*
+ * On encountering CSC firmware errors, the graphics device becomes unrecoverable
+ * so return immediately on error. The only way to recover from these errors is
+ * firmware flash. The device will enter Runtime Survivability mode when such
+ * errors are detected.
+ */
+ if (err_src & REG_BIT(XE_CSC_ERROR)) {
csc_hw_error_handler(tile, hw_err);
+ goto clear_reg;
+ }
- xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+ if (!info)
+ goto clear_reg;
+
+ for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
+ const char *name;
+ u32 error_id;
+
+ /* Check error bit is within bounds */
+ if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
+ break;
+
+ error_id = xe_hw_error_map[err_bit];
+
+ /* Check error component is within max */
+ if (!error_id || error_id >= DRM_XE_RAS_ERR_COMP_MAX)
+ continue;
+ name = info[error_id].name;
+ if (!name)
+ continue;
+
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
+ drm_warn(&xe->drm, HW_ERR
+ "TILE%d reported %s %s, bit[%d] is set\n",
+ tile->id, name, severity_str, err_bit);
+ } else {
+ drm_err_ratelimited(&xe->drm, HW_ERR
+ "TILE%d reported %s %s, bit[%d] is set\n",
+ tile->id, name, severity_str, err_bit);
+ }
+ if (err_bit == XE_GT_ERROR)
+ gt_hw_error_handler(tile, hw_err, error_id);
+ }
+
+clear_reg:
+ xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
unlock:
spin_unlock_irqrestore(&xe->irq.lock, flags);
}
@@ -134,9 +306,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
if (fault_inject_csc_hw_error())
schedule_work(&tile->csc_hw_error_work);
- for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+ for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
if (master_ctl & ERROR_IRQ(hw_err))
hw_error_source_handler(tile, hw_err);
+ }
}
static int hw_error_info_init(struct xe_device *xe)
--
2.47.1
* [PATCH v9 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
` (3 preceding siblings ...)
2026-02-28 8:08 ` [PATCH v9 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-02-28 8:08 ` Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-03-03 4:32 ` Claude review: Introduce DRM_RAS using generic netlink for RAS Claude Code Review Bot
5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2026-02-28 8:08 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, Riana Tauro, Himal Prasad Ghimiray
Report the SoC nonfatal/fatal hardware error and update the counters.
$ sudo ynl --family drm_ras --do get-error-counter \
--json '{"node-id":0, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)
v3: reorder and align arrays
remove redundant string err
use REG_BIT
fix aesthetic review comments (Raag)
use only correctable/uncorrectable error severity (Aravind)
v4: fix comments
use master as variable name
add static_assert (Raag)
v5: remove blank lines
fix alignment
rename variables of base registers
move register initialization outside loop (Raag)
v6: add macro for master local reg (Raag)
---
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 25 +++
drivers/gpu/drm/xe/xe_hw_error.c | 206 +++++++++++++++++++++
2 files changed, 231 insertions(+)
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index cd17d7d7372c..046e1756c698 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -41,6 +41,7 @@
DEV_ERR_STAT_NONFATAL))
#define XE_CSC_ERROR 17
+#define XE_SOC_ERROR 16
#define XE_GT_ERROR 0
#define ERR_STAT_GT_FATAL_VECTOR_0 0x100260
@@ -60,4 +61,28 @@
#define ERR_STAT_GT_VECTOR_REG(hw_err, x) (hw_err == HARDWARE_ERROR_CORRECTABLE ? \
ERR_STAT_GT_COR_VECTOR_REG(x) : \
ERR_STAT_GT_FATAL_VECTOR_REG(x))
+
+#define SOC_PVC_MASTER_BASE 0x282000
+#define SOC_PVC_SLAVE_BASE 0x283000
+
+#define SOC_GCOERRSTS 0x200
+#define SOC_GNFERRSTS 0x210
+#define SOC_GLOBAL_ERR_STAT_REG(base, x) XE_REG(_PICK_EVEN((x), \
+ (base) + SOC_GCOERRSTS, \
+ (base) + SOC_GNFERRSTS))
+#define SOC_SLAVE_IEH REG_BIT(1)
+#define SOC_IEH0_LOCAL_ERR_STATUS REG_BIT(0)
+#define SOC_IEH1_LOCAL_ERR_STATUS REG_BIT(0)
+
+#define SOC_GSYSEVTCTL 0x264
+#define SOC_GSYSEVTCTL_REG(master, slave, x) XE_REG(_PICK_EVEN((x), \
+ (master) + SOC_GSYSEVTCTL, \
+ (slave) + SOC_GSYSEVTCTL))
+
+#define SOC_LERRUNCSTS 0x280
+#define SOC_LERRCORSTS 0x294
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err) XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+ (base) + SOC_LERRCORSTS : \
+ (base) + SOC_LERRUNCSTS)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8062a6ff98a4..2a31b430570e 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -19,6 +19,7 @@
#define GT_HW_ERROR_MAX_ERR_BITS 16
#define HEC_UNCORR_FW_ERR_BITS 4
#define XE_RAS_REG_SIZE 32
+#define XE_SOC_NUM_IEH 2
#define PVC_ERROR_MASK_SET(hw_err, err_bit) ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
(PVC_COR_ERR_MASK & REG_BIT(err_bit)) : \
@@ -37,6 +38,7 @@ static const char * const hec_uncorrected_fw_errors[] = {
static const unsigned long xe_hw_error_map[] = {
[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
+ [XE_SOC_ERROR] = DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
};
enum gt_vector_regs {
@@ -63,6 +65,101 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_err
return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
}
+static const char * const pvc_master_global_err_reg[] = {
+ [0 ... 1] = "Undefined",
+ [2] = "HBM SS0: Channel0",
+ [3] = "HBM SS0: Channel1",
+ [4] = "HBM SS0: Channel2",
+ [5] = "HBM SS0: Channel3",
+ [6] = "HBM SS0: Channel4",
+ [7] = "HBM SS0: Channel5",
+ [8] = "HBM SS0: Channel6",
+ [9] = "HBM SS0: Channel7",
+ [10] = "HBM SS1: Channel0",
+ [11] = "HBM SS1: Channel1",
+ [12] = "HBM SS1: Channel2",
+ [13] = "HBM SS1: Channel3",
+ [14] = "HBM SS1: Channel4",
+ [15] = "HBM SS1: Channel5",
+ [16] = "HBM SS1: Channel6",
+ [17] = "HBM SS1: Channel7",
+ [18 ... 31] = "Undefined",
+};
+static_assert(ARRAY_SIZE(pvc_master_global_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_slave_global_err_reg[] = {
+ [0] = "Undefined",
+ [1] = "HBM SS2: Channel0",
+ [2] = "HBM SS2: Channel1",
+ [3] = "HBM SS2: Channel2",
+ [4] = "HBM SS2: Channel3",
+ [5] = "HBM SS2: Channel4",
+ [6] = "HBM SS2: Channel5",
+ [7] = "HBM SS2: Channel6",
+ [8] = "HBM SS2: Channel7",
+ [9] = "HBM SS3: Channel0",
+ [10] = "HBM SS3: Channel1",
+ [11] = "HBM SS3: Channel2",
+ [12] = "HBM SS3: Channel3",
+ [13] = "HBM SS3: Channel4",
+ [14] = "HBM SS3: Channel5",
+ [15] = "HBM SS3: Channel6",
+ [16] = "HBM SS3: Channel7",
+ [17] = "Undefined",
+ [18] = "ANR MDFI",
+ [19 ... 31] = "Undefined",
+};
+static_assert(ARRAY_SIZE(pvc_slave_global_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_slave_local_fatal_err_reg[] = {
+ [0] = "Local IEH: Malformed PCIe AER",
+ [1] = "Local IEH: Malformed PCIe ERR",
+ [2] = "Local IEH: UR conditions in IEH",
+ [3] = "Local IEH: From SERR Sources",
+ [4 ... 19] = "Undefined",
+ [20] = "Malformed MCA error packet (HBM/Punit)",
+ [21 ... 31] = "Undefined",
+};
+static_assert(ARRAY_SIZE(pvc_slave_local_fatal_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_master_local_fatal_err_reg[] = {
+ [0] = "Local IEH: Malformed IOSF PCIe AER",
+ [1] = "Local IEH: Malformed IOSF PCIe ERR",
+ [2] = "Local IEH: UR RESPONSE",
+ [3] = "Local IEH: From SERR SPI controller",
+ [4] = "Base Die MDFI T2T",
+ [5] = "Undefined",
+ [6] = "Base Die MDFI T2C",
+ [7] = "Undefined",
+ [8] = "Invalid CSC PSF Command Parity",
+ [9] = "Invalid CSC PSF Unexpected Completion",
+ [10] = "Invalid CSC PSF Unsupported Request",
+ [11] = "Invalid PCIe PSF Command Parity",
+ [12] = "PCIe PSF Unexpected Completion",
+ [13] = "PCIe PSF Unsupported Request",
+ [14 ... 19] = "Undefined",
+ [20] = "Malformed MCA error packet (HBM/Punit)",
+ [21 ... 31] = "Undefined",
+};
+static_assert(ARRAY_SIZE(pvc_master_local_fatal_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_master_local_nonfatal_err_reg[] = {
+ [0 ... 3] = "Undefined",
+ [4] = "Base Die MDFI T2T",
+ [5] = "Undefined",
+ [6] = "Base Die MDFI T2C",
+ [7] = "Undefined",
+ [8] = "Invalid CSC PSF Command Parity",
+ [9] = "Invalid CSC PSF Unexpected Completion",
+ [10] = "Invalid PCIe PSF Command Parity",
+ [11 ... 31] = "Undefined",
+};
+static_assert(ARRAY_SIZE(pvc_master_local_nonfatal_err_reg) == XE_RAS_REG_SIZE);
+
+#define PVC_MASTER_LOCAL_REG_INFO(hw_err) ((hw_err == HARDWARE_ERROR_FATAL) ? \
+ pvc_master_local_fatal_err_reg : \
+ pvc_master_local_nonfatal_err_reg)
+
static bool fault_inject_csc_hw_error(void)
{
return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -141,6 +238,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
name, severity_str, i, err);
}
+static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
+ const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
+{
+ const char *severity_str = error_severity[severity];
+ struct xe_device *xe = tile_to_xe(tile);
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[severity];
+ const char *name;
+
+ name = reg_info[err_bit];
+
+ if (strcmp(name, "Undefined")) {
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
+ else
+ drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
+ atomic_inc(&info[index].counter);
+ }
+}
+
static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
u32 error_id)
{
@@ -217,6 +334,92 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
}
}
+static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
+{
+ const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+ unsigned long slave_global_errstat, slave_local_errstat;
+ struct xe_mmio *mmio = &tile->mmio;
+ u32 regbit, slave;
+
+ slave = SOC_PVC_SLAVE_BASE;
+ slave_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave, hw_err));
+
+ if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
+ slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave, hw_err));
+
+ if (hw_err == HARDWARE_ERROR_FATAL) {
+ for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
+ log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
+ regbit, error_id);
+ }
+
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave, hw_err),
+ slave_local_errstat);
+ }
+
+ for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
+ log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
+
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave, hw_err), slave_global_errstat);
+}
+
+static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+ u32 error_id)
+{
+ const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+ struct xe_device *xe = tile_to_xe(tile);
+ struct xe_mmio *mmio = &tile->mmio;
+ unsigned long master_global_errstat, master_local_errstat;
+ u32 master, slave, regbit;
+ int i;
+
+ if (xe->info.platform != XE_PVC)
+ return;
+
+ master = SOC_PVC_MASTER_BASE;
+ slave = SOC_PVC_SLAVE_BASE;
+
+ /* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
+ for (i = 0; i < XE_SOC_NUM_IEH; i++)
+ xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i), ~REG_BIT(hw_err));
+
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0));
+ goto unmask_gsysevtctl;
+ }
+
+ /*
+ * Read the master global IEH error register, if BIT(1) is set then process
+ * the slave IEH first. If BIT(0) in global error register is set then process
+ * the corresponding local error registers.
+ */
+ master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(master, hw_err));
+ if (master_global_errstat & SOC_SLAVE_IEH)
+ soc_slave_ieh_handler(tile, hw_err, error_id);
+
+ if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
+ master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master, hw_err));
+
+ for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE)
+ log_soc_error(tile, PVC_MASTER_LOCAL_REG_INFO(hw_err), severity, regbit, error_id);
+
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master, hw_err), master_local_errstat);
+ }
+
+ for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
+ log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
+
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master, hw_err), master_global_errstat);
+
+unmask_gsysevtctl:
+ for (i = 0; i < XE_SOC_NUM_IEH; i++)
+ xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i),
+ (HARDWARE_ERROR_MAX << 1) + 1);
+}
+
static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
{
const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
@@ -279,8 +482,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
"TILE%d reported %s %s, bit[%d] is set\n",
tile->id, name, severity_str, err_bit);
}
+
if (err_bit == XE_GT_ERROR)
gt_hw_error_handler(tile, hw_err, error_id);
+ if (err_bit == XE_SOC_ERROR)
+ soc_hw_error_handler(tile, hw_err, error_id);
}
clear_reg:
--
2.47.1
* Re: [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-02-28 16:04 ` Jakub Kicinski
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
1 sibling, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2026-02-28 16:04 UTC (permalink / raw)
To: Riana Tauro
Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
anvesh.bakwad, Zack McKevitt, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet, netdev
On Sat, 28 Feb 2026 13:38:25 +0530 Riana Tauro wrote:
> Documentation/gpu/drm-ras.rst | 103 +++++++
> Documentation/gpu/index.rst | 1 +
> Documentation/netlink/specs/drm_ras.yaml | 115 ++++++++
> drivers/gpu/drm/Kconfig | 10 +
> drivers/gpu/drm/Makefile | 1 +
> drivers/gpu/drm/drm_drv.c | 6 +
> drivers/gpu/drm/drm_ras.c | 354 +++++++++++++++++++++++
> drivers/gpu/drm/drm_ras_genl_family.c | 42 +++
> drivers/gpu/drm/drm_ras_nl.c | 56 ++++
> drivers/gpu/drm/drm_ras_nl.h | 24 ++
> include/drm/drm_ras.h | 75 +++++
> include/drm/drm_ras_genl_family.h | 17 ++
> include/uapi/drm/drm_ras.h | 49 ++++
Doesn't apply to my local trees any more but _looks_ right :)
Assuming ./tools/net/ynl/ynl-regen.sh -f doesn't result in a dirty tree:
Acked-by: Jakub Kicinski <kuba@kernel.org>
> + if (!info->attrs ||
> + !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
> + !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
> + return -EINVAL;
Sorry for not noticing earlier but you could use GENL_REQ_ATTR_CHECK()
here. It will give the user a nice error pointing out which attr was
missing (assuming user space uses a competent ynl-based library).
* Claude review: Introduce DRM_RAS using generic netlink for RAS
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
` (4 preceding siblings ...)
2026-02-28 8:08 ` [PATCH v9 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
@ 2026-03-03 4:32 ` Claude Code Review Bot
5 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: Introduce DRM_RAS using generic netlink for RAS
Author: Riana Tauro <riana.tauro@intel.com>
Patches: 7
Reviewed: 2026-03-03T14:32:49.890405
---
This series introduces a DRM RAS (Reliability, Availability, Serviceability) infrastructure over generic netlink, with an initial Xe/PVC driver implementation. The architecture — a single `drm-ras` genl family with per-driver nodes — is reasonable and follows the modern YAML-based netlink approach. However, there are several bugs that need to be fixed: a memory leak in the netlink doit path, use of an uninitialized variable in dump handlers, missing cleanup on partial registration failure, loss of error granularity (FATAL vs NONFATAL) in log messages, and a fragile "Undefined" string-comparison pattern. The series is on the right track for v10 but needs another pass.
---
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/ras: Introduce the DRM RAS infrastructure over generic netlink
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2026-02-28 16:04 ` Jakub Kicinski
@ 2026-03-03 4:32 ` Claude Code Review Bot
1 sibling, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Memory leak in `doit_reply_value()`:**
When `get_node_error_counter()` fails, `msg` (allocated by `genlmsg_new`) is leaked:
```c
+ ret = get_node_error_counter(node_id, error_id,
+ &error_name, &value);
+ if (ret)
+ return ret; /* msg leaked here */
```
This needs `nlmsg_free(msg)` before the return.
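The shape of that fix can be sketched in userspace, assuming a single goto-based cleanup label. `msg_new()`, `msg_free()`, and `lookup_counter()` are hypothetical stand-ins for `genlmsg_new()`, `nlmsg_free()`, and `get_node_error_counter()`, and `live_msgs` exists only to make the leak observable; this is an illustration of the pattern, not the driver code:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Outstanding allocations; illustrative bookkeeping only. */
static int live_msgs;

struct msg {
	int value;
};

/* Hypothetical stand-in for genlmsg_new(). */
static struct msg *msg_new(void)
{
	struct msg *m = malloc(sizeof(*m));

	if (m)
		live_msgs++;
	return m;
}

/* Hypothetical stand-in for nlmsg_free(). */
static void msg_free(struct msg *m)
{
	if (m) {
		free(m);
		live_msgs--;
	}
}

/* Hypothetical lookup: only error_id 1 has a counter. */
static int lookup_counter(int error_id, int *value)
{
	if (error_id != 1)
		return -ENOENT;
	*value = 42;
	return 0;
}

static int reply_value(int error_id, struct msg **out)
{
	struct msg *msg;
	int value, ret;

	assert(out);

	msg = msg_new();
	if (!msg)
		return -ENOMEM;

	ret = lookup_counter(error_id, &value);
	if (ret)
		goto err_free;	/* every failure after allocation frees msg */

	msg->value = value;
	*out = msg;		/* ownership passes to the caller on success */
	return 0;

err_free:
	msg_free(msg);
	return ret;
}
```

Routing every post-allocation failure through one label keeps later additions to the function from reintroducing the leak.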
**Uninitialized `ret` in `drm_ras_nl_list_nodes_dumpit()`:**
`ret` is declared but never initialized. If `drm_ras_xa` is empty, `xa_for_each_start` never executes the loop body, and `ret` is used uninitialized:
```c
+ int ret;
+ ...
+ xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
+ ...
+ }
+
+ if (ret == -EMSGSIZE) /* UB: ret uninitialized if xarray empty */
```
Initialize `ret = 0`.
**Related issue in `drm_ras_nl_get_error_counter_dumpit()`:**
If all error IDs in the range return `-ENOENT` (skipped via `continue`), `ret` is still `-ENOENT` when the loop ends, so the function returns `-ENOENT` to userspace instead of `0` (dump complete). Here too `ret` needs to be initialized to `0`, and the skip path needs to reset `ret` to `0` before continuing.
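Both dump-handler points (a defined `ret` when the loop never runs, and no stale `-ENOENT` after a skipped ID) can be illustrated with a small userspace model. The table layout, `query()`, and `dump_counters()` are hypothetical and only mimic the control flow described above; real netlink dump handlers have additional conventions that this sketch does not model:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical counter table: a negative entry marks an ID with no counter. */
static int query(const int *table, size_t id, int *value)
{
	if (table[id] < 0)
		return -ENOENT;
	*value = table[id];
	return 0;
}

static int dump_counters(const int *table, size_t n,
			 int *out, size_t cap, size_t *written)
{
	int ret = 0;	/* defined even if the loop body never executes */
	size_t id;

	assert(written);
	*written = 0;

	for (id = 0; id < n; id++) {
		int value;

		ret = query(table, id, &value);
		if (ret == -ENOENT) {
			ret = 0;	/* a skipped ID is not a dump failure */
			continue;
		}
		if (ret)
			break;

		if (*written == cap) {
			ret = -EMSGSIZE;	/* output buffer is full */
			break;
		}
		out[(*written)++] = value;
	}

	return ret;
}
```

With this structure, an empty table and a table containing only missing IDs both report success with zero entries written.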
**No concurrency protection on xarray access:**
`xa_load()` in the netlink handlers returns a pointer that can be invalidated by a concurrent `drm_ras_node_unregister()`. The `query_error_counter` callback could then be called on freed memory. At minimum, RCU read-side protection should be used around the lookup + callback invocation, and `xa_erase` should be followed by `synchronize_rcu()` or the node should be RCU-freed.
**Unnecessary include:**
```c
+#include <linux/netdevice.h>
```
This header is for network devices and should not be needed for generic netlink RAS infrastructure. `<net/genetlink.h>` should suffice.
**`drm_core_init` error path:**
```c
+ ret = drm_ras_genl_family_register();
+ if (ret < 0)
+ goto error;
```
The `goto error` doesn't unwind `drm_privacy_screen_lookup_init()` called just above. Check whether the existing `error` label handles this correctly; if not, a new cleanup label is needed.
---
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/xe_drm_ras: Add support for XE DRM RAS
2026-02-28 8:08 ` [PATCH v9 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
@ 2026-03-03 4:32 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Partial registration leak in `register_nodes()`:**
If the second severity node fails to register, only that node is cleaned up. The first (already-registered) node is left dangling:
```c
+ for_each_error_severity(i) {
+ ...
+ ret = drm_ras_node_register(node);
+ if (ret) {
+ cleanup_node_param(ras, i);
+ return ret; /* previously registered node[0] leaked */
+ }
+ }
```
The error path needs to unregister and clean up all previously-succeeded nodes.
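One common way to structure such an unwind is to walk the loop index backwards from the point of failure. The sketch below is a userspace model under that assumption; `node_register()`, `node_unregister()`, the `registered[]` array, and the `fail_at` hook are hypothetical stand-ins, not the `drm_ras` API:

```c
#include <assert.h>
#include <errno.h>

#define NUM_NODES 2

static int registered[NUM_NODES];
static int fail_at = -1;	/* illustrative hook: index that fails to register */

/* Hypothetical stand-in for drm_ras_node_register(). */
static int node_register(int i)
{
	if (i == fail_at)
		return -EIO;
	registered[i] = 1;
	return 0;
}

/* Hypothetical stand-in for unregister plus per-node cleanup. */
static void node_unregister(int i)
{
	assert(registered[i]);	/* only undo what actually succeeded */
	registered[i] = 0;
}

static int register_nodes(void)
{
	int ret = 0;
	int i;

	for (i = 0; i < NUM_NODES; i++) {
		ret = node_register(i);
		if (ret)
			goto unwind;
	}
	return 0;

unwind:
	/* i is the index that failed; undo i-1 down to 0 in reverse order. */
	while (--i >= 0)
		node_unregister(i);
	return ret;
}
```

Unwinding in reverse order mirrors the registration order, so the function leaves no node registered when it returns an error.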
**`kasprintf` for PCI BDF name:**
```c
+ device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
+ pci_domain_nr(pdev->bus), pdev->bus->number,
+ PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
```
Consider using `pci_name(pdev)` which returns the same BDF string and avoids the separate allocation. If a separate copy is needed, `kstrdup(pci_name(pdev), GFP_KERNEL)` is simpler.
**`for_each_error_severity` macro in header without necessary include:**
```c
+#define for_each_error_severity(i) \
+ for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
```
`DRM_XE_RAS_ERR_SEV_MAX` comes from `xe_drm.h` but `xe_drm_ras.h` doesn't include it. Callers must remember to include `xe_drm.h` first, which is fragile.
**uAPI string-table macros (`DRM_XE_RAS_ERROR_SEVERITY_NAMES`, `DRM_XE_RAS_ERROR_COMPONENT_NAMES`):**
Defining string table macros in a uAPI header is unusual. These are kernel-internal lookup tables and should not be part of the userspace ABI. The enum values and their numeric IDs are sufficient for userspace; string names are already provided through the netlink attributes.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
2026-02-28 8:08 ` [PATCH v9 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
@ 2026-03-03 4:32 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Loss of FATAL vs NONFATAL distinction in log messages:**
The old code logged distinct strings ("CORRECTABLE", "NONFATAL", "FATAL"). The new code collapses NONFATAL and FATAL into "uncorrectable-errors":
```c
+static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_error hw_err)
+{
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+ return DRM_XE_RAS_ERR_SEV_CORRECTABLE;
+
+ /* Uncorrectable errors comprise of both fatal and non-fatal errors */
+ return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
+}
```
While this mapping is correct for the RAS counter abstraction, using the severity string directly in kernel log messages (e.g. `"HEC FW %s %s reported"`) loses important diagnostic information. A FATAL error requiring firmware flash vs a NONFATAL error should be distinguishable in `dmesg`. Consider keeping a separate `hw_error_to_str()` for log messages while using `hw_err_to_severity()` for counter bucketing.
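One possible shape for that split, as a standalone sketch. The enum definitions are local copies made for illustration, and their ordering and the `ras_severity` names are assumptions rather than the driver's actual definitions.

```c
/* Local, hypothetical copies of the enums discussed above. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

enum ras_severity {
	SEV_CORRECTABLE,
	SEV_UNCORRECTABLE,
};

/* For dmesg: keeps the three-way distinction. */
static const char *hw_error_to_str(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return "CORRECTABLE";
	case HARDWARE_ERROR_NONFATAL:
		return "NONFATAL";
	case HARDWARE_ERROR_FATAL:
		return "FATAL";
	}
	return "UNKNOWN";
}

/* For counter bucketing: fatal and non-fatal share one bucket. */
static enum ras_severity hw_err_to_severity(enum hardware_error hw_err)
{
	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
		return SEV_CORRECTABLE;
	return SEV_UNCORRECTABLE;
}
```

Log call sites would format with `hw_error_to_str()` while counter updates index by `hw_err_to_severity()`, so `dmesg` still separates FATAL from NONFATAL.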
**`hw_error_info_init` only for PVC but handler is still Battlemage-gated:**
```c
+static int hw_error_info_init(struct xe_device *xe)
+{
+ if (xe->info.platform != XE_PVC)
+ return 0;
+ return xe_drm_ras_init(xe);
+}
```
In this patch, `hw_error_source_handler` still checks `xe->info.platform != XE_BATTLEMAGE`, so Battlemage will still enter the handler but without RAS info initialized. The `info` pointer will be NULL. This is addressed in patch 4 with the `if (!info) goto clear_reg` guard, but it means this patch alone has a potential NULL dereference on Battlemage if applied independently (broken bisectability).
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
2026-02-28 8:08 ` [PATCH v9 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-03-03 4:32 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Platform guard broadened without full coverage:**
```c
- if (xe->info.platform != XE_BATTLEMAGE)
+ if (!IS_DGFX(xe))
return;
```
This now enters `hw_error_source_handler` for all discrete GPUs, but `hw_error_info_init` only initializes RAS for PVC. For Battlemage (and future DGFX), `info` will be NULL. The `if (!info) goto clear_reg` guard prevents a crash, but it also means **Battlemage silently stops processing error bits other than CSC** — a regression from the current code that does process them (even if just clearing the register). Consider whether this behavioral change is intentional.
**`xe_hw_error_map` array size issue with `break` vs `continue`:**
```c
+ for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
+ /* Check error bit is within bounds */
+ if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
+ break;
```
In this patch, `xe_hw_error_map` has only index `[0]`, so `ARRAY_SIZE` is 1, and the `break` exits the loop at the first set bit above 0. If bit 0 (GT) and bit 17 (CSC) are both set in the same status register read, this loop never visits CSC, though CSC is handled separately above. Because `for_each_set_bit` visits bits in ascending order, `break` skips only bits that are themselves out of bounds, so it behaves the same as `continue` here; `continue` would still state the intent more clearly. In patch 5 the array grows to size 17, so bits 1-15 fall within bounds but index entries that have no explicit initializer, and only bits 17 and above reach the `break`.
**Potential double-counting in subslice error path:**
```c
+ case ERR_STAT_GT_VECTOR0:
+ case ERR_STAT_GT_VECTOR1: {
+ val = hweight32(vector);
+ atomic_add(val, &info[error_id].counter);
+ ...
+ err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
+ for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
+ if (PVC_ERROR_MASK_SET(hw_err, errbit))
+ atomic_inc(&info[error_id].counter);
+ }
```
The vector register bits and the error status register bits both increment the same counter. If these represent the same underlying errors reported through two different registers, this double-counts. Please clarify in comments whether these are truly independent error events.
**`PVC_ERROR_MASK_SET` macro missing parentheses around `hw_err`:**
```c
+#define PVC_ERROR_MASK_SET(hw_err, err_bit) ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
```
Should be `((hw_err) == HARDWARE_ERROR_CORRECTABLE)` to prevent operator-precedence issues if a complex expression is passed.
Same issue in `PVC_GT_VECTOR_LEN`, `ERR_STAT_GT_VECTOR_REG`, and other macros using unparenthesized parameters.
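A small standalone illustration of the precedence hazard. The `IS_CORRECTABLE_*` macros and the enum values are hypothetical and only mirror the condition of the quoted macro.

```c
/* Local copy of the enum for illustration; the values are an assumption. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Parameter used bare, as in the quoted macro's condition. */
#define IS_CORRECTABLE_UNSAFE(hw_err) \
	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? 1 : 0)

/* Parameter wrapped in parentheses. */
#define IS_CORRECTABLE_SAFE(hw_err) \
	(((hw_err) == HARDWARE_ERROR_CORRECTABLE) ? 1 : 0)

/*
 * With a conditional expression as the argument, the unsafe form expands to
 *   is_fatal ? HARDWARE_ERROR_FATAL : (HARDWARE_ERROR_NONFATAL == 0)
 * because == binds tighter than ?:.
 */
static int unsafe_is_correctable(int is_fatal)
{
	return IS_CORRECTABLE_UNSAFE(is_fatal ? HARDWARE_ERROR_FATAL
					      : HARDWARE_ERROR_NONFATAL);
}

static int safe_is_correctable(int is_fatal)
{
	return IS_CORRECTABLE_SAFE(is_fatal ? HARDWARE_ERROR_FATAL
					    : HARDWARE_ERROR_NONFATAL);
}
```

For `is_fatal = 1` the unsafe variant reports a fatal error as correctable, while the parenthesized variant does not.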
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/xe_hw_error: Add support for PVC SoC errors
2026-02-28 8:08 ` [PATCH v9 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
@ 2026-03-03 4:32 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 4:32 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Fragile "Undefined" string comparison:**
```c
+ if (strcmp(name, "Undefined")) {
+ ...
+ atomic_inc(&info[index].counter);
+ }
```
Using `strcmp` against a magic string to skip entries is error-prone. If someone typos "Undefined" in one of the tables, errors would be silently miscounted. Use `NULL` entries and a `NULL` check instead, consistent with how other kernel error tables work.
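A sketch of the `NULL`-entry approach, runnable in userspace. The table contents, `soc_err_names_sketch` and `log_soc_error_sketch()` are hypothetical; the real tables and counter layout in the patch will differ.

```c
#include <stddef.h>

/*
 * Hypothetical name table: indices without a designated initializer are
 * NULL, so no "Undefined" placeholder string is needed.
 */
static const char *const soc_err_names_sketch[] = {
	[0] = "Example error A",
	[3] = "Example error B",
};

#define SOC_ERR_NAMES_LEN \
	(sizeof(soc_err_names_sketch) / sizeof(soc_err_names_sketch[0]))

static unsigned int counter_sketch;

/* Count only bits that have a name; skip out-of-range and NULL entries. */
static void log_soc_error_sketch(size_t bit)
{
	if (bit >= SOC_ERR_NAMES_LEN || !soc_err_names_sketch[bit])
		return;
	counter_sketch++;
}
```

A typo in a name can then no longer change which entries are counted, since the skip decision depends only on whether an entry exists.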
**Correctable SoC errors silently cleared without logging or counting:**
```c
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0));
+ xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave, hw_err), REG_GENMASK(31, 0));
+ goto unmask_gsysevtctl;
+ }
```
The registers are cleared with all-ones without first reading them to determine which errors occurred. No counter is incremented and no log message is produced. This makes the correctable SoC error counter permanently zero, which defeats the purpose of exposing it via RAS. At minimum, the registers should be read and counted before clearing.
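A userspace sketch of read-count-clear for one status register. It assumes write-1-to-clear semantics, which the all-ones write in the patch suggests but does not confirm. `fake_reg`, `fake_read32()`, `fake_write32()` and `hweight32_sketch()` are hypothetical stand-ins for the MMIO helpers and the kernel's `hweight32()`.

```c
#include <stdint.h>

/* Hypothetical stand-in for one SoC error status register. */
static uint32_t fake_reg;

static uint32_t fake_read32(void)
{
	return fake_reg;
}

/* Assumed write-1-to-clear behavior: set bits in val clear the register bits. */
static void fake_write32(uint32_t val)
{
	fake_reg &= ~val;
}

/* Portable population count, standing in for the kernel's hweight32(). */
static unsigned int hweight32_sketch(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

static unsigned int correctable_counter_sketch;

/* Read first, count the set bits, then clear only what was observed. */
static void handle_correctable_sketch(void)
{
	uint32_t stat = fake_read32();

	correctable_counter_sketch += hweight32_sketch(stat);
	fake_write32(stat);
}
```

Writing back the value that was read, rather than all-ones, also avoids clearing a bit that was set between the read and the write; whether that matters depends on the hardware.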
**Magic unmask value:**
```c
+ xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i),
+ (HARDWARE_ERROR_MAX << 1) + 1);
```
`(HARDWARE_ERROR_MAX << 1) + 1` = `(3 << 1) + 1` = `7` = `0b111`. This enables all three error types (correctable, nonfatal, fatal). This should be a named constant with a comment explaining the register field layout.
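One way to name the value, sketched under the assumption that the register carries one enable bit per error type in its low bits, as the arithmetic above implies. `SOC_GSYSEVTCTL_ENABLE_ALL_SKETCH` is a hypothetical name, and `HARDWARE_ERROR_MAX` is redefined locally as 3 to match the value used above.

```c
#define HARDWARE_ERROR_MAX 3 /* local copy; value taken from the arithmetic above */

/*
 * Hypothetical name. One enable bit per error type (correctable, nonfatal,
 * fatal) is assumed to occupy bits [HARDWARE_ERROR_MAX - 1:0].
 */
#define SOC_GSYSEVTCTL_ENABLE_ALL_SKETCH ((1u << HARDWARE_ERROR_MAX) - 1)

static unsigned int gsysevtctl_unmask_value(void)
{
	return SOC_GSYSEVTCTL_ENABLE_ALL_SKETCH;
}
```

Note that `(HARDWARE_ERROR_MAX << 1) + 1` equals an all-ones mask only because `HARDWARE_ERROR_MAX` is 3; for 4 it would give 9 rather than 15, which is another reason to spell the mask out.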
**`SOC_IEH0_LOCAL_ERR_STATUS` and `SOC_IEH1_LOCAL_ERR_STATUS` both `REG_BIT(0)`:**
```c
+#define SOC_IEH0_LOCAL_ERR_STATUS REG_BIT(0)
+#define SOC_IEH1_LOCAL_ERR_STATUS REG_BIT(0)
```
Two different names for the exact same bit value is confusing. If these are the same bit in different registers, a single define with a clear name (or a comment explaining the distinction) would be better.
**Potential over-counting from global + local register processing:**
In `soc_slave_ieh_handler` and the master handler, both the local error register bits and global error register bits are iterated, and each calls `log_soc_error()` which increments the counter. If a single error event sets bits in both registers, the counter is incremented multiple times. Clarify whether this is the intended behavior.
**`master` and `slave` variables declared but only used as constants:**
```c
+ u32 master, slave, regbit;
+ ...
+ master = SOC_PVC_MASTER_BASE;
+ slave = SOC_PVC_SLAVE_BASE;
```
These are compile-time constants. Consider using the defines directly or making them `const`.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/xe_hw_error: Add support for Core-Compute errors
2026-03-04 7:44 ` [PATCH v10 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-03-05 3:47 ` Claude Code Review Bot
0 siblings, 0 replies; 15+ messages in thread
From: Claude Code Review Bot @ 2026-03-05 3:47 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Note: `err_src` widened from `u32` to `unsigned long`**
```c
unsigned long flags, err_src;
...
err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
if (!err_src) {
```
`err_src` is changed from `u32` to `unsigned long` so that it can be passed to `for_each_set_bit`. Since `xe_mmio_read32` returns a `u32`, the value is zero-extended on assignment: on 64-bit the upper 32 bits are zero and the `!err_src` check behaves as before. There is no functional issue, but the type change is worth noting for clarity.
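A tiny userspace check of the zero-extension argument. `fake_mmio_read32()` is a hypothetical stand-in for `xe_mmio_read32()` that returns a value with bit 31 set, the case where sign extension would be visible if it occurred.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit MMIO read; bit 31 and bit 0 are set. */
static uint32_t fake_mmio_read32(void)
{
	return 0x80000001u;
}

/* Assigning an unsigned 32-bit value to unsigned long zero-extends it. */
static unsigned long read_err_src_sketch(void)
{
	unsigned long err_src = fake_mmio_read32();

	return err_src;
}
```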
**Bug: `xe_hw_error_map` is too small for the full register width**
```c
static const unsigned long xe_hw_error_map[] = {
[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE, // index 0
};
```
This array has only 1 entry (after patch 4) or 17 entries (after patch 5 adds `[XE_SOC_ERROR] = ...` at index 16). But the `for_each_set_bit` loop iterates up to `XE_RAS_REG_SIZE` (32 bits):
```c
for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
break;
```
Because `for_each_set_bit` iterates in ascending order, the `break` on an out-of-bounds bit only skips bits at or above `ARRAY_SIZE(xe_hw_error_map)`; no lower-numbered bit can be missed. With `ARRAY_SIZE` equal to 1 (patch 4 only), any set bit above 0 ends the loop, which is correct since only bit 0 maps to anything in patch 4. If bit 17 (CSC) is set, the CSC path runs before this loop and takes `goto clear_reg`, so the loop is never reached even when bit 0 is also set. This works but is fragile.
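The ascending-order argument can be checked with a small userspace model of the loop. `handled_bits()` is a hypothetical helper that walks set bits from low to high, as `for_each_set_bit` does, and returns a mask of the bits that pass the bounds check.

```c
#include <stdint.h>

#define REG_BITS 32 /* mirrors XE_RAS_REG_SIZE as described above */

/*
 * Walk set bits in ascending order and return the mask of in-bounds bits
 * that were visited. use_break selects break vs continue on out-of-bounds.
 */
static uint32_t handled_bits(uint32_t err_src, unsigned int map_size,
			     int use_break)
{
	uint32_t handled = 0;
	unsigned int bit;

	for (bit = 0; bit < REG_BITS; bit++) {
		if (!(err_src & (1u << bit)))
			continue;
		if (bit >= map_size) {
			if (use_break)
				break;
			continue;
		}
		handled |= 1u << bit;
	}
	return handled;
}
```

For the same `err_src` and map size, both variants return the same mask, so the choice between `break` and `continue` affects readability and robustness to future reordering rather than current behavior.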
**Concern: Counting methodology may inflate counters**
```c
val = hweight32(vector);
atomic_add(val, &info[error_id].counter);
```
For subslice errors, the code counts the number of set bits in the vector register AND also reads the error status register and counts its set bits, all incrementing the same counter:
```c
atomic_add(val, &info[error_id].counter); // vector bits
...
for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
if (PVC_ERROR_MASK_SET(hw_err, errbit))
atomic_inc(&info[error_id].counter); // status bits
}
```
This means a single error event could increment the counter by `hweight32(vector) + hweight32(err_stat & mask)`. Is this the intended counting behavior? It seems like it might double-count or over-count errors.
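A worked example of that arithmetic, as a userspace sketch. `counter_increment_sketch()` and `popcount32_sketch()` are hypothetical helpers, and the mask stands in for whatever `PVC_ERROR_MASK_SET` selects. The limit of `GT_HW_ERROR_MAX_ERR_BITS` on the status loop is ignored here.

```c
#include <stdint.h>

/* Portable population count, standing in for the kernel's hweight32(). */
static unsigned int popcount32_sketch(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Total added to the counter for one pass through the subslice path. */
static unsigned int counter_increment_sketch(uint32_t vector,
					     uint32_t err_stat,
					     uint32_t mask)
{
	return popcount32_sketch(vector) + popcount32_sketch(err_stat & mask);
}
```

With one vector bit set and two masked status bits set, the counter moves by 3, which is only right if all three bits represent distinct error events.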
**Minor: Missing `HW_ERR` prefix in some log messages**
The new `log_hw_error()` and `log_gt_err()` functions don't use the `HW_ERR` prefix:
```c
drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
```
But the original CSC handler and `hw_error_source_handler` do:
```c
drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported ...");
```
This inconsistency makes grep/filtering harder for sysadmins.
---
Generated by Claude Code Patch Reviewer
end of thread
Thread overview: 15+ messages
2026-02-28 8:08 [PATCH v9 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-02-28 8:08 ` [PATCH v9 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2026-02-28 16:04 ` Jakub Kicinski
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-02-28 8:08 ` [PATCH v9 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
2026-03-03 4:32 ` Claude review: " Claude Code Review Bot
2026-03-03 4:32 ` Claude review: Introduce DRM_RAS using generic netlink for RAS Claude Code Review Bot
-- strict thread matches above, loose matches on Subject: below --
2026-03-04 7:44 [PATCH v10 0/5] " Riana Tauro
2026-03-04 7:44 ` [PATCH v10 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
2026-03-05 3:47 ` Claude review: " Claude Code Review Bot
2026-02-23 6:05 [PATCH v8 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-02-23 6:05 ` [PATCH v8 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
2026-02-24 0:45 ` Claude review: " Claude Code Review Bot