[PATCH 0/4] Add support for clear counter and error event in DRM RAS

public inbox for drm-ai-reviews@public-inbox.freedesktop.org

* [PATCH 0/4] Add support for clear counter and error event in DRM RAS
@ 2026-03-11 10:29 Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Clear Error Counter: Add a clear-error-counter command to DRM RAS that
clears a specific error counter of a node. Implement the callback in the
XE driver to demonstrate usage.

Usage with both get-error-counter and clear-error-counter:

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 3}]

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":2}'
None

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]

Error Event Support: Introduce `error-event` support in DRM RAS to notify
userspace whenever an error occurs.

Each notification includes the node-id and error-id to identify
the source and type of the error. To receive notifications,
userspace must subscribe to the 'error-notify' multicast group:

$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 2, 'node-id': 1}, 'name': 'error-event'}

Riana Tauro (4):
  drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  drm/drm_ras: Add DRM RAS netlink error event notification
  drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS

 Documentation/gpu/drm-ras.rst            | 17 +++++
 Documentation/netlink/specs/drm_ras.yaml | 27 ++++++-
 drivers/gpu/drm/drm_ras.c                | 91 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 19 +++++
 drivers/gpu/drm/drm_ras_nl.h             |  6 ++
 drivers/gpu/drm/xe/xe_drm_ras.c          | 52 +++++++++++++-
 drivers/gpu/drm/xe/xe_drm_ras.h          |  7 ++
 drivers/gpu/drm/xe/xe_hw_error.c         |  5 ++
 include/drm/drm_ras.h                    | 13 ++++
 include/uapi/drm/drm_ras.h               |  4 ++
 10 files changed, 237 insertions(+), 4 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
	Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
value for a specific error counter of a given node.

The command is a 'do' netlink request with 'node-id' and 'error-id'
as parameters with no additional response payload.

Usage:

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-ras.rst            |  8 +++++
 Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
 drivers/gpu/drm/drm_ras.c                | 43 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 13 +++++++
 drivers/gpu/drm/drm_ras_nl.h             |  2 ++
 include/drm/drm_ras.h                    | 11 ++++++
 include/uapi/drm/drm_ras.h               |  1 +
 7 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 70b246a78fc8..4636e68f5678 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -52,6 +52,8 @@ User space tools can:
   as a parameter.
 * Query specific error counter values with the ``get-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Clear specific error counters with the ``clear-error-counter`` command, using both
+  ``node-id`` and ``error-id`` as parameters.
 
 YAML-based Interface
 --------------------
@@ -101,3 +103,9 @@ Example: Query an error counter for a given node
     sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
     {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
 
+Example: Clear an error counter for a given node
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
+    None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index 79af25dac3c5..e113056f8c01 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -99,7 +99,7 @@ operations:
       flags: [admin-perm]
       do:
         request:
-          attributes:
+          attributes: &id-attrs
             - node-id
             - error-id
         reply:
@@ -113,3 +113,14 @@ operations:
             - node-id
         reply:
           attributes: *errorinfo
+    -
+      name: clear-error-counter
+      doc: >-
+           Clear error counter for a given node.
+           The request includes the error-id and node-id of the
+           counter to be cleared.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes: *id-attrs
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index b2fa5ab86d87..d6eab29a1394 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -26,7 +26,7 @@
  * efficient lookup by ID. Nodes can be registered or unregistered
  * dynamically at runtime.
  *
- * A Generic Netlink family `drm_ras` exposes two main operations to
+ * A Generic Netlink family `drm_ras` exposes the below operations to
  * userspace:
  *
  * 1. LIST_NODES: Dump all currently registered RAS nodes.
@@ -37,6 +37,10 @@
  *    Returns all counters of a node if only Node ID is provided or specific
  *    error counters.
  *
+ * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
+ *    Userspace must provide Node ID, Error ID.
+ *    Clears specific error counter of a node if supported.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -66,6 +70,8 @@
  *   operation, fetching all counters from a specific node.
  * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
  *   operation, fetching a counter value from a specific node.
+ * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
+ *   operation, clearing a counter value from a specific node.
  */
 
 static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 	return doit_reply_value(info, node_id, error_id);
 }
 
+/**
+ * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * clears the current value.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct drm_ras_node *node;
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->clear_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->clear_error_counter(node, error_id);
+}
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index 16803d0c4a44..dea1c1b2494e 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
 	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
 };
 
+/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
 /* Ops table for drm_ras */
 static const struct genl_split_ops drm_ras_nl_ops[] = {
 	{
@@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+		.doit		= drm_ras_nl_clear_error_counter_doit,
+		.policy		= drm_ras_clear_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index 06ccd9342773..a398643572a5 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 				      struct genl_info *info);
 int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 					struct netlink_callback *cb);
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
 
 extern struct genl_family drm_ras_nl_family;
 
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index 5d50209e51db..f2a787bc4f64 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -58,6 +58,17 @@ struct drm_ras_node {
 	int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
 				   const char **name, u32 *val);
 
+	/**
+	 * @clear_error_counter:
+	 *
+	 * This callback is used by drm_ras to clear a specific error counter.
+	 * Driver should implement this callback to support clearing error counters
+	 * of a node.
+	 *
+	 * Returns: 0 on success, negative error code on failure.
+	 */
+	int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+
 	/** @priv: Driver private data */
 	void *priv;
 };
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 5f40fa5b869d..218a3ee86805 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -41,6 +41,7 @@ enum {
 enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
+	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
-- 
2.47.1



* Claude review: drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-11 21:06   ` Claude Code Review Bot
  0 siblings, 0 replies; 10+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

Overall a clean patch. The YAML spec, documentation, and uapi additions look correct.

**Concurrency concern in `drm_ras_nl_clear_error_counter_doit`:**

```c
node = xa_load(&drm_ras_xa, node_id);
if (!node || !node->clear_error_counter)
    return -ENOENT;
```

There is no locking or RCU protection around the `xa_load` and the subsequent use of `node`. If a node is unregistered concurrently, this could lead to a use-after-free. The pre-existing `get_error_counter` paths likely share the problem, so this is a pre-existing design issue, but adding a new operation makes it worth mentioning. Consider using `xa_lock`/`xa_unlock` or RCU to protect node lifetime.

**Return value for unsupported clear operation:**

```c
if (!node || !node->clear_error_counter)
    return -ENOENT;
```

Returning `-ENOENT` when `clear_error_counter` is NULL (i.e., the node exists but doesn't support clearing) is slightly misleading. `-EOPNOTSUPP` would be more appropriate when the node exists but the operation is not implemented.
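
The split can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming a simplified stand-in for `struct drm_ras_node`; every `model_` name is hypothetical and not part of the patch:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct drm_ras_node. */
struct model_node {
    unsigned int first; /* models error_counter_range.first */
    unsigned int last;  /* models error_counter_range.last */
    int (*clear_error_counter)(struct model_node *node, unsigned int error_id);
};

/* Trivial callback used only to exercise the success path. */
static int model_clear_ok(struct model_node *node, unsigned int error_id)
{
    (void)node;
    (void)error_id;
    return 0;
}

/* Same checks as the doit handler, with the NULL-callback case split out. */
static int model_clear(struct model_node *node, unsigned int error_id)
{
    if (!node)
        return -ENOENT;     /* no such node */
    if (!node->clear_error_counter)
        return -EOPNOTSUPP; /* node exists but cannot clear */
    if (error_id < node->first || error_id > node->last)
        return -EINVAL;     /* error-id outside the node's range */

    return node->clear_error_counter(node, error_id);
}
```

With this ordering, a client can tell a wrong `node-id` (`-ENOENT`) apart from a node whose driver provides no clear callback (`-EOPNOTSUPP`).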

**YAML anchor reuse is a nice touch:**

```yaml
          attributes: &id-attrs
            - node-id
            - error-id
```

Using `&id-attrs` / `*id-attrs` to avoid duplicating the attribute list is good.

---
Generated by Claude Code Patch Reviewer


* [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add support for clear-error-counter command in XE DRM RAS.
This resets the counter value.

Usage:

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index e07dc23a155e..c21c8b428de6 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
 	return 0;
 }
 
+static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+{
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	atomic_set(&info[error_id].counter, 0);
+
+	return 0;
+}
+
 static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					     const char **name, u32 *val)
 {
@@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					   const char **name, u32 *val)
 {
@@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
 {
 	struct xe_drm_ras_counter *counter;
@@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
 	if (IS_ERR(ras->info[severity]))
 		return PTR_ERR(ras->info[severity]);
 
-	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
 		node->query_error_counter = query_correctable_error_counter;
-	else
+		node->clear_error_counter = clear_correctable_error_counter;
+	} else {
 		node->query_error_counter = query_uncorrectable_error_counter;
+		node->clear_error_counter = clear_uncorrectable_error_counter;
+	}
 
 	return 0;
 }
-- 
2.47.1



* Claude review: drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-11 21:06   ` Claude Code Review Bot
  0 siblings, 0 replies; 10+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

Straightforward implementation that mirrors the existing query pattern.

**Atomicity concern in `hw_clear_error_counter`:**

```c
atomic_set(&info[error_id].counter, 0);
```

Using `atomic_set` to clear is fine for a simple reset-to-zero. However, if the counter can be incremented concurrently (e.g., from an IRQ handler via `atomic_inc`), an increment that lands just before the clear is zeroed without ever being observed by userspace. This may be acceptable for RAS counters where "best effort" clearing is fine, but it's worth documenting the intent. If exact clearing semantics are needed, using `atomic_xchg` and reporting or logging its return value (the count at the moment of clearing) might be more appropriate.
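
The exchange-based alternative can be sketched in userspace with C11 atomics. This is an illustration only: it assumes `atomic_exchange` as a stand-in for the kernel's `atomic_xchg`, and the `demo_` names are hypothetical:

```c
#include <stdatomic.h>

/* Hypothetical userspace counter standing in for the driver's atomic_t. */
static atomic_uint demo_error_count;

/* Record one error, as an interrupt handler might. */
static void demo_record_error(void)
{
    atomic_fetch_add(&demo_error_count, 1);
}

/*
 * Clear the counter and return the value it held at the moment of
 * clearing, so the caller can log or report what was discarded.
 */
static unsigned int demo_clear_counter(void)
{
    return atomic_exchange(&demo_error_count, 0);
}
```

Because the read and the reset happen in one atomic step, every increment is either included in the returned value or counted after the clear; none is dropped unseen.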

**Code duplication between correctable/uncorrectable:**

```c
static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
{
    struct xe_device *xe = node->priv;
    struct xe_drm_ras *ras = &xe->ras;
    struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
    return hw_clear_error_counter(info, error_id);
}

static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
{
    ...
    struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
    return hw_clear_error_counter(info, error_id);
}
```

This mirrors the existing duplication in the query callbacks so it's consistent, but the pattern of having two nearly identical functions differing only by a severity constant is a bit unfortunate. Not a blocker since it matches the existing code style.
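
One possible way to collapse the two callbacks, sketched in userspace. It assumes the nodes live in an array indexed by severity (as `&ras->node[severity]` in patch 4 suggests), so the severity can be recovered from the node pointer. All `sketch_` names are hypothetical, and `priv` points straight at the RAS structure here, whereas the driver stores the `xe_device`:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

enum sketch_severity {
    SKETCH_SEV_CORRECTABLE,
    SKETCH_SEV_UNCORRECTABLE,
    SKETCH_SEV_MAX,
};

struct sketch_counter {
    const char *name;
    atomic_int counter;
};

struct sketch_node {
    void *priv;
};

struct sketch_ras {
    struct sketch_node node[SKETCH_SEV_MAX];
    struct sketch_counter *info[SKETCH_SEV_MAX];
};

/*
 * Single clear callback for both severities. The error_id range is
 * assumed to be validated by the caller, as the core handler in
 * patch 1 does before invoking the callback.
 */
static int sketch_clear_error_counter(struct sketch_node *node, unsigned int error_id)
{
    struct sketch_ras *ras = node->priv;
    ptrdiff_t severity = node - ras->node; /* array index doubles as severity */
    struct sketch_counter *info = ras->info[severity];

    if (!info || !info[error_id].name)
        return -ENOENT;

    atomic_store(&info[error_id].counter, 0);
    return 0;
}
```

The same approach would apply to the query callbacks; whether it is worth diverging from the existing style is a judgment call for the maintainers.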

---
Generated by Claude Code Patch Reviewer


* [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
  2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
  2026-03-11 21:06 ` Claude review: Add support for clear counter and error event in " Claude Code Review Bot
  4 siblings, 1 reply; 10+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
	Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Add support for asynchronous error notifications in drm_ras.

Define a new `error-event` netlink event and a new multicast
group `error-notify` in drm_ras spec. Each event contains
a node-id and error-id to identify the type and source
of error.

Add drm_ras_error_notify() to trigger this event from drivers.
Userspace can receive this event by subscribing to the
multicast group error-notify.

Example: Using ynl tool

$ sudo ynl --family drm_ras --subscribe error-notify

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-ras.rst            |  9 +++++
 Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
 drivers/gpu/drm/drm_ras.c                | 48 ++++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_nl.c             |  6 +++
 drivers/gpu/drm/drm_ras_nl.h             |  4 ++
 include/drm/drm_ras.h                    |  2 +
 include/uapi/drm/drm_ras.h               |  3 ++
 7 files changed, 86 insertions(+)

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 4636e68f5678..09b2918f67bd 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -54,6 +54,8 @@ User space tools can:
   ``node-id`` and ``error-id`` as parameters.
 * Clear specific error counters with the ``clear-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Listen to ``error-event`` notifications for error events by subscribing to the
+  ``error-notify`` multicast group.
 
 YAML-based Interface
 --------------------
@@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
 
     sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
     None
+
+Example: Listen to error events
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --subscribe error-notify
+    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index e113056f8c01..4dc047be59e9 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -124,3 +124,17 @@ operations:
       do:
         request:
           attributes: *id-attrs
+    -
+      name: error-event
+      doc: >-
+           Notify userspace of an error event.
+           The event includes the error-id and node-id of the error
+           that triggered the event.
+      attribute-set: error-counter-attrs
+      event:
+        attributes: *id-attrs
+
+mcast-groups:
+  list:
+    -
+      name: error-notify
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..36a3a79cbbea 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -41,6 +41,10 @@
  *    Userspace must provide Node ID, Error ID.
  *    Clears specific error counter of a node if supported.
  *
+ * 4. ERROR_EVENT: Notify userspace of an error event.
+ *    The event includes the error-id and node-id of the error
+ *    that triggered the event.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -355,6 +359,50 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 	return node->clear_error_counter(node, error_id);
 }
 
+/**
+ * drm_ras_error_notify() - Notify userspace of an error event
+ * @node: Node structure
+ * @error_id: ID of the error counter that triggered the event
+ * @flags: GFP flags for memory allocation
+ *
+ * Notifies userspace of an error event related to a specific RAS node and error counter.
+ */
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags)
+{
+	struct genl_info info;
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	int ret;
+
+	genl_info_init_ntf(&info, &drm_ras_nl_family, DRM_RAS_CMD_ERROR_EVENT);
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, flags);
+	if (!msg)
+		return;
+
+	hdr = genlmsg_iput(msg, &info);
+	if (!hdr)
+		goto err_free;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID, node->id);
+	if (ret)
+		goto err_cancel;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		goto err_cancel;
+
+	genlmsg_end(msg, hdr);
+	genlmsg_multicast(&drm_ras_nl_family, msg, 0, DRM_RAS_NLGRP_ERROR_NOTIFY, flags);
+	return;
+
+err_cancel:
+	genlmsg_cancel(msg, hdr);
+err_free:
+	nlmsg_free(msg);
+}
+EXPORT_SYMBOL(drm_ras_error_notify);
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index dea1c1b2494e..ac724bb87a3b 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -58,6 +58,10 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 	},
 };
 
+static const struct genl_multicast_group drm_ras_nl_mcgrps[] = {
+	[DRM_RAS_NLGRP_ERROR_NOTIFY] = { "error-notify", },
+};
+
 struct genl_family drm_ras_nl_family __ro_after_init = {
 	.name		= DRM_RAS_FAMILY_NAME,
 	.version	= DRM_RAS_FAMILY_VERSION,
@@ -66,4 +70,6 @@ struct genl_family drm_ras_nl_family __ro_after_init = {
 	.module		= THIS_MODULE,
 	.split_ops	= drm_ras_nl_ops,
 	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+	.mcgrps		= drm_ras_nl_mcgrps,
+	.n_mcgrps	= ARRAY_SIZE(drm_ras_nl_mcgrps),
 };
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index a398643572a5..17e1af8cc3b3 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -21,6 +21,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 					struct genl_info *info);
 
+enum {
+	DRM_RAS_NLGRP_ERROR_NOTIFY,
+};
+
 extern struct genl_family drm_ras_nl_family;
 
 #endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index f2a787bc4f64..a2d4f257c9c2 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -78,9 +78,11 @@ struct drm_device;
 #if IS_ENABLED(CONFIG_DRM_RAS)
 int drm_ras_node_register(struct drm_ras_node *node);
 void drm_ras_node_unregister(struct drm_ras_node *node);
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags);
 #else
 static inline int drm_ras_node_register(struct drm_ras_node *node) { return 0; }
 static inline void drm_ras_node_unregister(struct drm_ras_node *node) { }
+static inline void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags) { }
 #endif
 
 #endif
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 218a3ee86805..47fafeff93e7 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -42,9 +42,12 @@ enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
 	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+	DRM_RAS_CMD_ERROR_EVENT,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
 };
 
+#define DRM_RAS_MCGRP_ERROR_NOTIFY	"error-notify"
+
 #endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.47.1



* Claude review: drm/drm_ras: Add DRM RAS netlink error event notification
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-11 21:06   ` Claude Code Review Bot
  0 siblings, 0 replies; 10+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

This is the core event notification infrastructure. The implementation follows standard genl multicast patterns.

**`drm_ras_error_notify` looks correct:**

```c
void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags)
{
    ...
    genl_info_init_ntf(&info, &drm_ras_nl_family, DRM_RAS_CMD_ERROR_EVENT);
    msg = genlmsg_new(NLMSG_GOODSIZE, flags);
    ...
    genlmsg_multicast(&drm_ras_nl_family, msg, 0, DRM_RAS_NLGRP_ERROR_NOTIFY, flags);
```

The function properly accepts `gfp_t flags` which is important since it will be called from atomic context (patch 4 calls it with `GFP_ATOMIC`). The error handling with `goto err_cancel` / `goto err_free` is correct.

**`DRM_RAS_NLGRP_ERROR_NOTIFY` placement:**

```c
enum {
    DRM_RAS_NLGRP_ERROR_NOTIFY,
};
```

This enum is defined in `drm_ras_nl.h` which is a driver-internal header. That's fine since the multicast group index is kernel-internal; userspace uses the string name `"error-notify"` to subscribe.

**`EXPORT_SYMBOL` vs `EXPORT_SYMBOL_GPL`:**

```c
EXPORT_SYMBOL(drm_ras_error_notify);
```

This should use whichever export macro the existing RAS functions use. If the base series uses `EXPORT_SYMBOL_GPL` for other RAS functions, this should match.

**No validation of `node` or `error_id`:**

The function doesn't validate that `node` is non-NULL or that `error_id` is within range. Since this is a DRM-internal API called only by drivers, this is likely acceptable, but a brief note or `WARN_ON(!node)` might help catch driver bugs.
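
A minimal userspace sketch of such a guard, assuming a reduced stand-in for the node that carries only the error-id range from patch 1. The `check_` names are hypothetical; in the kernel the failing case would more likely be wrapped in `WARN_ON` than returned as a boolean:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the fields the check needs. */
struct check_node {
    unsigned int first; /* models error_counter_range.first */
    unsigned int last;  /* models error_counter_range.last */
};

/* Returns true when (node, error_id) is safe to put into an event. */
static bool check_event_args(const struct check_node *node, unsigned int error_id)
{
    if (!node)
        return false;

    return error_id >= node->first && error_id <= node->last;
}
```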

---
Generated by Claude Code Patch Reviewer


* [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
                   ` (2 preceding siblings ...)
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
  2026-03-11 21:06 ` Claude review: Add support for clear counter and error event in " Claude Code Review Bot
  4 siblings, 1 reply; 10+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add error-event support in XE DRM RAS to notify userspace
whenever a GT or SoC error occurs.

$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras.c  | 17 +++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h  |  7 +++++++
 drivers/gpu/drm/xe/xe_hw_error.c |  5 +++++
 3 files changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c21c8b428de6..47c040c80175 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -181,6 +181,23 @@ static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
 	}
 }
 
+/**
+ * xe_drm_ras_notify() - Notify userspace of an error event
+ * @ras:  ras structure
+ * @error_id: error id
+ * @severity: error severity
+ * @flags: flags for allocation
+ *
+ * Notifies userspace of an error.
+ */
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+		       const enum drm_xe_ras_error_severity severity, gfp_t flags)
+{
+	struct drm_ras_node *node = &ras->node[severity];
+
+	drm_ras_error_notify(node, error_id, flags);
+}
+
 /**
  * xe_drm_ras_init() - Initialize DRM RAS
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
index 5cc8f0124411..ac347d0d63eb 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -5,11 +5,18 @@
 #ifndef XE_DRM_RAS_H_
 #define XE_DRM_RAS_H_
 
+#include <linux/types.h>
+
+#include <drm/xe_drm.h>
+
 struct xe_device;
+struct xe_drm_ras;
 
 #define for_each_error_severity(i)	\
 	for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
 
 int xe_drm_ras_init(struct xe_device *xe);
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+		       const enum drm_xe_ras_error_severity severity, gfp_t flags);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 2a31b430570e..17424e07e72c 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -332,6 +332,8 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 
 		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
 	}
+
+	xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
 }
 
 static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
@@ -368,6 +370,7 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
 	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
 	struct xe_mmio *mmio = &tile->mmio;
 	unsigned long master_global_errstat, master_local_errstat;
 	u32 master, slave, regbit;
@@ -418,6 +421,8 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	for (i = 0; i < XE_SOC_NUM_IEH; i++)
 		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i),
 				(HARDWARE_ERROR_MAX << 1) + 1);
+
+	xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
 }
 
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
-- 
2.47.1



* Claude review: drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS
  2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
@ 2026-03-11 21:06   ` Claude Code Review Bot
  0 siblings, 0 replies; 10+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Potential concern: which `error_id` is reported from `gt_hw_error_handler`:**

```c
xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
```

This call is added at the end of `gt_hw_error_handler`, after the loop that processes the error vectors. The hunk adds no declarations, so `ras` and `error_id` must already be in scope, but the diff context does not show where they come from. If `error_id` is a function parameter, as it is in the neighbouring `soc_slave_ieh_handler(..., u32 error_id)`, it is fixed for the whole invocation and one notification per call is reasonable. If it is instead assigned inside the vector loop, it holds only the *last* ID processed by the time the call runs, so simultaneous errors with different IDs would produce a single notification. In that case the call belongs inside the loop body, so that each error ID triggers its own event. The author should confirm which case applies.

**Same question in `soc_hw_error_handler`:**

```c
xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
```

The call again sits after the final loop. Here the patch adds `ras` as a local (`&xe->ras`) but no declaration for `error_id`, so it too is pre-existing, most likely a parameter. The same reasoning applies: if `error_id` can change during error processing, move the notification into the processing loop; otherwise the placement is fine.
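
To make the difference concrete, here is a small userspace sketch of the two placements. It assumes the problematic case, in which the ID is reassigned per status bit; the patch context does not confirm that the XE handlers work this way. All names (`mock_notify`, `mock_bit_to_error_id`, the 32-bit `status` word) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_EVENTS 32

/* Records the IDs that would have been sent to userspace. */
struct mock_event_log {
	uint32_t ids[MOCK_MAX_EVENTS];
	size_t count;
};

static void mock_notify(struct mock_event_log *log, uint32_t error_id)
{
	if (log->count < MOCK_MAX_EVENTS)
		log->ids[log->count++] = error_id;
}

/* Hypothetical mapping from a status bit to an error ID (IDs start at 1). */
static uint32_t mock_bit_to_error_id(unsigned int bit)
{
	return bit + 1;
}

/* Notify once after the loop: only the last ID seen is reported. */
static void handle_notify_after_loop(struct mock_event_log *log, uint32_t status)
{
	uint32_t error_id = 0;
	unsigned int bit;

	for (bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			error_id = mock_bit_to_error_id(bit);
	}

	if (error_id)
		mock_notify(log, error_id);
}

/* Notify inside the loop: every set bit produces its own event. */
static void handle_notify_in_loop(struct mock_event_log *log, uint32_t status)
{
	unsigned int bit;

	for (bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			mock_notify(log, mock_bit_to_error_id(bit));
	}
}
```

With bits 0 and 2 set, the first variant logs one event and the second logs two. If `error_id` is a plain parameter in the real handlers, the two placements are equivalent and the existing one is fine.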

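**`GFP_ATOMIC` usage looks correct**, assuming both `gt_hw_error_handler` and `soc_hw_error_handler` run in atomic context. The review takes them to be called under `spin_lock_irqsave` in `hw_error_source_handler`, which this patch's context does not show; if that holds, a sleeping allocation would not be allowed there.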

**`xe_drm_ras_notify` wrapper is thin but useful:**

```c
void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
                       const enum drm_xe_ras_error_severity severity, gfp_t flags)
{
    struct drm_ras_node *node = &ras->node[severity];
    drm_ras_error_notify(node, error_id, flags);
}
```

This provides a clean mapping from the XE-specific severity-based node array to the generic DRM RAS node. The `const` on the `severity` parameter in the declaration is unnecessary (it's a value parameter) but harmless.

**Missing bounds check on `severity`:** If an invalid `severity` value is passed, `ras->node[severity]` would access out of bounds. The callers currently pass values derived from `hw_err_to_severity()` which should be safe, but a defensive check wouldn't hurt.
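
A defensive lookup could be sketched as below. This is a userspace illustration only: `mock_severity`, `mock_ras` and `mock_severity_to_node` are hypothetical stand-ins for `enum drm_xe_ras_error_severity`, `struct xe_drm_ras` and the indexing done inside `xe_drm_ras_notify`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the UAPI severity enum and its MAX sentinel. */
enum mock_severity {
	MOCK_SEV_CORRECTABLE,
	MOCK_SEV_UNCORRECTABLE,
	MOCK_SEV_MAX,
};

struct mock_node {
	uint32_t id;
};

/* One node per severity, mirroring the ras->node[severity] indexing. */
struct mock_ras {
	struct mock_node node[MOCK_SEV_MAX];
};

/*
 * Return the node for a severity, or NULL when the index is out of range.
 * Taking the value as unsigned also rejects negative enum values.
 */
static struct mock_node *mock_severity_to_node(struct mock_ras *ras,
					       unsigned int severity)
{
	if (!ras || severity >= MOCK_SEV_MAX)
		return NULL;

	return &ras->node[severity];
}
```

The caller would then skip the notification when the lookup returns `NULL`. Given that the current callers derive the value from `hw_err_to_severity()`, this guards against future misuse more than against a present bug.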

---
Generated by Claude Code Patch Reviewer


* Claude review: Add support for clear counter and error event in DRM RAS
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
                   ` (3 preceding siblings ...)
  2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
@ 2026-03-11 21:06 ` Claude Code Review Bot
  4 siblings, 0 replies; 10+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: Add support for clear counter and error event in DRM RAS
Author: Riana Tauro <riana.tauro@intel.com>
Patches: 4 (plus cover letter)
Reviewed: 2026-03-12T07:06:12.645794

---

This series adds two new features to the DRM RAS (Reliability, Availability, Serviceability) netlink framework: a `clear-error-counter` command and an `error-event` multicast notification. The series is well-structured with the DRM core changes preceding the XE driver implementations. The code is generally clean, but there are a few concerns around concurrency/locking in the clear path, lack of node lifetime protection (no RCU/refcount on `xa_load`), and whether the `error_id` variable is actually valid at the point of notification in patch 4.

---
Generated by Claude Code Patch Reviewer


end of thread

Thread overview: 10+ messages
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
2026-03-11 21:06   ` Claude review: " Claude Code Review Bot
2026-03-11 21:06 ` Claude review: Add support for clear counter and error event in " Claude Code Review Bot
