* Claude review: drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-11 21:06 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-03-11 21:06 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
Overall a clean patch. The YAML spec, documentation, and uapi additions look correct.
**Concurrency concern in `drm_ras_nl_clear_error_counter_doit`:**
```c
node = xa_load(&drm_ras_xa, node_id);
if (!node || !node->clear_error_counter)
	return -ENOENT;
```
There is no locking or RCU protection around the `xa_load` and the subsequent use of `node`. If a node is unregistered concurrently, the handler could dereference a freed node (use-after-free). The same concern likely exists in the pre-existing `get_error_counter` paths, so this is a pre-existing design issue, but adding a new operation makes it worth mentioning. Consider using `xa_lock`/`xa_unlock` or RCU to protect node lifetime.
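As a rough illustration of the lifetime problem, here is a minimal userspace sketch, not kernel code. It assumes a small fixed-size registry guarded by a pthread mutex and a plain reference count; `struct ras_node`, `node_get()`, `node_put()` and the other helpers are hypothetical names invented for this example. In the kernel the equivalent would more likely be built from `xa_lock`, RCU, or a `kref`.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8

/* Hypothetical stand-in for struct drm_ras_node. */
struct ras_node {
	int refcount;		/* protected by registry_lock */
	uint32_t counter;	/* the value a "clear" resets */
	int freed;		/* set instead of calling free(), so release is observable */
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ras_node *registry[MAX_NODES];

/* Publish a node; the registry holds the initial reference. */
static int node_register(uint32_t id, struct ras_node *node)
{
	int ret = 0;

	if (id >= MAX_NODES)
		return -EINVAL;

	pthread_mutex_lock(&registry_lock);
	if (registry[id]) {
		ret = -EEXIST;
	} else {
		node->refcount = 1;
		registry[id] = node;
	}
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

/* Look up a node and pin it while the lock is still held. */
static struct ras_node *node_get(uint32_t id)
{
	struct ras_node *node = NULL;

	pthread_mutex_lock(&registry_lock);
	if (id < MAX_NODES && registry[id]) {
		node = registry[id];
		node->refcount++;
	}
	pthread_mutex_unlock(&registry_lock);
	return node;
}

/* Drop a reference; the last put releases the node. */
static void node_put(struct ras_node *node)
{
	pthread_mutex_lock(&registry_lock);
	if (--node->refcount == 0)
		node->freed = 1;
	pthread_mutex_unlock(&registry_lock);
}

/* Remove the node from the table and drop the registry's reference. */
static void node_unregister(uint32_t id)
{
	struct ras_node *node = NULL;

	if (id >= MAX_NODES)
		return;

	pthread_mutex_lock(&registry_lock);
	node = registry[id];
	registry[id] = NULL;
	pthread_mutex_unlock(&registry_lock);

	if (node)
		node_put(node);
}

/* The handler only touches the node while it holds a reference. */
static int clear_counter(uint32_t id)
{
	struct ras_node *node = node_get(id);

	if (!node)
		return -ENOENT;

	node->counter = 0;
	node_put(node);
	return 0;
}
```

The point of the sketch is only that the lookup and the pin happen under the same lock, so a concurrent unregister cannot release the node between the two steps.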
**Return value for unsupported clear operation:**
```c
if (!node || !node->clear_error_counter)
	return -ENOENT;
```
Returning `-ENOENT` when `clear_error_counter` is NULL (i.e., the node exists but doesn't support clearing) is slightly misleading. `-EOPNOTSUPP` would be more appropriate when the node exists but the operation is not implemented.
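One way the two cases could be separated is sketched below as standalone C rather than as a change to the patch. `struct ras_node`, `ras_clear_counter()` and `demo_clear()` are hypothetical stand-ins for this example; only the errno values and the order of the checks are the point.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node; only the fields used here. */
struct ras_node {
	uint32_t first_id;	/* lowest valid error ID */
	uint32_t last_id;	/* highest valid error ID */
	int (*clear_error_counter)(struct ras_node *node, uint32_t error_id);
};

/* Illustrative helper, not part of the patch under review. */
static int ras_clear_counter(struct ras_node *node, uint32_t error_id)
{
	if (!node)
		return -ENOENT;		/* no node registered under this ID */
	if (!node->clear_error_counter)
		return -EOPNOTSUPP;	/* node exists, clearing not implemented */
	if (error_id < node->first_id || error_id > node->last_id)
		return -EINVAL;		/* error ID outside the node's range */

	return node->clear_error_counter(node, error_id);
}

/* Trivial callback used only to exercise the success path. */
static int demo_clear(struct ras_node *node, uint32_t error_id)
{
	(void)node;
	(void)error_id;
	return 0;
}
```

With this split, userspace can tell a wrong `node-id` apart from a node whose driver does not implement clearing.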
**YAML anchor reuse is a nice touch:**
```yaml
attributes: &id-attrs
  - node-id
  - error-id
```
Using `&id-attrs` / `*id-attrs` to avoid duplicating the attribute list is good.
---
Generated by Claude Code Patch Reviewer
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
@ 2026-04-09 7:21 ` Tauro, Riana
2026-04-09 13:37 ` Rodrigo Vivi
2026-04-09 23:01 ` Zack McKevitt
2026-04-12 1:34 ` Claude review: " Claude Code Review Bot
1 sibling, 2 replies; 12+ messages in thread
From: Tauro, Riana @ 2026-04-09 7:21 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev, rodrigo.vivi, Zack McKevitt,
joonas.lahtinen, aravind.iddamsetty
Cc: anshuman.gupta, simona.vetter, airlied, pratik.bari,
joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
ravi.kishore.koppuravuri, raag.jadav, anvesh.bakwad,
maarten.lankhorst, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
Hi Zack,
Could you please take a look at this patch and check whether it is
applicable to your use case? Please let me know if any
changes are required.
@Rodrigo, this has already been reviewed by Jakub and Raag.
If there are no open issues, can this be merged via drm-misc?
Thanks
Riana
On 4/9/2026 1:03 PM, Riana Tauro wrote:
> Introduce a new 'clear-error-counter' drm_ras command to reset the counter
> value for a specific error counter of a given node.
>
> The command is a 'do' netlink request with 'node-id' and 'error-id'
> as parameters with no response payload.
>
> Usage:
>
> $ sudo ynl --family drm_ras --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
> Reviewed-by: Raag Jadav <raag.jadav@intel.com>
> ---
> Documentation/gpu/drm-ras.rst | 8 +++++
> Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
> drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
> drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
> drivers/gpu/drm/drm_ras_nl.h | 2 ++
> include/drm/drm_ras.h | 11 ++++++
> include/uapi/drm/drm_ras.h | 1 +
> 7 files changed, 89 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 70b246a78fc8..4636e68f5678 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -52,6 +52,8 @@ User space tools can:
> as a parameter.
> * Query specific error counter values with the ``get-error-counter`` command, using both
> ``node-id`` and ``error-id`` as parameters.
> +* Clear specific error counters with the ``clear-error-counter`` command, using both
> + ``node-id`` and ``error-id`` as parameters.
>
> YAML-based Interface
> --------------------
> @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
> sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
> {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
>
> +Example: Clear an error counter for a given node
> +
> +.. code-block:: bash
> +
> + sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
> + None
> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> index 79af25dac3c5..e113056f8c01 100644
> --- a/Documentation/netlink/specs/drm_ras.yaml
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -99,7 +99,7 @@ operations:
> flags: [admin-perm]
> do:
> request:
> - attributes:
> + attributes: &id-attrs
> - node-id
> - error-id
> reply:
> @@ -113,3 +113,14 @@ operations:
> - node-id
> reply:
> attributes: *errorinfo
> + -
> + name: clear-error-counter
> + doc: >-
> + Clear error counter for a given node.
> + The request includes the error-id and node-id of the
> + counter to be cleared.
> + attribute-set: error-counter-attrs
> + flags: [admin-perm]
> + do:
> + request:
> + attributes: *id-attrs
> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> index b2fa5ab86d87..d6eab29a1394 100644
> --- a/drivers/gpu/drm/drm_ras.c
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -26,7 +26,7 @@
> * efficient lookup by ID. Nodes can be registered or unregistered
> * dynamically at runtime.
> *
> - * A Generic Netlink family `drm_ras` exposes two main operations to
> + * A Generic Netlink family `drm_ras` exposes the below operations to
> * userspace:
> *
> * 1. LIST_NODES: Dump all currently registered RAS nodes.
> @@ -37,6 +37,10 @@
> * Returns all counters of a node if only Node ID is provided or specific
> * error counters.
> *
> + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
> + * Userspace must provide Node ID, Error ID.
> + * Clears specific error counter of a node if supported.
> + *
> * Node registration:
> *
> * - drm_ras_node_register(): Registers a new node and assigns
> @@ -66,6 +70,8 @@
> * operation, fetching all counters from a specific node.
> * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
> * operation, fetching a counter value from a specific node.
> + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
> + * operation, clearing a counter value from a specific node.
> */
>
> static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
> return doit_reply_value(info, node_id, error_id);
> }
>
> +/**
> + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
> + * @skb: Netlink message buffer
> + * @info: Generic Netlink info containing attributes of the request
> + *
> + * Extracts the node ID and error ID from the netlink attributes and
> + * clears the current value.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> + struct genl_info *info)
> +{
> + struct drm_ras_node *node;
> + u32 node_id, error_id;
> +
> + if (!info->attrs ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
> + return -EINVAL;
> +
> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> +
> + node = xa_load(&drm_ras_xa, node_id);
> + if (!node || !node->clear_error_counter)
> + return -ENOENT;
> +
> + if (error_id < node->error_counter_range.first ||
> + error_id > node->error_counter_range.last)
> + return -EINVAL;
> +
> + return node->clear_error_counter(node, error_id);
> +}
> +
> /**
> * drm_ras_node_register() - Register a new RAS node
> * @node: Node structure to register
> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> index 16803d0c4a44..dea1c1b2494e 100644
> --- a/drivers/gpu/drm/drm_ras_nl.c
> +++ b/drivers/gpu/drm/drm_ras_nl.c
> @@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
> [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> };
>
> +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
> +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> +};
> +
> /* Ops table for drm_ras */
> static const struct genl_split_ops drm_ras_nl_ops[] = {
> {
> @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
> .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> },
> + {
> + .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> + .doit = drm_ras_nl_clear_error_counter_doit,
> + .policy = drm_ras_clear_error_counter_nl_policy,
> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> + },
> };
>
> struct genl_family drm_ras_nl_family __ro_after_init = {
> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
> index 06ccd9342773..a398643572a5 100644
> --- a/drivers/gpu/drm/drm_ras_nl.h
> +++ b/drivers/gpu/drm/drm_ras_nl.h
> @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
> struct genl_info *info);
> int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
> struct netlink_callback *cb);
> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> + struct genl_info *info);
>
> extern struct genl_family drm_ras_nl_family;
>
> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> index 5d50209e51db..f2a787bc4f64 100644
> --- a/include/drm/drm_ras.h
> +++ b/include/drm/drm_ras.h
> @@ -58,6 +58,17 @@ struct drm_ras_node {
> int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
> const char **name, u32 *val);
>
> + /**
> + * @clear_error_counter:
> + *
> + * This callback is used by drm_ras to clear a specific error counter.
> + * Driver should implement this callback to support clearing error counters
> + * of a node.
> + *
> + * Returns: 0 on success, negative error code on failure.
> + */
> + int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
> +
> /** @priv: Driver private data */
> void *priv;
> };
> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> index 5f40fa5b869d..218a3ee86805 100644
> --- a/include/uapi/drm/drm_ras.h
> +++ b/include/uapi/drm/drm_ras.h
> @@ -41,6 +41,7 @@ enum {
> enum {
> DRM_RAS_CMD_LIST_NODES = 1,
> DRM_RAS_CMD_GET_ERROR_COUNTER,
> + DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>
> __DRM_RAS_CMD_MAX,
> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
* [PATCH v2 0/2] Add clear-error-counter command to drm_ras
@ 2026-04-09 7:33 Riana Tauro
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Riana Tauro @ 2026-04-09 7:33 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro
Add a clear-error-counter command to drm_ras to clear a specific error
counter of a node. The request parameters for this command are
node-id and error-id; there is no response payload.
Implement the callback in the XE driver to demonstrate usage.
Usage:
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 3}]
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":2}'
None
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
Rev2: Split patches
Riana Tauro (2):
drm/drm_ras: Add clear-error-counter netlink command to drm_ras
drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras
Documentation/gpu/drm-ras.rst | 8 +++++
Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
drivers/gpu/drm/drm_ras_nl.h | 2 ++
drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++--
include/drm/drm_ras.h | 11 ++++++
include/uapi/drm/drm_ras.h | 1 +
8 files changed, 122 insertions(+), 4 deletions(-)
--
2.47.1
* [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
@ 2026-04-09 7:33 ` Riana Tauro
2026-04-09 7:21 ` Tauro, Riana
2026-04-12 1:34 ` Claude review: " Claude Code Review Bot
2026-04-09 7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
2026-04-12 1:34 ` Claude review: Add clear-error-counter command to drm_ras Claude Code Review Bot
2 siblings, 2 replies; 12+ messages in thread
From: Riana Tauro @ 2026-04-09 7:33 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
Introduce a new 'clear-error-counter' drm_ras command to reset the counter
value for a specific error counter of a given node.
The command is a 'do' netlink request with 'node-id' and 'error-id'
as parameters with no response payload.
Usage:
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
Documentation/gpu/drm-ras.rst | 8 +++++
Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
drivers/gpu/drm/drm_ras_nl.h | 2 ++
include/drm/drm_ras.h | 11 ++++++
include/uapi/drm/drm_ras.h | 1 +
7 files changed, 89 insertions(+), 2 deletions(-)
diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 70b246a78fc8..4636e68f5678 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -52,6 +52,8 @@ User space tools can:
as a parameter.
* Query specific error counter values with the ``get-error-counter`` command, using both
``node-id`` and ``error-id`` as parameters.
+* Clear specific error counters with the ``clear-error-counter`` command, using both
+ ``node-id`` and ``error-id`` as parameters.
YAML-based Interface
--------------------
@@ -101,3 +103,9 @@ Example: Query an error counter for a given node
sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
+Example: Clear an error counter for a given node
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
+ None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index 79af25dac3c5..e113056f8c01 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -99,7 +99,7 @@ operations:
flags: [admin-perm]
do:
request:
- attributes:
+ attributes: &id-attrs
- node-id
- error-id
reply:
@@ -113,3 +113,14 @@ operations:
- node-id
reply:
attributes: *errorinfo
+ -
+ name: clear-error-counter
+ doc: >-
+ Clear error counter for a given node.
+ The request includes the error-id and node-id of the
+ counter to be cleared.
+ attribute-set: error-counter-attrs
+ flags: [admin-perm]
+ do:
+ request:
+ attributes: *id-attrs
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index b2fa5ab86d87..d6eab29a1394 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -26,7 +26,7 @@
* efficient lookup by ID. Nodes can be registered or unregistered
* dynamically at runtime.
*
- * A Generic Netlink family `drm_ras` exposes two main operations to
+ * A Generic Netlink family `drm_ras` exposes the below operations to
* userspace:
*
* 1. LIST_NODES: Dump all currently registered RAS nodes.
@@ -37,6 +37,10 @@
* Returns all counters of a node if only Node ID is provided or specific
* error counters.
*
+ * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
+ * Userspace must provide Node ID, Error ID.
+ * Clears specific error counter of a node if supported.
+ *
* Node registration:
*
* - drm_ras_node_register(): Registers a new node and assigns
@@ -66,6 +70,8 @@
* operation, fetching all counters from a specific node.
* - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
* operation, fetching a counter value from a specific node.
+ * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
+ * operation, clearing a counter value from a specific node.
*/
static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
return doit_reply_value(info, node_id, error_id);
}
+/**
+ * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * clears the current value.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct drm_ras_node *node;
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->clear_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->clear_error_counter(node, error_id);
+}
+
/**
* drm_ras_node_register() - Register a new RAS node
* @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index 16803d0c4a44..dea1c1b2494e 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
};
+/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
/* Ops table for drm_ras */
static const struct genl_split_ops drm_ras_nl_ops[] = {
{
@@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
.maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
},
+ {
+ .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+ .doit = drm_ras_nl_clear_error_counter_doit,
+ .policy = drm_ras_clear_error_counter_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index 06ccd9342773..a398643572a5 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
struct genl_info *info);
int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+ struct genl_info *info);
extern struct genl_family drm_ras_nl_family;
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index 5d50209e51db..f2a787bc4f64 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -58,6 +58,17 @@ struct drm_ras_node {
int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
const char **name, u32 *val);
+ /**
+ * @clear_error_counter:
+ *
+ * This callback is used by drm_ras to clear a specific error counter.
+ * Driver should implement this callback to support clearing error counters
+ * of a node.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+ int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+
/** @priv: Driver private data */
void *priv;
};
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 5f40fa5b869d..218a3ee86805 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -41,6 +41,7 @@ enum {
enum {
DRM_RAS_CMD_LIST_NODES = 1,
DRM_RAS_CMD_GET_ERROR_COUNTER,
+ DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
__DRM_RAS_CMD_MAX,
DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
--
2.47.1
* [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras
2026-04-09 7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
@ 2026-04-09 7:33 ` Riana Tauro
2026-04-12 1:34 ` Claude review: " Claude Code Review Bot
2026-04-12 1:34 ` Claude review: Add clear-error-counter command to drm_ras Claude Code Review Bot
2 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2026-04-09 7:33 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro
Add support for the clear-error-counter command in XE drm_ras.
This resets the counter value.
Usage:
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index e07dc23a155e..c21c8b428de6 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
return 0;
}
+static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+{
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	atomic_set(&info[error_id].counter, 0);
+
+	return 0;
+}
+
static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
const char **name, u32 *val)
{
@@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
return hw_query_error_counter(info, error_id, name, val);
}
+static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
const char **name, u32 *val)
{
@@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
return hw_query_error_counter(info, error_id, name, val);
}
+static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
{
struct xe_drm_ras_counter *counter;
@@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
if (IS_ERR(ras->info[severity]))
return PTR_ERR(ras->info[severity]);
- if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
node->query_error_counter = query_correctable_error_counter;
- else
+ node->clear_error_counter = clear_correctable_error_counter;
+ } else {
node->query_error_counter = query_uncorrectable_error_counter;
+ node->clear_error_counter = clear_uncorrectable_error_counter;
+ }
return 0;
}
--
2.47.1
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 7:21 ` Tauro, Riana
@ 2026-04-09 13:37 ` Rodrigo Vivi
2026-04-10 5:21 ` Tauro, Riana
2026-04-09 23:01 ` Zack McKevitt
1 sibling, 1 reply; 12+ messages in thread
From: Rodrigo Vivi @ 2026-04-09 13:37 UTC (permalink / raw)
To: Tauro, Riana
Cc: intel-xe, dri-devel, netdev, Zack McKevitt, joonas.lahtinen,
aravind.iddamsetty, anshuman.gupta, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
anvesh.bakwad, maarten.lankhorst, Jakub Kicinski, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet
On Thu, Apr 09, 2026 at 12:51:44PM +0530, Tauro, Riana wrote:
> Hi Zack,
>
> Could you please take a look at this patch and check whether it is
> applicable to your use case? Please let me know if any
> changes are required.
>
> @Rodrigo, this has already been reviewed by Jakub and Raag.
> If there are no open issues, can this be merged via drm-misc?
If we push this to drm-misc-next, it might take a few weeks to propagate
back to drm-xe-next. With other work from you and Raag moving at a fast pace
on drm-xe-next in this area, I'm afraid it could cause some conflicts.
That is definitely fine by me, but another option is to get an ack from the
drm-misc maintainers and take this through drm-xe-next.
So, are you really okay with drm-misc-next?
>
> Thanks
> Riana
>
> On 4/9/2026 1:03 PM, Riana Tauro wrote:
> > Introduce a new 'clear-error-counter' drm_ras command to reset the counter
> > value for a specific error counter of a given node.
> >
> > The command is a 'do' netlink request with 'node-id' and 'error-id'
> > as parameters with no response payload.
> >
> > Usage:
> >
> > $ sudo ynl --family drm_ras --do clear-error-counter --json \
> > '{"node-id":1, "error-id":1}'
> > None
> >
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> > Cc: Lijo Lazar <lijo.lazar@amd.com>
> > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > Cc: David S. Miller <davem@davemloft.net>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > Reviewed-by: Jakub Kicinski <kuba@kernel.org>
> > Reviewed-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> > Documentation/gpu/drm-ras.rst | 8 +++++
> > Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
> > drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
> > drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
> > drivers/gpu/drm/drm_ras_nl.h | 2 ++
> > include/drm/drm_ras.h | 11 ++++++
> > include/uapi/drm/drm_ras.h | 1 +
> > 7 files changed, 89 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> > index 70b246a78fc8..4636e68f5678 100644
> > --- a/Documentation/gpu/drm-ras.rst
> > +++ b/Documentation/gpu/drm-ras.rst
> > @@ -52,6 +52,8 @@ User space tools can:
> > as a parameter.
> > * Query specific error counter values with the ``get-error-counter`` command, using both
> > ``node-id`` and ``error-id`` as parameters.
> > +* Clear specific error counters with the ``clear-error-counter`` command, using both
> > + ``node-id`` and ``error-id`` as parameters.
> > YAML-based Interface
> > --------------------
> > @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
> > sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
> > {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
> > +Example: Clear an error counter for a given node
> > +
> > +.. code-block:: bash
> > +
> > + sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
> > + None
> > diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> > index 79af25dac3c5..e113056f8c01 100644
> > --- a/Documentation/netlink/specs/drm_ras.yaml
> > +++ b/Documentation/netlink/specs/drm_ras.yaml
> > @@ -99,7 +99,7 @@ operations:
> > flags: [admin-perm]
> > do:
> > request:
> > - attributes:
> > + attributes: &id-attrs
> > - node-id
> > - error-id
> > reply:
> > @@ -113,3 +113,14 @@ operations:
> > - node-id
> > reply:
> > attributes: *errorinfo
> > + -
> > + name: clear-error-counter
> > + doc: >-
> > + Clear error counter for a given node.
> > + The request includes the error-id and node-id of the
> > + counter to be cleared.
> > + attribute-set: error-counter-attrs
> > + flags: [admin-perm]
> > + do:
> > + request:
> > + attributes: *id-attrs
> > diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> > index b2fa5ab86d87..d6eab29a1394 100644
> > --- a/drivers/gpu/drm/drm_ras.c
> > +++ b/drivers/gpu/drm/drm_ras.c
> > @@ -26,7 +26,7 @@
> > * efficient lookup by ID. Nodes can be registered or unregistered
> > * dynamically at runtime.
> > *
> > - * A Generic Netlink family `drm_ras` exposes two main operations to
> > + * A Generic Netlink family `drm_ras` exposes the below operations to
> > * userspace:
> > *
> > * 1. LIST_NODES: Dump all currently registered RAS nodes.
> > @@ -37,6 +37,10 @@
> > * Returns all counters of a node if only Node ID is provided or specific
> > * error counters.
> > *
> > + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
> > + * Userspace must provide Node ID, Error ID.
> > + * Clears specific error counter of a node if supported.
> > + *
> > * Node registration:
> > *
> > * - drm_ras_node_register(): Registers a new node and assigns
> > @@ -66,6 +70,8 @@
> > * operation, fetching all counters from a specific node.
> > * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
> > * operation, fetching a counter value from a specific node.
> > + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
> > + * operation, clearing a counter value from a specific node.
> > */
> > static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> > @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
> > return doit_reply_value(info, node_id, error_id);
> > }
> > +/**
> > + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
> > + * @skb: Netlink message buffer
> > + * @info: Generic Netlink info containing attributes of the request
> > + *
> > + * Extracts the node ID and error ID from the netlink attributes and
> > + * clears the current value.
> > + *
> > + * Return: 0 on success, or negative errno on failure.
> > + */
> > +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> > + struct genl_info *info)
> > +{
> > + struct drm_ras_node *node;
> > + u32 node_id, error_id;
> > +
> > + if (!info->attrs ||
> > + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
> > + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
> > + return -EINVAL;
> > +
> > + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> > + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> > +
> > + node = xa_load(&drm_ras_xa, node_id);
> > + if (!node || !node->clear_error_counter)
> > + return -ENOENT;
> > +
> > + if (error_id < node->error_counter_range.first ||
> > + error_id > node->error_counter_range.last)
> > + return -EINVAL;
> > +
> > + return node->clear_error_counter(node, error_id);
> > +}
> > +
> > /**
> > * drm_ras_node_register() - Register a new RAS node
> > * @node: Node structure to register
> > diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> > index 16803d0c4a44..dea1c1b2494e 100644
> > --- a/drivers/gpu/drm/drm_ras_nl.c
> > +++ b/drivers/gpu/drm/drm_ras_nl.c
> > @@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
> > [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> > };
> > +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
> > +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> > + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> > + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> > +};
> > +
> > /* Ops table for drm_ras */
> > static const struct genl_split_ops drm_ras_nl_ops[] = {
> > {
> > @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
> > .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> > },
> > + {
> > + .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> > + .doit = drm_ras_nl_clear_error_counter_doit,
> > + .policy = drm_ras_clear_error_counter_nl_policy,
> > + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> > + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > + },
> > };
> > struct genl_family drm_ras_nl_family __ro_after_init = {
> > diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
> > index 06ccd9342773..a398643572a5 100644
> > --- a/drivers/gpu/drm/drm_ras_nl.h
> > +++ b/drivers/gpu/drm/drm_ras_nl.h
> > @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
> > struct genl_info *info);
> > int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
> > struct netlink_callback *cb);
> > +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> > + struct genl_info *info);
> > extern struct genl_family drm_ras_nl_family;
> > diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> > index 5d50209e51db..f2a787bc4f64 100644
> > --- a/include/drm/drm_ras.h
> > +++ b/include/drm/drm_ras.h
> > @@ -58,6 +58,17 @@ struct drm_ras_node {
> > int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
> > const char **name, u32 *val);
> > + /**
> > + * @clear_error_counter:
> > + *
> > + * This callback is used by drm_ras to clear a specific error counter.
> > + * Driver should implement this callback to support clearing error counters
> > + * of a node.
> > + *
> > + * Returns: 0 on success, negative error code on failure.
> > + */
> > + int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
> > +
> > /** @priv: Driver private data */
> > void *priv;
> > };
> > diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> > index 5f40fa5b869d..218a3ee86805 100644
> > --- a/include/uapi/drm/drm_ras.h
> > +++ b/include/uapi/drm/drm_ras.h
> > @@ -41,6 +41,7 @@ enum {
> > enum {
> > DRM_RAS_CMD_LIST_NODES = 1,
> > DRM_RAS_CMD_GET_ERROR_COUNTER,
> > + DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> > __DRM_RAS_CMD_MAX,
> > DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
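
The check order in the quoted drm_ras_nl_clear_error_counter_doit() can be modeled in plain userspace C. This is a minimal sketch, not the kernel code: `struct ras_node`, `node_table`, `node_lookup()`, `clear_error_counter_request()` and `demo_clear()` are hypothetical stand-ins for `struct drm_ras_node`, the `drm_ras_xa` xarray and a driver callback. It only illustrates which errno each failing check yields.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct drm_ras_node: only the fields the handler reads. */
struct ras_node {
	struct { uint32_t first, last; } error_counter_range;
	int (*clear_error_counter)(struct ras_node *node, uint32_t error_id);
	void *priv;
};

#define MAX_NODES 4

/* Stand-in for the drm_ras_xa xarray: a NULL slot means "not registered". */
static struct ras_node *node_table[MAX_NODES];

static struct ras_node *node_lookup(uint32_t node_id)
{
	return node_id < MAX_NODES ? node_table[node_id] : NULL;
}

/* Hypothetical driver callback: zeroes one slot of a counter array kept in priv. */
static int demo_clear(struct ras_node *node, uint32_t error_id)
{
	uint32_t *counters = node->priv;

	assert(counters != NULL);
	counters[error_id - node->error_counter_range.first] = 0;
	return 0;
}

/* Same check order as the quoted handler: lookup, callback, range, then clear. */
static int clear_error_counter_request(uint32_t node_id, uint32_t error_id)
{
	struct ras_node *node = node_lookup(node_id);

	/* Unknown node, or node without a clear callback: both map to -ENOENT. */
	if (!node || !node->clear_error_counter)
		return -ENOENT;

	/* The range is inclusive on both ends. */
	if (error_id < node->error_counter_range.first ||
	    error_id > node->error_counter_range.last)
		return -EINVAL;

	return node->clear_error_counter(node, error_id);
}
```

In this model, a node that is registered but has no callback is indistinguishable from a missing node, because both return -ENOENT before the range check runs.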
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 7:21 ` Tauro, Riana
2026-04-09 13:37 ` Rodrigo Vivi
@ 2026-04-09 23:01 ` Zack McKevitt
2026-04-10 5:25 ` Tauro, Riana
1 sibling, 1 reply; 12+ messages in thread
From: Zack McKevitt @ 2026-04-09 23:01 UTC (permalink / raw)
To: Tauro, Riana, intel-xe, dri-devel, netdev, rodrigo.vivi,
joonas.lahtinen, aravind.iddamsetty
Cc: anshuman.gupta, simona.vetter, airlied, pratik.bari,
joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
ravi.kishore.koppuravuri, raag.jadav, anvesh.bakwad,
maarten.lankhorst, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
On 4/9/2026 1:21 AM, Tauro, Riana wrote:
> Hi Zack
>
> Could you please take a look at this patch, if it is applicable to your
> use case? Please let me know if any changes are required.
>
From a quick glance, I think this looks good from our end.
Thanks,
Zack
> @Rodrigo This is already reviewed by Jakub and Raag.
> If there are no open issues, can this be merged via drm-misc?
>
> Thanks
> Riana
>
> On 4/9/2026 1:03 PM, Riana Tauro wrote:
>> Introduce a new 'clear-error-counter' drm_ras command to reset the counter
>> value for a specific error counter of a given node.
>>
>> The command is a 'do' netlink request with 'node-id' and 'error-id'
>> as parameters with no response payload.
>>
>> Usage:
>>
>> $ sudo ynl --family drm_ras --do clear-error-counter --json \
>> '{"node-id":1, "error-id":1}'
>> None
>>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>> Reviewed-by: Raag Jadav <raag.jadav@intel.com>
>> ---
>> Documentation/gpu/drm-ras.rst | 8 +++++
>> Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
>> drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
>> drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
>> drivers/gpu/drm/drm_ras_nl.h | 2 ++
>> include/drm/drm_ras.h | 11 ++++++
>> include/uapi/drm/drm_ras.h | 1 +
>> 7 files changed, 89 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> index 70b246a78fc8..4636e68f5678 100644
>> --- a/Documentation/gpu/drm-ras.rst
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -52,6 +52,8 @@ User space tools can:
>> as a parameter.
>> * Query specific error counter values with the ``get-error-counter`` command, using both
>> ``node-id`` and ``error-id`` as parameters.
>> +* Clear specific error counters with the ``clear-error-counter`` command, using both
>> + ``node-id`` and ``error-id`` as parameters.
>> YAML-based Interface
>> --------------------
>> @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
>> sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
>> {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
>> +Example: Clear an error counter for a given node
>> +
>> +.. code-block:: bash
>> +
>> + sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>> + None
>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>> index 79af25dac3c5..e113056f8c01 100644
>> --- a/Documentation/netlink/specs/drm_ras.yaml
>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>> @@ -99,7 +99,7 @@ operations:
>> flags: [admin-perm]
>> do:
>> request:
>> - attributes:
>> + attributes: &id-attrs
>> - node-id
>> - error-id
>> reply:
>> @@ -113,3 +113,14 @@ operations:
>> - node-id
>> reply:
>> attributes: *errorinfo
>> + -
>> + name: clear-error-counter
>> + doc: >-
>> + Clear error counter for a given node.
>> + The request includes the error-id and node-id of the
>> + counter to be cleared.
>> + attribute-set: error-counter-attrs
>> + flags: [admin-perm]
>> + do:
>> + request:
>> + attributes: *id-attrs
>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>> index b2fa5ab86d87..d6eab29a1394 100644
>> --- a/drivers/gpu/drm/drm_ras.c
>> +++ b/drivers/gpu/drm/drm_ras.c
>> @@ -26,7 +26,7 @@
>> * efficient lookup by ID. Nodes can be registered or unregistered
>> * dynamically at runtime.
>> *
>> - * A Generic Netlink family `drm_ras` exposes two main operations to
>> + * A Generic Netlink family `drm_ras` exposes the below operations to
>> * userspace:
>> *
>> * 1. LIST_NODES: Dump all currently registered RAS nodes.
>> @@ -37,6 +37,10 @@
>> * Returns all counters of a node if only Node ID is provided or specific
>> * error counters.
>> *
>> + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
>> + * Userspace must provide Node ID, Error ID.
>> + * Clears specific error counter of a node if supported.
>> + *
>> * Node registration:
>> *
>> * - drm_ras_node_register(): Registers a new node and assigns
>> @@ -66,6 +70,8 @@
>> * operation, fetching all counters from a specific node.
>> * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
>> * operation, fetching a counter value from a specific node.
>> + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
>> + * operation, clearing a counter value from a specific node.
>> */
>> static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>> @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>> return doit_reply_value(info, node_id, error_id);
>> }
>> +/**
>> + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
>> + * @skb: Netlink message buffer
>> + * @info: Generic Netlink info containing attributes of the request
>> + *
>> + * Extracts the node ID and error ID from the netlink attributes and
>> + * clears the current value.
>> + *
>> + * Return: 0 on success, or negative errno on failure.
>> + */
>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>> + struct genl_info *info)
>> +{
>> + struct drm_ras_node *node;
>> + u32 node_id, error_id;
>> +
>> + if (!info->attrs ||
>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
>> + return -EINVAL;
>> +
>> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>> +
>> + node = xa_load(&drm_ras_xa, node_id);
>> + if (!node || !node->clear_error_counter)
>> + return -ENOENT;
>> +
>> + if (error_id < node->error_counter_range.first ||
>> + error_id > node->error_counter_range.last)
>> + return -EINVAL;
>> +
>> + return node->clear_error_counter(node, error_id);
>> +}
>> +
>> /**
>> * drm_ras_node_register() - Register a new RAS node
>> * @node: Node structure to register
>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>> index 16803d0c4a44..dea1c1b2494e 100644
>> --- a/drivers/gpu/drm/drm_ras_nl.c
>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>> @@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
>> [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> };
>> +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
>> +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>> +};
>> +
>> /* Ops table for drm_ras */
>> static const struct genl_split_ops drm_ras_nl_ops[] = {
>> {
>> @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
>> .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>> },
>> + {
>> + .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>> + .doit = drm_ras_nl_clear_error_counter_doit,
>> + .policy = drm_ras_clear_error_counter_nl_policy,
>> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>> + },
>> };
>> struct genl_family drm_ras_nl_family __ro_after_init = {
>> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
>> index 06ccd9342773..a398643572a5 100644
>> --- a/drivers/gpu/drm/drm_ras_nl.h
>> +++ b/drivers/gpu/drm/drm_ras_nl.h
>> @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>> struct genl_info *info);
>> int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
>> struct netlink_callback *cb);
>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>> + struct genl_info *info);
>> extern struct genl_family drm_ras_nl_family;
>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>> index 5d50209e51db..f2a787bc4f64 100644
>> --- a/include/drm/drm_ras.h
>> +++ b/include/drm/drm_ras.h
>> @@ -58,6 +58,17 @@ struct drm_ras_node {
>> int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
>> const char **name, u32 *val);
>> + /**
>> + * @clear_error_counter:
>> + *
>> + * This callback is used by drm_ras to clear a specific error counter.
>> + * Driver should implement this callback to support clearing error counters
>> + * of a node.
>> + *
>> + * Returns: 0 on success, negative error code on failure.
>> + */
>> + int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
>> +
>> /** @priv: Driver private data */
>> void *priv;
>> };
>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>> index 5f40fa5b869d..218a3ee86805 100644
>> --- a/include/uapi/drm/drm_ras.h
>> +++ b/include/uapi/drm/drm_ras.h
>> @@ -41,6 +41,7 @@ enum {
>> enum {
>> DRM_RAS_CMD_LIST_NODES = 1,
>> DRM_RAS_CMD_GET_ERROR_COUNTER,
>> + DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>> __DRM_RAS_CMD_MAX,
>> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 13:37 ` Rodrigo Vivi
@ 2026-04-10 5:21 ` Tauro, Riana
0 siblings, 0 replies; 12+ messages in thread
From: Tauro, Riana @ 2026-04-10 5:21 UTC (permalink / raw)
To: Rodrigo Vivi, maarten.lankhorst
Cc: intel-xe, dri-devel, netdev, Zack McKevitt, joonas.lahtinen,
aravind.iddamsetty, anshuman.gupta, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
anvesh.bakwad, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
Hi Rodrigo
On 4/9/2026 7:07 PM, Rodrigo Vivi wrote:
> On Thu, Apr 09, 2026 at 12:51:44PM +0530, Tauro, Riana wrote:
>> Hi Zack
>>
>> Could you please take a look at this patch, if it is applicable to your use case?
>> Please let me know if any changes are required.
>>
>> @Rodrigo This is already reviewed by Jakub and Raag.
>> If there are no open issues, can this be merged via drm-misc?
> If we push this to drm-misc-next, it might take a few weeks to propagate
> back to drm-xe-next. With other work from you and Raag moving at a fast pace
> on drm-xe-next in this area, I'm afraid it could cause some conflicts.
>
> It is definitely fine by me, but another option is to get an ack from the
> drm-misc maintainers and take this through drm-xe-next.
>
Yes, going through drm-xe-next would be better, since the other RAS patches are close to merging.
@Maarten Can you please help with an ack if this patch looks good to you?
This has been reviewed by Jakub from netdev and Raag from intel-xe.
There are no other open issues.
Thanks
Riana
>
> So, are you really okay with drm-misc-next?
>
>> Thanks
>> Riana
>>
>> On 4/9/2026 1:03 PM, Riana Tauro wrote:
>>> Introduce a new 'clear-error-counter' drm_ras command to reset the counter
>>> value for a specific error counter of a given node.
>>>
>>> The command is a 'do' netlink request with 'node-id' and 'error-id'
>>> as parameters with no response payload.
>>>
>>> Usage:
>>>
>>> $ sudo ynl --family drm_ras --do clear-error-counter --json \
>>> '{"node-id":1, "error-id":1}'
>>> None
>>>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>> Cc: David S. Miller <davem@davemloft.net>
>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>>> Reviewed-by: Raag Jadav <raag.jadav@intel.com>
>>> ---
>>> Documentation/gpu/drm-ras.rst | 8 +++++
>>> Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
>>> drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
>>> drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
>>> drivers/gpu/drm/drm_ras_nl.h | 2 ++
>>> include/drm/drm_ras.h | 11 ++++++
>>> include/uapi/drm/drm_ras.h | 1 +
>>> 7 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>>> index 70b246a78fc8..4636e68f5678 100644
>>> --- a/Documentation/gpu/drm-ras.rst
>>> +++ b/Documentation/gpu/drm-ras.rst
>>> @@ -52,6 +52,8 @@ User space tools can:
>>> as a parameter.
>>> * Query specific error counter values with the ``get-error-counter`` command, using both
>>> ``node-id`` and ``error-id`` as parameters.
>>> +* Clear specific error counters with the ``clear-error-counter`` command, using both
>>> + ``node-id`` and ``error-id`` as parameters.
>>> YAML-based Interface
>>> --------------------
>>> @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
>>> sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
>>> {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
>>> +Example: Clear an error counter for a given node
>>> +
>>> +.. code-block:: bash
>>> +
>>> + sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>>> + None
>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>>> index 79af25dac3c5..e113056f8c01 100644
>>> --- a/Documentation/netlink/specs/drm_ras.yaml
>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>> @@ -99,7 +99,7 @@ operations:
>>> flags: [admin-perm]
>>> do:
>>> request:
>>> - attributes:
>>> + attributes: &id-attrs
>>> - node-id
>>> - error-id
>>> reply:
>>> @@ -113,3 +113,14 @@ operations:
>>> - node-id
>>> reply:
>>> attributes: *errorinfo
>>> + -
>>> + name: clear-error-counter
>>> + doc: >-
>>> + Clear error counter for a given node.
>>> + The request includes the error-id and node-id of the
>>> + counter to be cleared.
>>> + attribute-set: error-counter-attrs
>>> + flags: [admin-perm]
>>> + do:
>>> + request:
>>> + attributes: *id-attrs
>>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>>> index b2fa5ab86d87..d6eab29a1394 100644
>>> --- a/drivers/gpu/drm/drm_ras.c
>>> +++ b/drivers/gpu/drm/drm_ras.c
>>> @@ -26,7 +26,7 @@
>>> * efficient lookup by ID. Nodes can be registered or unregistered
>>> * dynamically at runtime.
>>> *
>>> - * A Generic Netlink family `drm_ras` exposes two main operations to
>>> + * A Generic Netlink family `drm_ras` exposes the below operations to
>>> * userspace:
>>> *
>>> * 1. LIST_NODES: Dump all currently registered RAS nodes.
>>> @@ -37,6 +37,10 @@
>>> * Returns all counters of a node if only Node ID is provided or specific
>>> * error counters.
>>> *
>>> + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
>>> + * Userspace must provide Node ID, Error ID.
>>> + * Clears specific error counter of a node if supported.
>>> + *
>>> * Node registration:
>>> *
>>> * - drm_ras_node_register(): Registers a new node and assigns
>>> @@ -66,6 +70,8 @@
>>> * operation, fetching all counters from a specific node.
>>> * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
>>> * operation, fetching a counter value from a specific node.
>>> + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
>>> + * operation, clearing a counter value from a specific node.
>>> */
>>> static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>>> @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>>> return doit_reply_value(info, node_id, error_id);
>>> }
>>> +/**
>>> + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
>>> + * @skb: Netlink message buffer
>>> + * @info: Generic Netlink info containing attributes of the request
>>> + *
>>> + * Extracts the node ID and error ID from the netlink attributes and
>>> + * clears the current value.
>>> + *
>>> + * Return: 0 on success, or negative errno on failure.
>>> + */
>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>> + struct genl_info *info)
>>> +{
>>> + struct drm_ras_node *node;
>>> + u32 node_id, error_id;
>>> +
>>> + if (!info->attrs ||
>>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
>>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
>>> + return -EINVAL;
>>> +
>>> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>>> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>>> +
>>> + node = xa_load(&drm_ras_xa, node_id);
>>> + if (!node || !node->clear_error_counter)
>>> + return -ENOENT;
>>> +
>>> + if (error_id < node->error_counter_range.first ||
>>> + error_id > node->error_counter_range.last)
>>> + return -EINVAL;
>>> +
>>> + return node->clear_error_counter(node, error_id);
>>> +}
>>> +
>>> /**
>>> * drm_ras_node_register() - Register a new RAS node
>>> * @node: Node structure to register
>>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>>> index 16803d0c4a44..dea1c1b2494e 100644
>>> --- a/drivers/gpu/drm/drm_ras_nl.c
>>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>>> @@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
>>> [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>> };
>>> +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
>>> +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>>> +};
>>> +
>>> /* Ops table for drm_ras */
>>> static const struct genl_split_ops drm_ras_nl_ops[] = {
>>> {
>>> @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
>>> .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>>> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>>> },
>>> + {
>>> + .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>> + .doit = drm_ras_nl_clear_error_counter_doit,
>>> + .policy = drm_ras_clear_error_counter_nl_policy,
>>> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>>> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>>> + },
>>> };
>>> struct genl_family drm_ras_nl_family __ro_after_init = {
>>> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
>>> index 06ccd9342773..a398643572a5 100644
>>> --- a/drivers/gpu/drm/drm_ras_nl.h
>>> +++ b/drivers/gpu/drm/drm_ras_nl.h
>>> @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>>> struct genl_info *info);
>>> int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
>>> struct netlink_callback *cb);
>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>> + struct genl_info *info);
>>> extern struct genl_family drm_ras_nl_family;
>>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>>> index 5d50209e51db..f2a787bc4f64 100644
>>> --- a/include/drm/drm_ras.h
>>> +++ b/include/drm/drm_ras.h
>>> @@ -58,6 +58,17 @@ struct drm_ras_node {
>>> int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
>>> const char **name, u32 *val);
>>> + /**
>>> + * @clear_error_counter:
>>> + *
>>> + * This callback is used by drm_ras to clear a specific error counter.
>>> + * Driver should implement this callback to support clearing error counters
>>> + * of a node.
>>> + *
>>> + * Returns: 0 on success, negative error code on failure.
>>> + */
>>> + int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
>>> +
>>> /** @priv: Driver private data */
>>> void *priv;
>>> };
>>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>>> index 5f40fa5b869d..218a3ee86805 100644
>>> --- a/include/uapi/drm/drm_ras.h
>>> +++ b/include/uapi/drm/drm_ras.h
>>> @@ -41,6 +41,7 @@ enum {
>>> enum {
>>> DRM_RAS_CMD_LIST_NODES = 1,
>>> DRM_RAS_CMD_GET_ERROR_COUNTER,
>>> + DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>> __DRM_RAS_CMD_MAX,
>>> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
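
The uapi hunk above adds the new command just before the `__DRM_RAS_CMD_MAX` sentinel, so existing command numbers stay unchanged and `DRM_RAS_CMD_MAX` follows automatically. A minimal sketch of that enum idiom, using hypothetical `DEMO_CMD_*` names rather than the real header (`DEMO_CMD_SENTINEL` plays the role of `__DRM_RAS_CMD_MAX`):

```c
#include <assert.h>

/* Hypothetical command enum using the same sentinel idiom as the uapi header. */
enum demo_cmd {
	DEMO_CMD_LIST_NODES = 1,
	DEMO_CMD_GET_ERROR_COUNTER,
	DEMO_CMD_CLEAR_ERROR_COUNTER,	/* new entries go just above the sentinel */

	DEMO_CMD_SENTINEL,		/* one past the last real command */
	DEMO_CMD_MAX = (DEMO_CMD_SENTINEL - 1)
};

/* Returns 1 if cmd is a defined command number, 0 otherwise. */
static int demo_cmd_is_valid(int cmd)
{
	return cmd >= DEMO_CMD_LIST_NODES && cmd <= DEMO_CMD_MAX;
}
```

Because only the sentinel moves, values already exposed to userspace keep their numbers, which is the property a uapi enum needs.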
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 23:01 ` Zack McKevitt
@ 2026-04-10 5:25 ` Tauro, Riana
0 siblings, 0 replies; 12+ messages in thread
From: Tauro, Riana @ 2026-04-10 5:25 UTC (permalink / raw)
To: Zack McKevitt, intel-xe, dri-devel, netdev, rodrigo.vivi,
joonas.lahtinen, aravind.iddamsetty
Cc: anshuman.gupta, simona.vetter, airlied, pratik.bari,
joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
ravi.kishore.koppuravuri, raag.jadav, anvesh.bakwad,
maarten.lankhorst, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
On 4/10/2026 4:31 AM, Zack McKevitt wrote:
>
> On 4/9/2026 1:21 AM, Tauro, Riana wrote:
>> Hi Zack
>>
>> Could you please take a look at this patch, if it is applicable to your
>> use case? Please let me know if any changes are required.
>>
>
> From a quick glance, I think this looks good from our end.
Thank you Zack for taking a look.
>
> Thanks,
> Zack
>
>> @Rodrigo This is already reviewed by Jakub and Raag.
>> If there are no open issues, can this be merged via drm-misc?
>>
>> Thanks
>> Riana
>>
>> On 4/9/2026 1:03 PM, Riana Tauro wrote:
>>> Introduce a new 'clear-error-counter' drm_ras command to reset the counter
>>> value for a specific error counter of a given node.
>>>
>>> The command is a 'do' netlink request with 'node-id' and 'error-id'
>>> as parameters with no response payload.
>>>
>>> Usage:
>>>
>>> $ sudo ynl --family drm_ras --do clear-error-counter --json \
>>> '{"node-id":1, "error-id":1}'
>>> None
>>>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>> Cc: David S. Miller <davem@davemloft.net>
>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>>> Reviewed-by: Raag Jadav <raag.jadav@intel.com>
>>> ---
>>> Documentation/gpu/drm-ras.rst | 8 +++++
>>> Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
>>> drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
>>> drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
>>> drivers/gpu/drm/drm_ras_nl.h | 2 ++
>>> include/drm/drm_ras.h | 11 ++++++
>>> include/uapi/drm/drm_ras.h | 1 +
>>> 7 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>>> index 70b246a78fc8..4636e68f5678 100644
>>> --- a/Documentation/gpu/drm-ras.rst
>>> +++ b/Documentation/gpu/drm-ras.rst
>>> @@ -52,6 +52,8 @@ User space tools can:
>>> as a parameter.
>>> * Query specific error counter values with the ``get-error-counter`` command, using both
>>> ``node-id`` and ``error-id`` as parameters.
>>> +* Clear specific error counters with the ``clear-error-counter`` command, using both
>>> + ``node-id`` and ``error-id`` as parameters.
>>> YAML-based Interface
>>> --------------------
>>> @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
>>> sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
>>> {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
>>> +Example: Clear an error counter for a given node
>>> +
>>> +.. code-block:: bash
>>> +
>>> + sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>>> + None
>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>>> index 79af25dac3c5..e113056f8c01 100644
>>> --- a/Documentation/netlink/specs/drm_ras.yaml
>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>> @@ -99,7 +99,7 @@ operations:
>>> flags: [admin-perm]
>>> do:
>>> request:
>>> - attributes:
>>> + attributes: &id-attrs
>>> - node-id
>>> - error-id
>>> reply:
>>> @@ -113,3 +113,14 @@ operations:
>>> - node-id
>>> reply:
>>> attributes: *errorinfo
>>> + -
>>> + name: clear-error-counter
>>> + doc: >-
>>> + Clear error counter for a given node.
>>> + The request includes the error-id and node-id of the
>>> + counter to be cleared.
>>> + attribute-set: error-counter-attrs
>>> + flags: [admin-perm]
>>> + do:
>>> + request:
>>> + attributes: *id-attrs
>>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>>> index b2fa5ab86d87..d6eab29a1394 100644
>>> --- a/drivers/gpu/drm/drm_ras.c
>>> +++ b/drivers/gpu/drm/drm_ras.c
>>> @@ -26,7 +26,7 @@
>>> * efficient lookup by ID. Nodes can be registered or unregistered
>>> * dynamically at runtime.
>>> *
>>> - * A Generic Netlink family `drm_ras` exposes two main operations to
>>> + * A Generic Netlink family `drm_ras` exposes the below operations to
>>> * userspace:
>>> *
>>> * 1. LIST_NODES: Dump all currently registered RAS nodes.
>>> @@ -37,6 +37,10 @@
>>> * Returns all counters of a node if only Node ID is provided or specific
>>> * error counters.
>>> *
>>> + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
>>> + * Userspace must provide Node ID, Error ID.
>>> + * Clears specific error counter of a node if supported.
>>> + *
>>> * Node registration:
>>> *
>>> * - drm_ras_node_register(): Registers a new node and assigns
>>> @@ -66,6 +70,8 @@
>>> * operation, fetching all counters from a specific node.
>>> * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
>>> * operation, fetching a counter value from a specific node.
>>> + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
>>> + * operation, clearing a counter value from a specific node.
>>> */
>>> static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>>> @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>>> return doit_reply_value(info, node_id, error_id);
>>> }
>>> +/**
>>> + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
>>> + * @skb: Netlink message buffer
>>> + * @info: Generic Netlink info containing attributes of the request
>>> + *
>>> + * Extracts the node ID and error ID from the netlink attributes and
>>> + * clears the current value.
>>> + *
>>> + * Return: 0 on success, or negative errno on failure.
>>> + */
>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>> + struct genl_info *info)
>>> +{
>>> + struct drm_ras_node *node;
>>> + u32 node_id, error_id;
>>> +
>>> + if (!info->attrs ||
>>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
>>> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
>>> + return -EINVAL;
>>> +
>>> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>>> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>>> +
>>> + node = xa_load(&drm_ras_xa, node_id);
>>> + if (!node || !node->clear_error_counter)
>>> + return -ENOENT;
>>> +
>>> + if (error_id < node->error_counter_range.first ||
>>> + error_id > node->error_counter_range.last)
>>> + return -EINVAL;
>>> +
>>> + return node->clear_error_counter(node, error_id);
>>> +}
>>> +
>>> /**
>>> * drm_ras_node_register() - Register a new RAS node
>>> * @node: Node structure to register
>>> diff --git a/drivers/gpu/drm/drm_ras_nl.c
>>> b/drivers/gpu/drm/drm_ras_nl.c
>>> index 16803d0c4a44..dea1c1b2494e 100644
>>> --- a/drivers/gpu/drm/drm_ras_nl.c
>>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>>> @@ -22,6 +22,12 @@ static const struct nla_policy
>>> drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
>>> [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>> };
>>> +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
>>> +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>>> +};
>>> +
>>> /* Ops table for drm_ras */
>>> static const struct genl_split_ops drm_ras_nl_ops[] = {
>>> {
>>> @@ -43,6 +49,13 @@ static const struct genl_split_ops
>>> drm_ras_nl_ops[] = {
>>> .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>>> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>>> },
>>> + {
>>> + .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>> + .doit = drm_ras_nl_clear_error_counter_doit,
>>> + .policy = drm_ras_clear_error_counter_nl_policy,
>>> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>>> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>>> + },
>>> };
>>> struct genl_family drm_ras_nl_family __ro_after_init = {
>>> diff --git a/drivers/gpu/drm/drm_ras_nl.h
>>> b/drivers/gpu/drm/drm_ras_nl.h
>>> index 06ccd9342773..a398643572a5 100644
>>> --- a/drivers/gpu/drm/drm_ras_nl.h
>>> +++ b/drivers/gpu/drm/drm_ras_nl.h
>>> @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct
>>> sk_buff *skb,
>>> struct genl_info *info);
>>> int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
>>> struct netlink_callback *cb);
>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>> + struct genl_info *info);
>>> extern struct genl_family drm_ras_nl_family;
>>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>>> index 5d50209e51db..f2a787bc4f64 100644
>>> --- a/include/drm/drm_ras.h
>>> +++ b/include/drm/drm_ras.h
>>> @@ -58,6 +58,17 @@ struct drm_ras_node {
>>> int (*query_error_counter)(struct drm_ras_node *node, u32
>>> error_id,
>>> const char **name, u32 *val);
>>> + /**
>>> + * @clear_error_counter:
>>> + *
>>> + * This callback is used by drm_ras to clear a specific error
>>> counter.
>>> + * Driver should implement this callback to support clearing
>>> error counters
>>> + * of a node.
>>> + *
>>> + * Returns: 0 on success, negative error code on failure.
>>> + */
>>> + int (*clear_error_counter)(struct drm_ras_node *node, u32
>>> error_id);
>>> +
>>> /** @priv: Driver private data */
>>> void *priv;
>>> };
>>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>>> index 5f40fa5b869d..218a3ee86805 100644
>>> --- a/include/uapi/drm/drm_ras.h
>>> +++ b/include/uapi/drm/drm_ras.h
>>> @@ -41,6 +41,7 @@ enum {
>>> enum {
>>> DRM_RAS_CMD_LIST_NODES = 1,
>>> DRM_RAS_CMD_GET_ERROR_COUNTER,
>>> + DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>> __DRM_RAS_CMD_MAX,
>>> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Claude review: Add clear-error-counter command to drm_ras
2026-04-09 7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
2026-04-09 7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
@ 2026-04-12 1:34 ` Claude Code Review Bot
2 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-04-12 1:34 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: Add clear-error-counter command to drm_ras
Author: Riana Tauro <riana.tauro@intel.com>
Patches: 8
Reviewed: 2026-04-12T11:34:39.897653
---
This is a clean, well-structured 2-patch series that adds a `clear-error-counter` netlink command to the DRM RAS subsystem and implements it in the XE driver. The new command follows the established patterns from the existing `get-error-counter` command closely, making it easy to review. The YAML spec, UAPI, core framework, and driver callback are all properly wired up.
The series has already received Reviewed-by from Jakub Kicinski (patch 1, netlink aspects) and Raag Jadav (both patches). I have one functional concern about the error code when a driver doesn't implement the callback, and a few minor observations.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* Claude review: drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
2026-04-09 7:21 ` Tauro, Riana
@ 2026-04-12 1:34 ` Claude Code Review Bot
1 sibling, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-04-12 1:34 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Overall**: Solid implementation that mirrors the existing `get-error-counter` doit pattern. The YAML anchor usage to avoid duplicating the attribute list is a nice touch.
**Error code for missing callback**: In `drm_ras_nl_clear_error_counter_doit()`:
```c
node = xa_load(&drm_ras_xa, node_id);
if (!node || !node->clear_error_counter)
return -ENOENT;
```
When the node exists but the driver hasn't implemented `clear_error_counter`, returning `-ENOENT` is misleading, because the node *was* found. Consider returning `-EOPNOTSUPP` for the missing-callback case, and `-ENOENT` only for the node-not-found case. (`-ENOTSUPP` is a kernel-internal code and should not be returned to userspace.) This would give userspace a meaningful distinction:
```c
node = xa_load(&drm_ras_xa, node_id);
if (!node)
return -ENOENT;
if (!node->clear_error_counter)
return -EOPNOTSUPP;
```
This is consistent with how other kernel subsystems signal "operation not supported by this driver." The existing `get_node_error_counter()` has the same conflation of the two cases (`!node || !node->query_error_counter`), but since `query_error_counter` is effectively mandatory for any registered node, it's less of a practical issue there. For `clear_error_counter`, which is explicitly optional, the distinction matters more.
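To make the three outcomes concrete, here is a minimal userspace sketch of the check order with the split error codes. It is a model under stated assumptions, not the kernel code: `struct ras_node_model`, `clear_counter_model()` and `clear_ok()` are hypothetical names, and the `first`/`last` fields stand in for `error_counter_range.first`/`.last`.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node; only the fields the checks need. */
struct ras_node_model {
	uint32_t first, last;	/* models error_counter_range.first/.last */
	int (*clear_error_counter)(struct ras_node_model *node, uint32_t error_id);
};

/* Models the checks in the doit handler, with the split error codes. */
static int clear_counter_model(struct ras_node_model *node, uint32_t error_id)
{
	if (!node)
		return -ENOENT;		/* no such node */
	if (!node->clear_error_counter)
		return -EOPNOTSUPP;	/* node exists, clearing not implemented */
	if (error_id < node->first || error_id > node->last)
		return -EINVAL;		/* error ID outside the node's range */
	return node->clear_error_counter(node, error_id);
}

/* Trivial callback used to exercise the success path. */
static int clear_ok(struct ras_node_model *node, uint32_t error_id)
{
	(void)node;
	(void)error_id;
	return 0;
}
```

With this ordering, a client that sees `-EOPNOTSUPP` knows the node ID was valid and can stop retrying, while `-ENOENT` still means the node ID itself was wrong.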
**Pre-existing leak in `doit_reply_value()`** (not introduced by this patch, just noting): At line ~202-205 of the existing code, if `get_node_error_counter()` fails, `msg` is leaked:
```c
msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
...
ret = get_node_error_counter(node_id, error_id, &error_name, &value);
if (ret)
return ret; /* msg leaked */
```
Not blocking for this series, but worth a separate fix.
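One possible shape for that separate fix is the usual single-exit cleanup label. The sketch below is a userspace model, not the actual `doit_reply_value()` body: `msg_new()`, `msg_free()`, `lookup_counter()` and `reply_value_model()` are hypothetical stand-ins, and `live_msgs` exists only so the demo can show that nothing is left allocated. In the kernel, the release on the error path would presumably be `nlmsg_free(msg)`.

```c
#include <errno.h>
#include <stdlib.h>

static int live_msgs;	/* outstanding allocations, for the demo only */

/* Hypothetical stand-in for the message allocation. */
static void *msg_new(void)
{
	void *msg = malloc(64);

	if (msg)
		live_msgs++;
	return msg;
}

/* Hypothetical stand-in for releasing the message. */
static void msg_free(void *msg)
{
	if (msg)
		live_msgs--;
	free(msg);
}

/* Stand-in for the counter lookup; fails for node_id 0. */
static int lookup_counter(unsigned int node_id, unsigned int *value)
{
	if (node_id == 0)
		return -ENOENT;
	*value = 42;
	return 0;
}

static int reply_value_model(unsigned int node_id)
{
	unsigned int value;
	void *msg;
	int ret;

	msg = msg_new();
	if (!msg)
		return -ENOMEM;

	ret = lookup_counter(node_id, &value);
	if (ret)
		goto err_free;	/* release msg instead of returning directly */

	/* A real handler would fill in and send the reply here, handing off ownership. */
	msg_free(msg);
	return 0;

err_free:
	msg_free(msg);
	return ret;
}
```

An alternative with the same effect would be to do the lookup before allocating the message, so the early return has nothing to free.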
**Everything else looks correct**:
- The YAML spec properly uses an anchor `&id-attrs` on the `get-error-counter` request attributes and references it with `*id-attrs` in `clear-error-counter`. Clean.
- The NLA policy correctly requires both `NODE_ID` and `ERROR_ID` as `NLA_U32`.
- The genl_split_ops entry correctly uses `GENL_ADMIN_PERM | GENL_CMD_CAP_DO` — admin-only, doit-only (no dump), which is appropriate for a destructive/write operation.
- The `clear_error_counter` callback in `struct drm_ras_node` is well-documented and has the right signature.
- The range check mirrors `get_node_error_counter()` exactly, which is good.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* Claude review: drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras
2026-04-09 7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
@ 2026-04-12 1:34 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-04-12 1:34 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Overall**: Straightforward driver-side implementation that closely follows the existing query callback pattern. No issues.
**`hw_clear_error_counter()`** correctly mirrors `hw_query_error_counter()`:
```c
static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
{
if (!info || !info[error_id].name)
return -ENOENT;
atomic_set(&info[error_id].counter, 0);
return 0;
}
```
The bounds safety is guaranteed by the core framework's range check (against `error_counter_range.first`/`.last`) before the callback is invoked, so the array access `info[error_id]` is safe. The `!info[error_id].name` check handles non-contiguous IDs gracefully.
**`atomic_set` for clearing** is appropriate: it matches the `atomic_read` in the query path and any `atomic_inc` in the error recording path. There is an inherent race between clearing and the next read (a new error could arrive in between), but that's fundamental to the operation and not a bug.
**Code structure observation**: The `clear_uncorrectable_error_counter()` and `clear_correctable_error_counter()` functions are near-identical, differing only in the severity index used to select `ras->info[severity]`. The same duplication exists in the query callbacks. This is fine for two severities, but if more are added, a single parameterized function with a severity argument stored in the node's `priv` (or a wrapper struct) would be cleaner. Not a blocking concern for this series.
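As a rough illustration of the parameterized alternative, the userspace sketch below keeps the severity in a small wrapper that a node's `priv` pointer could reference, so a single clear routine serves every severity. All names here (`struct node_priv_model`, `clear_counter_by_priv()`, `NUM_ERRORS`, the `SEV_*` values) are hypothetical, and C11 `atomic_store()` stands in for the kernel's `atomic_set()`. The explicit bound check is only there because this model has no core-framework range check in front of it.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

enum { SEV_CORRECTABLE, SEV_UNCORRECTABLE, SEV_MAX };	/* hypothetical severities */
#define NUM_ERRORS 4					/* hypothetical table size */

struct counter_model {
	const char *name;	/* NULL marks an unused error ID */
	atomic_int counter;
};

struct ras_model {
	struct counter_model info[SEV_MAX][NUM_ERRORS];
};

/* Hypothetical wrapper that a node's priv pointer could reference. */
struct node_priv_model {
	struct ras_model *ras;
	int severity;
};

/* One clear routine for every severity; the severity comes from priv. */
static int clear_counter_by_priv(void *priv, unsigned int error_id)
{
	struct node_priv_model *p = priv;
	struct counter_model *info;

	if (!p || !p->ras || error_id >= NUM_ERRORS)
		return -EINVAL;

	info = p->ras->info[p->severity];
	if (!info[error_id].name)
		return -ENOENT;

	atomic_store(&info[error_id].counter, 0);
	return 0;
}
```

The query side could take the same wrapper, which would leave one query and one clear callback regardless of how many severities exist.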
**`assign_node_params()` modification** is clean — adding braces around the existing if/else and inserting the callback assignment alongside the query callback:
```c
if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
node->query_error_counter = query_correctable_error_counter;
node->clear_error_counter = clear_correctable_error_counter;
} else {
node->query_error_counter = query_uncorrectable_error_counter;
node->clear_error_counter = clear_uncorrectable_error_counter;
}
```
No issues with this patch.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-04-12 1:34 UTC | newest]
Thread overview: 12+ messages
2026-04-09 7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
2026-04-09 7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
2026-04-09 7:21 ` Tauro, Riana
2026-04-09 13:37 ` Rodrigo Vivi
2026-04-10 5:21 ` Tauro, Riana
2026-04-09 23:01 ` Zack McKevitt
2026-04-10 5:25 ` Tauro, Riana
2026-04-12 1:34 ` Claude review: " Claude Code Review Bot
2026-04-09 7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
2026-04-12 1:34 ` Claude review: " Claude Code Review Bot
2026-04-12 1:34 ` Claude review: Add clear-error-counter command to drm_ras Claude Code Review Bot
-- strict thread matches above, loose matches on Subject: below --
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
2026-03-11 21:06 ` Claude review: " Claude Code Review Bot