* [PATCH v3 1/4] drm/ras: Cancel and free message on get counter failure
2026-06-02 4:48 [PATCH v3 0/4] DRM RAS Fixes Raag Jadav
@ 2026-06-02 4:48 ` Raag Jadav
2026-06-04 3:22 ` Claude review: " Claude Code Review Bot
2026-06-02 4:48 ` [PATCH v3 2/4] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
` (3 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2026-06-02 4:48 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, dev, airlied, simona, kuba, Raag Jadav
doit_reply_value() directly returns on get counter failure, which results
in stale sk_buff and genetlink header that aren't cleaned up. Fix it and
while at it, consolidate error handling using goto.
Fixes: c36218dc49f5 ("drm/ras: Introduce the DRM RAS infrastructure over generic netlink")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Use goto (Riana)
---
drivers/gpu/drm/drm_ras.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..467a169026fc 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -201,25 +201,28 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
hdr = genlmsg_iput(msg, info);
if (!hdr) {
- nlmsg_free(msg);
- return -EMSGSIZE;
+ ret = -EMSGSIZE;
+ goto free_msg;
}
ret = get_node_error_counter(node_id, error_id,
&error_name, &value);
if (ret)
- return ret;
+ goto cancel_msg;
ret = msg_reply_value(msg, error_id, error_name, value);
- if (ret) {
- genlmsg_cancel(msg, hdr);
- nlmsg_free(msg);
- return ret;
- }
+ if (ret)
+ goto cancel_msg;
genlmsg_end(msg, hdr);
return genlmsg_reply(msg, info);
+
+cancel_msg:
+ genlmsg_cancel(msg, hdr);
+free_msg:
+ nlmsg_free(msg);
+ return ret;
}
/**
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread* Claude review: drm/ras: Cancel and free message on get counter failure
2026-06-02 4:48 ` [PATCH v3 1/4] drm/ras: Cancel and free message on get counter failure Raag Jadav
@ 2026-06-04 3:22 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 3:22 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Good fix, correct.**
The original code had two issues in `doit_reply_value()`:
1. When `get_node_error_counter()` failed, `msg` (the sk_buff) and `hdr` (the genetlink header) were leaked.
2. The existing error handling for `msg_reply_value()` failure had its own cleanup copy-paste.
The goto-based consolidation is clean and handles the `hdr == NULL` case correctly — when `genlmsg_iput()` fails, the jump goes to `free_msg` (skipping `genlmsg_cancel`), which is right since there's no valid header to cancel:
```c
cancel_msg:
genlmsg_cancel(msg, hdr);
free_msg:
nlmsg_free(msg);
return ret;
```
No issues.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 2/4] drm/xe/drm_ras: Make counter allocation drm managed
2026-06-02 4:48 [PATCH v3 0/4] DRM RAS Fixes Raag Jadav
2026-06-02 4:48 ` [PATCH v3 1/4] drm/ras: Cancel and free message on get counter failure Raag Jadav
@ 2026-06-02 4:48 ` Raag Jadav
2026-06-04 3:22 ` Claude review: " Claude Code Review Bot
2026-06-02 4:48 ` [PATCH v3 3/4] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
` (2 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2026-06-02 4:48 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, dev, airlied, simona, kuba, Raag Jadav
cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Fix this using drm managed allocation, which is
guaranteed to be cleaned up on unwind.
Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Retain info as NULL on failure (Riana)
---
drivers/gpu/drm/xe/xe_drm_ras.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c21c8b428de6..c1d5ac198a7c 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -80,7 +80,7 @@ static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *x
struct xe_drm_ras_counter *counter;
int i;
- counter = kcalloc(DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
+ counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
if (!counter)
return ERR_PTR(-ENOMEM);
@@ -135,7 +135,6 @@ static void cleanup_node_param(struct xe_drm_ras *ras, const enum drm_xe_ras_err
{
struct drm_ras_node *node = &ras->node[severity];
- kfree(ras->info[severity]);
ras->info[severity] = NULL;
kfree(node->device_name);
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread* Claude review: drm/xe/drm_ras: Make counter allocation drm managed
2026-06-02 4:48 ` [PATCH v3 2/4] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
@ 2026-06-04 3:22 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 3:22 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Correct.**
Switching from `kcalloc` to `drmm_kcalloc` for the counter array ensures the counter memory is freed during DRM device teardown even if `cleanup_node_param()` is never called for a particular severity level. This fixes the leak where a failure on node N left node N-1's counter memory orphaned (since `cleanup_node_param` was only invoked for the failing node, not previously-successful ones).
The corresponding removal of `kfree(ras->info[severity])` from `cleanup_node_param` is correct — double-freeing drmm-managed memory would corrupt the allocator. The `ras->info[severity] = NULL` is retained, which prevents stale pointer dereferences.
No issues.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 3/4] drm/xe/drm_ras: Add per node cleanup action
2026-06-02 4:48 [PATCH v3 0/4] DRM RAS Fixes Raag Jadav
2026-06-02 4:48 ` [PATCH v3 1/4] drm/ras: Cancel and free message on get counter failure Raag Jadav
2026-06-02 4:48 ` [PATCH v3 2/4] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
@ 2026-06-02 4:48 ` Raag Jadav
2026-06-04 3:22 ` Claude review: " Claude Code Review Bot
2026-06-02 4:48 ` [PATCH v3 4/4] drm/xe/hw_error: Use HW_ERR prefix in log Raag Jadav
2026-06-04 3:21 ` Claude review: DRM RAS Fixes Claude Code Review Bot
4 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2026-06-02 4:48 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, dev, airlied, simona, kuba, Raag Jadav
cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Add per node cleanup action which guarantees
cleanup on unwind and also simplifies the cleanup logic.
Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v3: Prevent double free on drmm_add_action_or_reset() failure (Riana)
---
drivers/gpu/drm/xe/xe_drm_ras.c | 58 +++++++++++++--------------------
1 file changed, 23 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c1d5ac198a7c..cd236f53699e 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -131,53 +131,47 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
return 0;
}
-static void cleanup_node_param(struct xe_drm_ras *ras, const enum drm_xe_ras_error_severity severity)
+static void cleanup_node_param(struct drm_ras_node *node)
{
- struct drm_ras_node *node = &ras->node[severity];
-
- ras->info[severity] = NULL;
-
kfree(node->device_name);
node->device_name = NULL;
}
+static void cleanup_node(struct drm_device *drm, void *node)
+{
+ drm_ras_node_unregister(node);
+ cleanup_node_param(node);
+}
+
static int register_nodes(struct xe_device *xe)
{
struct xe_drm_ras *ras = &xe->ras;
- int i;
+ struct drm_ras_node *node;
+ int i, ret;
for_each_error_severity(i) {
- struct drm_ras_node *node = &ras->node[i];
- int ret;
+ node = &ras->node[i];
ret = assign_node_params(xe, node, i);
- if (ret) {
- cleanup_node_param(ras, i);
- return ret;
- }
+ if (ret)
+ goto free_param;
ret = drm_ras_node_register(node);
- if (ret) {
- cleanup_node_param(ras, i);
- return ret;
- }
+ if (ret)
+ goto free_param;
+
+ ret = drmm_add_action_or_reset(&xe->drm, cleanup_node, node);
+ if (ret)
+ goto null_info;
}
return 0;
-}
-static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
-{
- struct xe_device *xe = arg;
- struct xe_drm_ras *ras = &xe->ras;
- int i;
-
- for_each_error_severity(i) {
- struct drm_ras_node *node = &ras->node[i];
-
- drm_ras_node_unregister(node);
- cleanup_node_param(ras, i);
- }
+free_param:
+ cleanup_node_param(node);
+null_info:
+ ras->info[i] = NULL;
+ return ret;
}
/**
@@ -206,11 +200,5 @@ int xe_drm_ras_init(struct xe_device *xe)
return err;
}
- err = drmm_add_action_or_reset(&xe->drm, xe_drm_ras_unregister_nodes, xe);
- if (err) {
- drm_err(&xe->drm, "Failed to add action for Xe DRM RAS (%pe)\n", ERR_PTR(err));
- return err;
- }
-
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread* Claude review: drm/xe/drm_ras: Add per node cleanup action
2026-06-02 4:48 ` [PATCH v3 3/4] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
@ 2026-06-04 3:22 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 3:22 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Correct, with one observation.**
This is the most substantial change. It replaces the monolithic `xe_drm_ras_unregister_nodes()` cleanup (registered once for all nodes) with per-node `drmm_add_action_or_reset()` calls, guaranteeing that each successfully-initialized node is cleaned up individually during unwind.
The error paths in `register_nodes()` are carefully structured:
```c
ret = drmm_add_action_or_reset(&xe->drm, cleanup_node, node);
if (ret)
goto null_info;
```
Jumping to `null_info` (not `free_param`) on `drmm_add_action_or_reset` failure is critical. Since the `_or_reset` variant calls `cleanup_node()` itself on failure (which frees `device_name` and unregisters the node), jumping to `free_param` would double-free `device_name`. The v3 changelog confirms this was caught by a reviewer — good catch.
**Observation:** After patch 3, `cleanup_node_param()` no longer NULLs out `ras->info[severity]`. During normal device teardown, the drmm actions run `cleanup_node()` → `cleanup_node_param()`, which frees `device_name` but leaves `ras->info[severity]` as a dangling pointer (the counter memory is freed separately by its own drmm tracking). This is safe because the xe_device is being torn down, but it's a departure from the defensive NULL-after-free pattern. Worth noting but not blocking — the init-time error path in `register_nodes()` does NULL it out via `null_info`, which is the path where it matters.
**Teardown ordering is correct:** drmm releases in reverse registration order, so per-node `cleanup_node` actions run before the node array itself is freed (`drmm_kcalloc` in `xe_drm_ras_init`), ensuring the nodes are still valid when `drm_ras_node_unregister` and `cleanup_node_param` access them.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 4/4] drm/xe/hw_error: Use HW_ERR prefix in log
2026-06-02 4:48 [PATCH v3 0/4] DRM RAS Fixes Raag Jadav
` (2 preceding siblings ...)
2026-06-02 4:48 ` [PATCH v3 3/4] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
@ 2026-06-02 4:48 ` Raag Jadav
2026-06-04 3:22 ` Claude review: " Claude Code Review Bot
2026-06-04 3:21 ` Claude review: DRM RAS Fixes Claude Code Review Bot
4 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2026-06-02 4:48 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, dev, airlied, simona, kuba, Raag Jadav
Hardware errors should be logged with HW_ERR prefix. Make them
consistent with existing logs.
Fixes: 01aab7e1c9d4 ("drm/xe/xe_hw_error: Add support for PVC SoC errors")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/xe_hw_error.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 5135e8e4093f..4b72959b2276 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -223,9 +223,9 @@ static void log_hw_error(struct xe_tile *tile, const char *name,
struct xe_device *xe = tile_to_xe(tile);
if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
- drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
+ drm_warn(&xe->drm, HW_ERR "%s %s detected\n", name, severity_str);
else
- drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
+ drm_err_ratelimited(&xe->drm, HW_ERR "%s %s detected\n", name, severity_str);
}
static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
@@ -235,10 +235,10 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
struct xe_device *xe = tile_to_xe(tile);
if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
- drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+ drm_warn(&xe->drm, HW_ERR "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
name, severity_str, i, err);
else
- drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+ drm_err_ratelimited(&xe->drm, HW_ERR "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
name, severity_str, i, err);
}
@@ -255,9 +255,9 @@ static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
if (strcmp(name, "Undefined")) {
if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
- drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
+ drm_warn(&xe->drm, HW_ERR "%s SOC %s detected", name, severity_str);
else
- drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
+ drm_err_ratelimited(&xe->drm, HW_ERR "%s SOC %s detected", name, severity_str);
atomic_inc(&info[index].counter);
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread* Claude review: drm/xe/hw_error: Use HW_ERR prefix in log
2026-06-02 4:48 ` [PATCH v3 4/4] drm/xe/hw_error: Use HW_ERR prefix in log Raag Jadav
@ 2026-06-04 3:22 ` Claude Code Review Bot
0 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 3:22 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Fine, minor nit.**
Adds the standard kernel `HW_ERR` (`"[Hardware Error]: "`) prefix to the `log_hw_error`, `log_gt_err`, and `log_soc_error` functions. The file already uses `HW_ERR` extensively elsewhere (e.g., lines 199, 207, 443, 475, 479 in the patched file), so this makes these functions consistent.
**Minor nit:** The `Fixes:` tag references `01aab7e1c9d4 ("drm/xe/xe_hw_error: Add support for PVC SoC errors")`. Using a Fixes tag implies stable-kernel backport eligibility, which seems heavy for a log-format consistency change. This is ultimately a maintainer judgment call — if there's tooling that relies on `HW_ERR` for parsing or alerting, the Fixes tag is justified.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread
* Claude review: DRM RAS Fixes
2026-06-02 4:48 [PATCH v3 0/4] DRM RAS Fixes Raag Jadav
` (3 preceding siblings ...)
2026-06-02 4:48 ` [PATCH v3 4/4] drm/xe/hw_error: Use HW_ERR prefix in log Raag Jadav
@ 2026-06-04 3:21 ` Claude Code Review Bot
4 siblings, 0 replies; 12+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 3:21 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: DRM RAS Fixes
Author: Raag Jadav <raag.jadav@intel.com>
Patches: 5
Reviewed: 2026-06-04T13:21:59.780615
---
This is a 4-patch bugfix series for the DRM RAS (Reliability, Availability, Serviceability) subsystem in the Xe driver. The first three patches fix resource leaks and cleanup ordering issues, and the fourth is a cosmetic log-prefix fix.
**Patches 1-3** address real resource leaks. The core problem is that error paths in the RAS initialization left stale sk_buffs, counter memory, and `device_name` strings uncleaned. The approach — consolidating error handling with gotos (patch 1), switching to drm-managed counter allocation (patch 2), and registering per-node cleanup actions (patch 3) — is idiomatic for DRM drivers and correct. The series is bisectable: each patch independently fixes a distinct leak without regressing others.
**Patch 4** is a straightforward cosmetic fix using the standard kernel `HW_ERR` prefix.
The series is in good shape. I have one correctness concern in patch 3 and a few minor observations.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 12+ messages in thread