* [PATCH v1 0/3] DRM RAS Fixes
@ 2026-05-14 20:28 Raag Jadav
2026-05-14 20:28 ` [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure Raag Jadav
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Raag Jadav @ 2026-05-14 20:28 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, maarten, airlied, simona, kuba,
Raag Jadav
Detailed description in commit message.
Raag Jadav (3):
drm/ras: Cancel and free message on get counter failure
drm/xe/drm_ras: Make counter allocation drm managed
drm/xe/drm_ras: Add per node cleanup action
drivers/gpu/drm/drm_ras.c | 5 +++-
drivers/gpu/drm/xe/xe_drm_ras.c | 47 +++++++++++++--------------------
2 files changed, 22 insertions(+), 30 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure
2026-05-14 20:28 [PATCH v1 0/3] DRM RAS Fixes Raag Jadav
@ 2026-05-14 20:28 ` Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-14 20:28 ` [PATCH v1 2/3] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Raag Jadav @ 2026-05-14 20:28 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, maarten, airlied, simona, kuba,
Raag Jadav
doit_reply_value() directly returns on get counter failure, which results
in stale sk_buff and genetlink header that aren't cleaned up. Fix it.
Fixes: c36218dc49f5 ("drm/ras: Introduce the DRM RAS infrastructure over generic netlink")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
drivers/gpu/drm/drm_ras.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..262a75f0c493 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -207,8 +207,11 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
ret = get_node_error_counter(node_id, error_id,
&error_name, &value);
- if (ret)
+ if (ret) {
+ genlmsg_cancel(msg, hdr);
+ nlmsg_free(msg);
return ret;
+ }
ret = msg_reply_value(msg, error_id, error_name, value);
if (ret) {
--
2.43.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v1 2/3] drm/xe/drm_ras: Make counter allocation drm managed
2026-05-14 20:28 [PATCH v1 0/3] DRM RAS Fixes Raag Jadav
2026-05-14 20:28 ` [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure Raag Jadav
@ 2026-05-14 20:28 ` Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-14 20:28 ` [PATCH v1 3/3] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
2026-05-16 0:24 ` Claude review: DRM RAS Fixes Claude Code Review Bot
3 siblings, 1 reply; 8+ messages in thread
From: Raag Jadav @ 2026-05-14 20:28 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, maarten, airlied, simona, kuba,
Raag Jadav
cleanup_node_param() is not registered in case of counter allocation
failure, which results in stale memory of previous node that isn't
cleaned up on unwind. Fix this using drm managed allocation, which is
guaranteed to be cleaned up on unwind.
Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
drivers/gpu/drm/xe/xe_drm_ras.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c21c8b428de6..89640ffb1c33 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -80,7 +80,7 @@ static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *x
struct xe_drm_ras_counter *counter;
int i;
- counter = kcalloc(DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
+ counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
if (!counter)
return ERR_PTR(-ENOMEM);
@@ -135,9 +135,6 @@ static void cleanup_node_param(struct xe_drm_ras *ras, const enum drm_xe_ras_err
{
struct drm_ras_node *node = &ras->node[severity];
- kfree(ras->info[severity]);
- ras->info[severity] = NULL;
-
kfree(node->device_name);
node->device_name = NULL;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v1 3/3] drm/xe/drm_ras: Add per node cleanup action
2026-05-14 20:28 [PATCH v1 0/3] DRM RAS Fixes Raag Jadav
2026-05-14 20:28 ` [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure Raag Jadav
2026-05-14 20:28 ` [PATCH v1 2/3] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
@ 2026-05-14 20:28 ` Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-16 0:24 ` Claude review: DRM RAS Fixes Claude Code Review Bot
3 siblings, 1 reply; 8+ messages in thread
From: Raag Jadav @ 2026-05-14 20:28 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: rodrigo.vivi, riana.tauro, maarten, airlied, simona, kuba,
Raag Jadav
cleanup_node_param() is not registered in case of counter allocation
failure, which results in stale memory of previous node that isn't
cleaned up on unwind. Add per node cleanup action which guarantees
cleanup on unwind and also simplifies the cleanup logic.
Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
drivers/gpu/drm/xe/xe_drm_ras.c | 42 +++++++++++++--------------------
1 file changed, 17 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index 89640ffb1c33..40abde29a26f 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -131,14 +131,20 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
return 0;
}
-static void cleanup_node_param(struct xe_drm_ras *ras, const enum drm_xe_ras_error_severity severity)
+static void cleanup_node_param(struct drm_ras_node *node)
{
- struct drm_ras_node *node = &ras->node[severity];
-
kfree(node->device_name);
node->device_name = NULL;
}
+static void cleanup_node(struct drm_device *drm, void *arg)
+{
+ struct drm_ras_node *node = arg;
+
+ drm_ras_node_unregister(node);
+ cleanup_node_param(node);
+}
+
static int register_nodes(struct xe_device *xe)
{
struct xe_drm_ras *ras = &xe->ras;
@@ -150,13 +156,19 @@ static int register_nodes(struct xe_device *xe)
ret = assign_node_params(xe, node, i);
if (ret) {
- cleanup_node_param(ras, i);
+ cleanup_node_param(node);
return ret;
}
ret = drm_ras_node_register(node);
if (ret) {
- cleanup_node_param(ras, i);
+ cleanup_node_param(node);
+ return ret;
+ }
+
+ ret = drmm_add_action_or_reset(&xe->drm, cleanup_node, node);
+ if (ret) {
+ cleanup_node(&xe->drm, node);
return ret;
}
}
@@ -164,20 +176,6 @@ static int register_nodes(struct xe_device *xe)
return 0;
}
-static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
-{
- struct xe_device *xe = arg;
- struct xe_drm_ras *ras = &xe->ras;
- int i;
-
- for_each_error_severity(i) {
- struct drm_ras_node *node = &ras->node[i];
-
- drm_ras_node_unregister(node);
- cleanup_node_param(ras, i);
- }
-}
-
/**
* xe_drm_ras_init() - Initialize DRM RAS
* @xe: xe device instance
@@ -204,11 +202,5 @@ int xe_drm_ras_init(struct xe_device *xe)
return err;
}
- err = drmm_add_action_or_reset(&xe->drm, xe_drm_ras_unregister_nodes, xe);
- if (err) {
- drm_err(&xe->drm, "Failed to add action for Xe DRM RAS (%pe)\n", ERR_PTR(err));
- return err;
- }
-
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Claude review: DRM RAS Fixes
2026-05-14 20:28 [PATCH v1 0/3] DRM RAS Fixes Raag Jadav
` (2 preceding siblings ...)
2026-05-14 20:28 ` [PATCH v1 3/3] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
@ 2026-05-16 0:24 ` Claude Code Review Bot
3 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 0:24 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: DRM RAS Fixes
Author: Raag Jadav <raag.jadav@intel.com>
Patches: 4
Reviewed: 2026-05-16T10:24:55.790178
---
This is a 3-patch series fixing resource leak and cleanup issues in the DRM RAS (Reliability, Availability, Serviceability) infrastructure and the Xe driver's use of it. The fixes are real and the direction is correct: patch 1 plugs a netlink message leak, patch 2 switches to managed allocation, and patch 3 moves to per-node cleanup actions for robust unwind. However, **patch 3 has a double-cleanup bug** on the `drmm_add_action_or_reset` failure path that needs to be fixed before merging.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 8+ messages in thread
* Claude review: drm/ras: Cancel and free message on get counter failure
2026-05-14 20:28 ` [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure Raag Jadav
@ 2026-05-16 0:24 ` Claude Code Review Bot
0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 0:24 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Good fix, ready to merge.**
This correctly fixes a resource leak in `doit_reply_value()`. When `get_node_error_counter()` fails, the previously allocated `sk_buff` (`msg`) and the genetlink header (`hdr`) were leaked because the function returned early without cleanup.
```c
if (ret) {
genlmsg_cancel(msg, hdr);
nlmsg_free(msg);
return ret;
}
```
This exactly mirrors the existing error handling for the `msg_reply_value()` failure path immediately below (lines 217-221 in the patched file). Clean, minimal, and correct. The Fixes tag is appropriate.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 8+ messages in thread
* Claude review: drm/xe/drm_ras: Make counter allocation drm managed
2026-05-14 20:28 ` [PATCH v1 2/3] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
@ 2026-05-16 0:24 ` Claude Code Review Bot
0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 0:24 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Correct fix, minor observation.**
Switching from `kcalloc` to `drmm_kcalloc` ensures the counter array is freed automatically during drm device release, which eliminates the problem where `cleanup_node_param()` wasn't called if counter allocation itself failed in a subsequent iteration — the previous iteration's counter would be leaked.
```c
counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX, sizeof(*counter), GFP_KERNEL);
```
The corresponding removal of `kfree(ras->info[severity])` from `cleanup_node_param()` is correct since drmm now owns the lifetime:
```c
- kfree(ras->info[severity]);
- ras->info[severity] = NULL;
```
**Minor observation**: When `allocate_and_copy_counters()` fails, `ras->info[severity]` is left holding an `ERR_PTR` value (set in `assign_node_params()` line 120-121). While this isn't a functional bug since the init path returns an error and nothing will dereference it, it would be cleaner to set `ras->info[severity] = NULL` after the `IS_ERR` check in `assign_node_params()`. Not a blocking issue.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 8+ messages in thread
* Claude review: drm/xe/drm_ras: Add per node cleanup action
2026-05-14 20:28 ` [PATCH v1 3/3] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
@ 2026-05-16 0:24 ` Claude Code Review Bot
0 siblings, 0 replies; 8+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 0:24 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Verdict: Has a bug — double cleanup on `drmm_add_action_or_reset` failure.**
The overall refactoring from a single bulk cleanup action (`xe_drm_ras_unregister_nodes`) to per-node cleanup actions is a sound improvement. It guarantees that each successfully registered node is individually cleaned up during unwind, which solves the problem described.
However, this code has a bug:
```c
ret = drmm_add_action_or_reset(&xe->drm, cleanup_node, node);
if (ret) {
cleanup_node(&xe->drm, node);
return ret;
}
```
The `_or_reset` suffix in `drmm_add_action_or_reset` means the action is **already called on failure** (see `include/drm/drm_managed.h` line 38-39: *"upon failure @action is directly called for any cleanup work"*). Explicitly calling `cleanup_node()` again results in:
1. Double call to `drm_ras_node_unregister(node)` — the second call operates on an already-unregistered node
2. Double `kfree(node->device_name)` — harmless because the first call NULLs the pointer, but still wrong
The fix is simple — just `return ret`:
```c
ret = drmm_add_action_or_reset(&xe->drm, cleanup_node, node);
if (ret)
return ret;
```
**Commit message nit**: The commit message repeats the same description as patch 2 verbatim ("cleanup_node_param() is not registered in case of counter allocation failure, which results in stale memory of previous node..."). Since patch 2 already fixed the counter allocation issue, patch 3's description should focus on what it actually does: moving to per-node drmm cleanup actions for robust unwind and simplification.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-05-16 0:24 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 20:28 [PATCH v1 0/3] DRM RAS Fixes Raag Jadav
2026-05-14 20:28 ` [PATCH v1 1/3] drm/ras: Cancel and free message on get counter failure Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-14 20:28 ` [PATCH v1 2/3] drm/xe/drm_ras: Make counter allocation drm managed Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-14 20:28 ` [PATCH v1 3/3] drm/xe/drm_ras: Add per node cleanup action Raag Jadav
2026-05-16 0:24 ` Claude review: " Claude Code Review Bot
2026-05-16 0:24 ` Claude review: DRM RAS Fixes Claude Code Review Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox