public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed
* Re: [PATCH] drm/amdgpu: fix PASID task_info lookup race
       [not found] <20260309160403.599472-1-fanwu01@zju.edu.cn>
@ 2026-03-09 16:54 ` Christian König
  2026-03-10  0:58   ` [PATCH v2] " Fan Wu
                     ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Christian König @ 2026-03-09 16:54 UTC (permalink / raw)
  To: Fan Wu, Alex Deucher; +Cc: amd-gfx, dri-devel, linux-kernel

On 3/9/26 17:04, Fan Wu wrote:
> The amdgpu_vm_get_task_info_pasid() function previously called
> amdgpu_vm_get_vm_from_pasid() which returns a raw VM pointer after
> releasing the pasids xarray lock. The caller then dereferences
> vm->task_info without any lifetime protection.
> 
> Race condition:
> 
>     CPU 0 (lookup)                      CPU 1 (release)
>     ------------------                  ------------------
>     amdgpu_vm_get_task_info_pasid()
>       xa_lock()
>       vm = xa_load(pasids)
>       xa_unlock()
>                                         amdgpu_vm_fini()
>                                           xa_erase_irq(pasids)
>                                           // teardown continues
>                                         kfree(fpriv)
>                                         // VM freed (embedded in fpriv)
>       vm->task_info  // potential UAF
> 
> This can leave the VM pointer dangling because struct amdgpu_vm is
> embedded in struct amdgpu_fpriv which is freed via kfree(fpriv) in
> amdgpu_file_release_kms() after amdgpu_vm_fini() returns.
> 
> Fix this by acquiring the task_info reference while holding the
> xarray lock. This avoids the window where the VM could be freed
> between the lookup and the dereference.
> 
> Cache vm->task_info in a local variable before attempting to take a
> reference, which keeps the lookup straightforward inside the locked
> section. Use kref_get_unless_zero() to safely handle the case where
> task_info's refcount is already being decremented to zero by another
> thread in the teardown path.
> 
> Note: An RCU-based approach was considered but is not currently
> feasible because: (1) the pasids xarray is initialized without
> XA_FLAGS_RCU, and (2) struct amdgpu_fpriv is freed with kfree()
> rather than kfree_rcu(). A future refactoring could enable RCU
> if needed for performance.
> 
> Also remove the unsafe helper function amdgpu_vm_get_vm_from_pasid()
> to prevent future misuse.
> 
> Fixes: b8f67b9ddf4f ("drm/amdgpu: change vm->task_info handling")
> Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 ++++++++++++++++----------
>  1 file changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index f2beb980e3c3..7e8621c9b661 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2468,19 +2468,6 @@ static void amdgpu_vm_destroy_task_info(struct kref *kref)
>         kfree(ti);
>  }
> 
> -static inline struct amdgpu_vm *
> -amdgpu_vm_get_vm_from_pasid(struct amdgpu_device *adev, u32 pasid)
> -{
> -       struct amdgpu_vm *vm;
> -       unsigned long flags;
> -
> -       xa_lock_irqsave(&adev->vm_manager.pasids, flags);
> -       vm = xa_load(&adev->vm_manager.pasids, pasid);
> -       xa_unlock_irqrestore(&adev->vm_manager.pasids, flags);
> -
> -       return vm;
> -}
> -
>  /**
>   * amdgpu_vm_put_task_info - reference down the vm task_info ptr
>   *
> @@ -2527,8 +2514,31 @@ amdgpu_vm_get_task_info_vm(struct amdgpu_vm *vm)
>  struct amdgpu_task_info *
>  amdgpu_vm_get_task_info_pasid(struct amdgpu_device *adev, u32 pasid)
>  {
> -       return amdgpu_vm_get_task_info_vm(
> -                       amdgpu_vm_get_vm_from_pasid(adev, pasid));
> +       struct amdgpu_vm *vm;
> +       unsigned long flags;
> +       struct amdgpu_task_info *ti = NULL;
> +
> +       /*
> +        * Acquire the task_info reference while holding the pasids xarray
> +        * lock to prevent a race with amdgpu_vm_fini() which removes the
> +        * PASID mapping before freeing the VM (embedded in struct amdgpu_fpriv).
> +        * Without this, the VM could be freed between xa_load() return and
> +        * the task_info dereference.

That the VM is freed is irrelevant, the point is that we need to grab the reference to the task info before we drop that one.

> +        */
> +       xa_lock_irqsave(&adev->vm_manager.pasids, flags);
> +       vm = xa_load(&adev->vm_manager.pasids, pasid);
> +       if (vm) {
> +               /*
> +                * Cache vm->task_info in a local variable before
> +                * attempting to take a reference.
> +                */

Please drop that comment, taking the task info into a local variable is actually superflous.

> +               ti = vm->task_info;
> +               if (ti && !kref_get_unless_zero(&ti->refcount))

That is unecessary as wel, the task info is dropped after the VM is removed from pasid mapping.

So just using kref_get() is sufficient.

Regards,
Christian.

> +                       ti = NULL;
> +       }
> +       xa_unlock_irqrestore(&adev->vm_manager.pasids, flags);
> +
> +       return ti;
>  }
> 
>  static int amdgpu_vm_create_task_info(struct amdgpu_vm *vm)
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2] drm/amdgpu: fix PASID task_info lookup race
  2026-03-09 16:54 ` [PATCH] drm/amdgpu: fix PASID task_info lookup race Christian König
@ 2026-03-10  0:58   ` Fan Wu
  2026-03-11  3:51   ` Claude review: Re: [PATCH] " Claude Code Review Bot
  2026-03-11  3:51   ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Fan Wu @ 2026-03-10  0:58 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel,
	Fan Wu

amdgpu_vm_get_task_info_pasid() currently looks up the VM from the
PASID xarray, drops the xarray lock, and only then grabs the task_info
reference through amdgpu_vm_get_task_info_vm().

Take the task_info reference directly while holding the PASID xarray
lock instead. This keeps the lookup and reference acquisition in the
same critical section.

The task_info is dropped only after the VM is removed from the PASID
mapping, so a regular kref_get() is sufficient here.

Also remove the now unnecessary amdgpu_vm_get_vm_from_pasid() helper.

Fixes: b8f67b9ddf4f ("drm/amdgpu: change vm->task_info handling")
Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
---
v2:
- grab the task_info reference directly under the PASID xarray lock
- drop the extra local-variable comment
- use kref_get() instead of kref_get_unless_zero()
- simplify the changelog

 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 35 +++++++++++++++-----------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index f2beb980e3c3..706ca8dd65d2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2468,19 +2468,6 @@ static void amdgpu_vm_destroy_task_info(struct kref *kref)
 	kfree(ti);
 }
 
-static inline struct amdgpu_vm *
-amdgpu_vm_get_vm_from_pasid(struct amdgpu_device *adev, u32 pasid)
-{
-	struct amdgpu_vm *vm;
-	unsigned long flags;
-
-	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
-	vm = xa_load(&adev->vm_manager.pasids, pasid);
-	xa_unlock_irqrestore(&adev->vm_manager.pasids, flags);
-
-	return vm;
-}
-
 /**
  * amdgpu_vm_put_task_info - reference down the vm task_info ptr
  *
@@ -2527,8 +2514,26 @@ amdgpu_vm_get_task_info_vm(struct amdgpu_vm *vm)
 struct amdgpu_task_info *
 amdgpu_vm_get_task_info_pasid(struct amdgpu_device *adev, u32 pasid)
 {
-	return amdgpu_vm_get_task_info_vm(
-			amdgpu_vm_get_vm_from_pasid(adev, pasid));
+	struct amdgpu_vm *vm;
+	unsigned long flags;
+	struct amdgpu_task_info *ti = NULL;
+
+	/*
+	 * Acquire the task_info reference while holding the pasids xarray
+	 * lock to prevent a race with amdgpu_vm_fini() which removes the
+	 * PASID mapping before freeing the VM (embedded in struct amdgpu_fpriv).
+	 * Without this, the VM could be freed between xa_load() return and
+	 * the task_info dereference.
+	 */
+	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
+	vm = xa_load(&adev->vm_manager.pasids, pasid);
+	if (vm && vm->task_info) {
+		ti = vm->task_info;
+		kref_get(&ti->refcount);
+	}
+	xa_unlock_irqrestore(&adev->vm_manager.pasids, flags);
+
+	return ti;
 }
 
 static int amdgpu_vm_create_task_info(struct amdgpu_vm *vm)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Claude review: Re: [PATCH] drm/amdgpu: fix PASID task_info lookup race
  2026-03-09 16:54 ` [PATCH] drm/amdgpu: fix PASID task_info lookup race Christian König
  2026-03-10  0:58   ` [PATCH v2] " Fan Wu
@ 2026-03-11  3:51   ` Claude Code Review Bot
  2026-03-11  3:51   ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-03-11  3:51 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: Re: [PATCH] drm/amdgpu: fix PASID task_info lookup race
Author: =?UTF-8?Q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>
Patches: 2
Reviewed: 2026-03-11T13:51:08.979104

---

This is a single-patch fix for a real use-after-free race condition in `amdgpu_vm_get_task_info_pasid()`. The race is between the PASID lookup path and `amdgpu_vm_fini()`:

1. Thread A calls `amdgpu_vm_get_vm_from_pasid()`, gets a pointer to the VM, releases the xarray lock.
2. Thread B runs `amdgpu_vm_fini()`, calls `xa_erase_irq()` to remove the PASID mapping, then eventually calls `amdgpu_vm_put_task_info()` which may free `task_info`.
3. Thread A dereferences `vm->task_info` — use-after-free.

The fix correctly collapses the lookup and reference acquisition into a single critical section under the xarray lock. The approach is sound and well-motivated.

**Verdict: The patch is correct and should be accepted, with minor observations below.**

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Claude review: Re: [PATCH] drm/amdgpu: fix PASID task_info lookup race
  2026-03-09 16:54 ` [PATCH] drm/amdgpu: fix PASID task_info lookup race Christian König
  2026-03-10  0:58   ` [PATCH v2] " Fan Wu
  2026-03-11  3:51   ` Claude review: Re: [PATCH] " Claude Code Review Bot
@ 2026-03-11  3:51   ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-03-11  3:51 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**Correctness of the fix:**

The core change is correct. By holding `xa_lock_irqsave` across both the `xa_load()` and the `kref_get()`, the race with `amdgpu_vm_fini()` is closed. In `amdgpu_vm_fini()` (line 2784-2786), `xa_erase_irq()` takes the same xarray lock internally, so the PASID entry cannot be removed while we hold the lock, guaranteeing the VM (and its `task_info`) remains valid.

**Use of `kref_get()` vs `kref_get_unless_zero()`:**

The commit message states:

> The task_info is dropped only after the VM is removed from the PASID mapping, so a regular kref_get() is sufficient here.

This is correct. Looking at `amdgpu_vm_fini()`, `xa_erase_irq()` happens at line 2785, while `amdgpu_vm_put_task_info(vm->task_info)` happens at line 2841 — well after the PASID is erased. Since the xarray lock serializes against the erase, if `xa_load()` returns a non-NULL VM, the task_info refcount is guaranteed to be >= 1. `kref_get()` is the right choice.

**NULL check improvement:**

The new code:
```c
if (vm && vm->task_info) {
    ti = vm->task_info;
    kref_get(&ti->refcount);
}
```

This adds a `vm->task_info` NULL check that the old `amdgpu_vm_get_task_info_vm()` lacked — the old code would crash if `vm->task_info` were NULL. This is a subtle improvement.

**Comment quality:**

The inline comment explaining the race is well-written and helpful for future maintainers.

**Minor observation — `amdgpu_vm_get_task_info_vm` left with same bug:**

The remaining `amdgpu_vm_get_task_info_vm()` function (lines 2510-2520) still unconditionally dereferences `vm->task_info` without a NULL check:
```c
if (vm) {
    ti = vm->task_info;
    kref_get(&vm->task_info->refcount);  /* crashes if task_info is NULL */
}
```
If `task_info` can be NULL (which this patch implicitly acknowledges by checking it), then `amdgpu_vm_get_task_info_vm()` has the same potential NULL dereference. This is a pre-existing issue and not introduced by this patch, but it might be worth fixing in a follow-up. The callers of `amdgpu_vm_get_task_info_vm()` likely always have a valid `task_info`, so it may be benign in practice.

**No other concerns.** The patch is clean, minimal, and fixes a real race condition.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-03-11  3:51 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260309160403.599472-1-fanwu01@zju.edu.cn>
2026-03-09 16:54 ` [PATCH] drm/amdgpu: fix PASID task_info lookup race Christian König
2026-03-10  0:58   ` [PATCH v2] " Fan Wu
2026-03-11  3:51   ` Claude review: Re: [PATCH] " Claude Code Review Bot
2026-03-11  3:51   ` Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox