public inbox for drm-ai-reviews@public-inbox.freedesktop.org
* [PATCH v2] drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault
@ 2026-04-20 12:14 Pierre-Eric Pelloux-Prayer
  2026-04-20 12:29 ` Christian König
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2026-04-20 12:14 UTC
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Pierre-Eric Pelloux-Prayer
  Cc: amd-gfx, dri-devel, linux-kernel

svm_range_restore_pages might reserve the root bo so it must
be called after unreserving it.

---
v2:
  - don't modify amdgpu_vm_lock_by_pasid
  - add a TODO
---

Fixes: 32b486e8541c ("drm/amdgpu: extract amdgpu_vm_lock_by_pasid from amdgpu_vm_handle_fault")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 63156289ae7f..799a1803d941 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -3026,11 +3026,22 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 
 	is_compute_context = vm->is_compute_context;
 
-	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
-	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
+	if (is_compute_context) {
+		/* Unreserve root since svm_range_restore_pages might try to reserve it. */
+		/* TODO: rework svm_range_restore_pages so that this isn't necessary. */
 		amdgpu_bo_unreserve(root);
+
+		if (!svm_range_restore_pages(adev, pasid, vmid,
+					     node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
+			amdgpu_bo_unref(&root);
+			return true;
+		}
 		amdgpu_bo_unref(&root);
-		return true;
+
+		/* Double check that the VM still exists. */
+		vm = amdgpu_vm_lock_by_pasid(adev, &root, pasid);
+		if (!vm)
+			return false;
 	}
 
 	addr /= AMDGPU_GPU_PAGE_SIZE;
-- 
2.43.0



* Re: [PATCH v2] drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault
  2026-04-20 12:14 [PATCH v2] drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault Pierre-Eric Pelloux-Prayer
@ 2026-04-20 12:29 ` Christian König
  2026-04-22 23:58 ` Claude review: " Claude Code Review Bot
  2026-04-22 23:58 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Christian König @ 2026-04-20 12:29 UTC
  To: Pierre-Eric Pelloux-Prayer, Alex Deucher, David Airlie,
	Simona Vetter
  Cc: amd-gfx, dri-devel, linux-kernel

On 4/20/26 14:14, Pierre-Eric Pelloux-Prayer wrote:
> svm_range_restore_pages might reserve the root bo so it must
> be called after unreserving it.
> 
> ---
> v2:
>   - don't modify amdgpu_vm_lock_by_pasid
>   - add a TODO
> ---
> 
> Fixes: 32b486e8541c ("drm/amdgpu: extract amdgpu_vm_lock_by_pasid from amdgpu_vm_handle_fault")
> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 63156289ae7f..799a1803d941 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3026,11 +3026,22 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>  
>  	is_compute_context = vm->is_compute_context;
>  
> -	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
> -	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
> +	if (is_compute_context) {
> +		/* Unreserve root since svm_range_restore_pages might try to reserve it. */
> +		/* TODO: rework svm_range_restore_pages so that this isn't necessary. */
>  		amdgpu_bo_unreserve(root);
> +
> +		if (!svm_range_restore_pages(adev, pasid, vmid,
> +					     node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
> +			amdgpu_bo_unref(&root);
> +			return true;
> +		}
>  		amdgpu_bo_unref(&root);
> -		return true;
> +
> +		/* Double check that the VM still exists. */

Probably better to write "Re-acquire the VM lock, could be that the VM was freed in between.".

With that done Reviewed-by: Christian König <christian.koenig@amd.com>.

> +		vm = amdgpu_vm_lock_by_pasid(adev, &root, pasid);
> +		if (!vm)
> +			return false;
>  	}
>  
>  	addr /= AMDGPU_GPU_PAGE_SIZE;



* Claude review: drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault
  2026-04-20 12:14 [PATCH v2] drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault Pierre-Eric Pelloux-Prayer
  2026-04-20 12:29 ` Christian König
@ 2026-04-22 23:58 ` Claude Code Review Bot
  2026-04-22 23:58 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-04-22 23:58 UTC
  To: dri-devel-reviews

Overall Series Review

Subject: drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault
Author: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Patches: 2
Reviewed: 2026-04-23T09:58:50.258731

---

This is a single-patch fix for a real deadlock in `amdgpu_vm_handle_fault`. The bug: `amdgpu_vm_handle_fault` takes the root BO reservation through `amdgpu_vm_lock_by_pasid` (via `amdgpu_bo_reserve`) and then, with that reservation still held, calls `svm_range_restore_pages`, which eventually reaches `svm_range_reserve_bos` → `amdgpu_vm_lock_pd` and tries to lock the same root BO via `drm_exec`. The first lock is held without a `ww_acquire_ctx` (NULL) and the second is attempted with the `drm_exec` context, so the `ww_mutex` cannot recognize the second attempt as a re-lock by the same owner, and the thread blocks on a lock it already holds.

The fix is to unreserve root before calling `svm_range_restore_pages` and, when that call fails and the handler has to continue, to re-acquire it via `amdgpu_vm_lock_by_pasid`. This is the correct minimal approach. The unlock/relock window is properly guarded by the re-validation in `amdgpu_vm_lock_by_pasid` (which re-checks the pasid→vm mapping and verifies `vm->root.bo` still matches).

**Verdict: The patch is correct and should be safe to apply.** One minor concern, a missing `Cc: stable` tag, is noted in the patch review.

---
Generated by Claude Code Patch Reviewer


* Claude review: drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault
  2026-04-20 12:14 [PATCH v2] drm/amdgpu: fix root reservation in amdgpu_vm_handle_fault Pierre-Eric Pelloux-Prayer
  2026-04-20 12:29 ` Christian König
  2026-04-22 23:58 ` Claude review: " Claude Code Review Bot
@ 2026-04-22 23:58 ` Claude Code Review Bot
  2 siblings, 0 replies; 4+ messages in thread
From: Claude Code Review Bot @ 2026-04-22 23:58 UTC
  To: dri-devel-reviews

Patch Review

**The deadlock path (confirmed by code reading):**
1. `amdgpu_vm_handle_fault` → `amdgpu_vm_lock_by_pasid` → `amdgpu_bo_reserve(root, true)` — acquires `dma_resv` lock with NULL `ww_acquire_ctx`
2. → `svm_range_restore_pages` → `svm_range_validate_and_map` → `svm_range_reserve_bos` → `amdgpu_vm_lock_pd(vm, &ctx->exec, 2)` — tries to lock the same root BO via `drm_exec` with a non-NULL `ww_acquire_ctx`
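3. Same thread, same mutex, but the holder used a NULL acquire context and the second attempt uses the `drm_exec` one, so the re-lock is not detected as such and the thread waits on itself → deadlock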
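
The three steps above can be reproduced with a toy lock in plain userspace C. This is only a sketch under stated assumptions: `toy_resv`, `toy_ctx`, `toy_lock`, `toy_unlock` and the two `fault_*` helpers are hypothetical names, not kernel APIs, and the model returns `TOY_WOULD_BLOCK` at the point where a real lock would sleep, so that the example terminates instead of hanging.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's ww_mutex or dma_resv types. */
struct toy_ctx { int id; };

struct toy_resv {
	int locked;                 /* 1 while some caller holds the lock */
	const struct toy_ctx *ctx;  /* acquire context of the holder, NULL if none */
};

enum toy_result {
	TOY_ACQUIRED,     /* lock was free and is now held */
	TOY_ALREADY,      /* holder used the same non-NULL context */
	TOY_WOULD_BLOCK,  /* a real lock would sleep here until the holder unlocks */
};

enum toy_result toy_lock(struct toy_resv *r, const struct toy_ctx *ctx)
{
	if (!r->locked) {
		r->locked = 1;
		r->ctx = ctx;
		return TOY_ACQUIRED;
	}
	/* A re-lock is only recognized when both sides share a context. */
	if (ctx && r->ctx == ctx)
		return TOY_ALREADY;
	return TOY_WOULD_BLOCK;
}

void toy_unlock(struct toy_resv *r)
{
	assert(r->locked);
	r->locked = 0;
	r->ctx = NULL;
}

/* Before the patch: root stays reserved across the SVM call. */
enum toy_result fault_before_patch(struct toy_resv *root)
{
	struct toy_ctx exec = { .id = 1 };
	enum toy_result res;

	toy_lock(root, NULL);          /* models amdgpu_bo_reserve() */
	res = toy_lock(root, &exec);   /* models the drm_exec lock of the root PD */
	toy_unlock(root);              /* drop the first lock so the model stays reusable */
	return res;
}

/* After the patch: root is unreserved before the SVM call. */
enum toy_result fault_after_patch(struct toy_resv *root)
{
	struct toy_ctx exec = { .id = 1 };
	enum toy_result res;

	toy_lock(root, NULL);
	toy_unlock(root);              /* models amdgpu_bo_unreserve() */
	res = toy_lock(root, &exec);
	if (res == TOY_ACQUIRED)
		toy_unlock(root);
	return res;
}
```

With a zero-initialized `struct toy_resv`, `fault_before_patch()` reports `TOY_WOULD_BLOCK` and `fault_after_patch()` reports `TOY_ACQUIRED`, which is the difference the patch is after.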

**Correctness of the fix:**

The new code structure:
```c
if (is_compute_context) {
    amdgpu_bo_unreserve(root);
    if (!svm_range_restore_pages(...)) {
        amdgpu_bo_unref(&root);
        return true;
    }
    amdgpu_bo_unref(&root);
    vm = amdgpu_vm_lock_by_pasid(adev, &root, pasid);
    if (!vm)
        return false;
}
```

- **Success path** (`svm_range_restore_pages` returns 0): unreserve, unref, return true — correctly matches original behavior.
- **Failure path** (non-zero return): unreserve + unref the old root, then re-acquire a fresh lock+ref via `amdgpu_vm_lock_by_pasid`. Falls through to the NORETRY PTE installation path at line 3037 (`flags = AMDGPU_VM_NORETRY_FLAGS`), which needs root reserved — satisfied by the fresh lock.
- **VM destruction race**: Between unreserve/unref and the re-lock call, the VM could be torn down. `amdgpu_vm_lock_by_pasid` handles this correctly — it re-looks up the pasid, re-refs the root BO, re-reserves it, and double-checks that `vm->root.bo` still matches (lines 2979–2983 in the existing code).
- **`amdgpu_bo_unref(&root)`** nulls out `root`, so the subsequent `amdgpu_vm_lock_by_pasid` writes to a clean pointer. No use-after-free risk.
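
The look-up, reference, reserve, re-check sequence described in these bullets can be sketched in userspace C. Everything here is a hypothetical stand-in (`toy_bo`, `toy_vm`, `toy_pasid_table`, `toy_lock_by_pasid`, `toy_bo_unref`); it mirrors the shape of the logic as described in this review, not the actual amdgpu implementation, and being single-threaded it cannot show the race itself.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PASID 1

/* Hypothetical userspace stand-ins, not the amdgpu structures. */
struct toy_bo { int refcount; int reserved; };
struct toy_vm { struct toy_bo *root; };

/* Tiny "pasid -> vm" table; a NULL slot means the VM has been torn down. */
struct toy_vm *toy_pasid_table[TOY_MAX_PASID];

/* Drop one reference and clear the caller's pointer. */
void toy_bo_unref(struct toy_bo **bo)
{
	if (*bo) {
		(*bo)->refcount--;
		*bo = NULL;
	}
}

/* Look up the VM, take a reference on its root, reserve it, then re-check. */
struct toy_vm *toy_lock_by_pasid(struct toy_bo **root, unsigned int pasid)
{
	struct toy_vm *vm;

	if (pasid >= TOY_MAX_PASID)
		return NULL;

	vm = toy_pasid_table[pasid];
	if (!vm)
		return NULL;

	*root = vm->root;
	(*root)->refcount++;
	(*root)->reserved = 1;

	/* In real code the VM may have changed while waiting for the lock. */
	vm = toy_pasid_table[pasid];
	if (!vm || vm->root != *root) {
		(*root)->reserved = 0;
		toy_bo_unref(root);
		return NULL;
	}
	return vm;
}
```

After an unreserve plus `toy_bo_unref(&root)`, `root` is NULL, so a later `toy_lock_by_pasid()` call either fills it with a freshly referenced and reserved BO or leaves it NULL when the VM is gone.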

**Minor observations:**

1. **Missing `Cc: stable@vger.kernel.org`**: The patch has a `Fixes:` tag pointing to commit `32b486e8541c` but no `Cc: stable`. If this deadlock is reproducible in practice (e.g., SVM/XNACK workloads on compute contexts), it should probably carry a stable tag.

2. **Two-line comment block**: The `/* Unreserve root ... */` and `/* TODO: ... */` are two separate single-line comments. A single multi-line comment would be more conventional kernel style, but this is purely cosmetic.

3. **The TODO is well-placed**: The unlock/relock pattern is inherently fragile — if future code adds state between the initial lock and the `is_compute_context` check, it could be invalidated by the relock. The TODO acknowledging that `svm_range_restore_pages` should be reworked to not need the root reservation (or to accept it already held) is appropriate guidance for a future cleanup.

4. **Behavioral change is intentional and correct**: In the original code, for compute contexts where `svm_range_restore_pages` fails, the function falls through with the original lock still held. In the new code, it drops and re-acquires the lock. This is equivalent — the re-acquired lock protects the same critical section, and the `vm` pointer is refreshed.
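
To make the lock and reference balance of the new compute-context branch easier to audit, here is a small userspace model of its three outcomes. The names (`toy_state`, `toy_outcome`, `toy_compute_fault`) and the two boolean/integer inputs are assumptions made for illustration; the real function obtains them from `svm_range_restore_pages` and `amdgpu_vm_lock_by_pasid`.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome {
	TOY_HANDLED_BY_SVM,  /* handler returns true, nothing left held */
	TOY_VM_GONE,         /* handler returns false, nothing left held */
	TOY_FALL_THROUGH,    /* continue to the PTE update with root reserved */
};

struct toy_state {
	bool root_reserved;  /* does the handler hold the root reservation? */
	int root_refs;       /* root references held by the handler */
};

enum toy_outcome toy_compute_fault(struct toy_state *s, int svm_ret,
				   bool vm_still_exists)
{
	/* On entry the handler holds one reference and the reservation. */
	assert(s->root_reserved && s->root_refs == 1);

	s->root_reserved = false;       /* models amdgpu_bo_unreserve(root) */

	/* svm_range_restore_pages() would run here, without root reserved. */
	if (svm_ret == 0) {
		s->root_refs--;         /* models amdgpu_bo_unref(&root) */
		return TOY_HANDLED_BY_SVM;
	}
	s->root_refs--;                 /* models amdgpu_bo_unref(&root) */

	/* Models amdgpu_vm_lock_by_pasid() failing after a teardown. */
	if (!vm_still_exists)
		return TOY_VM_GONE;

	/* A successful relock takes a fresh reference and reservation. */
	s->root_refs++;
	s->root_reserved = true;
	return TOY_FALL_THROUGH;
}
```

Only the fall-through outcome leaves the handler holding a reference and the reservation, which matches the state the rest of the function expects.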

Overall this is a clean, minimal fix for a real deadlock, and it merits a Reviewed-by.

---
Generated by Claude Code Patch Reviewer

