* Claude review: drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
@ 2026-03-03 3:05 ` Claude Code Review Bot
0 siblings, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the payoff patch. It extends the two-pass model to also defer TLB invalidation waits. However, the implementation effectively creates a **three-pass** flow using a **two-pass** infrastructure, which is somewhat surprising.
**Three-pass flow via two-pass API:**
The commit message describes three flows:
```
pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
pass1 (fences done) -> do_inval -> invalidate_finish -> complete_tlb_inval (deferred TLB)
pass1 (finish occupied) -> do_inval (sync TLB, inline)
```
In the second flow, `xe_vma_userptr_do_inval` returns `&userptr->finish`, i.e. it requests another finish callback. That pointer is handled by `xe_vma_userptr_invalidate_finish`, which checks `tlb_inval_submitted` and dispatches to `xe_vma_userptr_complete_tlb_inval`. On the face of it this works, because `xe_vma_userptr_do_inval` is called from `xe_vma_userptr_invalidate_finish`, which in turn is called from `mn_itree_finish_pass`.
**At first glance, though, this doesn't achieve the third pass.** Looking at `mn_itree_finish_pass`:
```c
static void mn_itree_finish_pass(struct llist_head *finish_passes)
{
struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
struct mmu_interval_notifier_finish *f, *next;
llist_for_each_entry_safe(f, next, first, link)
f->notifier->ops->invalidate_finish(f);
}
```
The `invalidate_finish` callback can obtain a new `finish` pointer from `xe_vma_userptr_do_inval`, but **nobody collects it**: the callback is `void`, so the pointer returned into `xe_vma_userptr_invalidate_finish` is simply dropped. The flow where `xe_vma_userptr_do_inval` returns `&userptr->finish` from within the `invalidate_finish` callback therefore looks problematic.
**Re-reading `xe_vma_userptr_invalidate_finish`, however:**
```c
static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
{
...
down_write(&vm->svm.gpusvm.notifier_lock);
if (uvma->userptr.tlb_inval_submitted)
xe_vma_userptr_complete_tlb_inval(vm, uvma);
else
xe_vma_userptr_do_inval(vm, uvma, true);
up_write(&vm->svm.gpusvm.notifier_lock);
...
}
```
When `tlb_inval_submitted` is false (the common case for the "fences pending -> finish" path), it calls `xe_vma_userptr_do_inval(vm, uvma, true)`. That function now returns a `struct mmu_interval_notifier_finish *`, but the return value is **ignored** here. If `xe_vma_userptr_do_inval` returns a non-NULL finish (because it submitted TLB invalidation asynchronously and set `finish_inuse = true`, `tlb_inval_submitted = true`), then:
1. The TLB invalidation is submitted but never waited on
2. `finish_inuse` is set to true and never cleared
3. `tlb_inval_submitted` is set to true and never cleared
**This appears to be a bug.** The deferred TLB path from within `invalidate_finish` has no mechanism to schedule another finish callback because the two-pass infrastructure only supports... two passes. The three-pass flow described in the commit message doesn't actually work as described.
Re-examining `xe_vma_userptr_do_inval` more carefully:
```c
if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
if (!userptr->finish_inuse) {
userptr->finish_inuse = true;
userptr->tlb_inval_submitted = true;
err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
XE_WARN_ON(err);
return &userptr->finish;
}
err = xe_vm_invalidate_vma(vma);
XE_WARN_ON(err);
}
if (is_deferred)
userptr->finish_inuse = false;
```
When called from `xe_vma_userptr_invalidate_finish` (i.e., `is_deferred = true`), `finish_inuse` is still `true` (it was set in pass1). So the `!userptr->finish_inuse` check fails, and it falls through to the synchronous `xe_vm_invalidate_vma` path, then clears `finish_inuse`. So the three-pass flow **doesn't actually happen** for the "fences pending" case. The deferred TLB path only triggers when `signaled == true` in pass1 (all fences already done, so do_inval is called inline in pass1).
So the actual flows are:
1. **Fences pending**: pass1 defers -> finish calls do_inval -> `finish_inuse` is true -> falls through to sync TLB -> returns NULL (ignored by the void finish path, harmlessly)
2. **Fences done, finish_inuse == false**: pass1 calls do_inval inline -> do_inval sees `!finish_inuse` -> submits TLB async -> returns finish ptr -> pass1 returns this to notifier core -> finish calls `complete_tlb_inval`
3. **Fences done, finish_inuse == true**: pass1 calls do_inval inline -> falls through to sync TLB -> returns NULL
So the logic is correct, and the initial bug analysis above was wrong. Flow #2 does mean that the "two-pass" sequence, from the MMU notifier's perspective, is: pass1 (enable sw signaling + submit TLB) -> pass2 (wait for TLB); the dma_resv wait is a no-op there because all fences are already signaled.
However, this is still confusing. The `is_deferred` parameter controls whether `finish_inuse` is cleared. In flow #2, `is_deferred = false` (called from pass1), so `finish_inuse` is NOT cleared at the bottom of `do_inval`. It remains true and is cleared later in `xe_vma_userptr_complete_tlb_inval`. This is correct but subtle; the sketch below models the flag lifecycle.
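To make that lifecycle concrete, here is a minimal user-space model of the two flags across the three flows. This is an illustration only: the flag names mirror the patch, the driver calls are stubbed out as comments, and fault mode with `initial_bind` set is assumed throughout.
```c
#include <stdbool.h>
#include <stdio.h>

/* Models only the finish_inuse / tlb_inval_submitted lifecycle. */
struct state { bool finish_inuse, tlb_inval_submitted; };

/* pass1: returns true if a finish (second) pass is requested. */
static bool pass1(struct state *s, bool fences_signaled)
{
	if (fences_signaled || s->finish_inuse) {
		/* do_inval(is_deferred = false) */
		if (!s->finish_inuse) {			/* flow #2 */
			s->finish_inuse = true;		/* cleared in pass2 */
			s->tlb_inval_submitted = true;	/* async TLB submit */
			return true;			/* &userptr->finish */
		}
		return false;				/* flow #3: sync TLB inline */
	}
	s->finish_inuse = true;				/* flow #1: defer fence wait */
	return true;
}

static void pass2(struct state *s)
{
	if (s->tlb_inval_submitted) {		/* complete_tlb_inval() */
		s->tlb_inval_submitted = false;
		s->finish_inuse = false;
	} else {				/* do_inval(is_deferred = true) */
		/* finish_inuse is still set, so the sync TLB path runs */
		s->finish_inuse = false;
	}
}

int main(void)
{
	struct state s = { false, false };

	if (pass1(&s, true))	/* flow #2: all fences already signaled */
		pass2(&s);
	printf("inuse=%d submitted=%d\n", s.finish_inuse, s.tlb_inval_submitted);
	return 0;
}
```
Exercising flows #1 and #2 through this model ends with both flags clear; flow #3 leaves them untouched because they are owned by the concurrent invalidation. That matches the conclusion that the code is correct despite the `is_deferred` asymmetry.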
**`xe_vma_userptr_force_invalidate` chain in patch 4:**
```c
finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
if (finish)
finish = xe_vma_userptr_do_inval(vm, uvma, true);
if (finish)
xe_vma_userptr_complete_tlb_inval(vm, uvma);
```
Note that `finish` here is the `static` variable from patch 2 (still a bug). But the logic is: if pass1 deferred, do the inval; if that deferred too, complete TLB. This correctly handles all three steps synchronously for the force-invalidate testing path.
**Embedding `xe_tlb_inval_batch` in `xe_userptr`:**
```c
struct xe_tlb_inval_batch inval_batch;
```
This embeds four `xe_tlb_inval_fence` structs (roughly 400-500 bytes) in every `xe_userptr`, a significant per-VMA memory overhead. Consider whether the batch should be allocated only when needed, or whether the cost is justified by avoiding allocation in the notifier callback path; a lazily allocated alternative is sketched below.
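If the per-VMA cost matters, one alternative is to allocate the batch lazily in the deferred branch with GFP_NOWAIT and fall back to the existing synchronous path on failure, mirroring the fallback strategy patch 1's kerneldoc already recommends for the finish structure. This is a sketch only; nothing in the series does this, and `inval_batch` would have to become a pointer member:
```c
/* Hypothetical variant of the deferred branch in xe_vma_userptr_do_inval().
 * Assumes userptr->inval_batch is changed to a pointer; not part of the
 * series.
 */
if (!userptr->finish_inuse) {
	userptr->inval_batch = kmalloc(sizeof(*userptr->inval_batch),
				       GFP_NOWAIT);
	if (userptr->inval_batch) {
		userptr->finish_inuse = true;
		userptr->tlb_inval_submitted = true;
		err = xe_vm_invalidate_vma_submit(vma, userptr->inval_batch);
		XE_WARN_ON(err);
		return &userptr->finish;
	}
	/* Allocation failed under memory pressure: take the sync path. */
}
err = xe_vm_invalidate_vma(vma);
XE_WARN_ON(err);
```
`xe_vma_userptr_complete_tlb_inval` would then `kfree()` the batch after `xe_tlb_inval_batch_wait()`. Whether the extra failure path is worth the ~0.5 KiB saved per VMA is a judgment call for the authors.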
**Summary of issues in patch 4:**
1. The `static` variable from patch 2 is still present (bug, should be fixed in patch 2)
2. The code flow is complex and the commit message's three-flow description is somewhat misleading about when each flow actually triggers
3. The return value of `xe_vma_userptr_do_inval` is ignored in `xe_vma_userptr_invalidate_finish` - this is actually correct behavior (it will always return NULL in that context because `finish_inuse` is already set), but deserves a comment
4. Per-VMA memory overhead from embedding `xe_tlb_inval_batch`
---
**Summary of actionable items:**
1. **Bug (Patch 2):** Remove `static` from `finish` variable in `xe_vma_userptr_force_invalidate`
2. **Minor (Patch 1):** Consider adding `WARN_ON` if both `invalidate` and `invalidate_start` are set in ops during registration
3. **Minor (Patch 3):** Inconsistent local variable naming (`_batch` vs `batch`)
4. **Minor (Patch 3):** `xe_tlb_inval_types.h` including `xe_device_types.h` is heavy for a types header
5. **Documentation (Patch 4):** Add a comment in `xe_vma_userptr_invalidate_finish` explaining that `xe_vma_userptr_do_inval` always returns NULL when called from the finish path (because `finish_inuse` is already set); suggested wording follows this list
6. **Design consideration (Patch 4):** Memory cost of embedding `xe_tlb_inval_batch` in every `xe_userptr`
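For item 5, the comment could read roughly as follows (suggested wording, not text from the series):
```c
	/*
	 * finish_inuse is still set from pass1 here, so
	 * xe_vma_userptr_do_inval() cannot take the deferred-submit branch
	 * and always returns NULL; ignoring its return value is safe.
	 */
	xe_vma_userptr_do_inval(vm, uvma, true);
```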
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 0/4] Two-pass MMU interval notifiers
@ 2026-03-03 13:34 Thomas Hellström
2026-03-03 13:34 ` [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Thomas Hellström @ 2026-03-03 13:34 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
dri-devel, linux-mm, linux-kernel, Christian König
GPU use-cases for mmu_interval_notifiers with hmm often involve
starting a gpu operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing.
With single-pass notifiers per GPU this doesn't scale in
multi-gpu scenarios. There we'd want to first start
preemption or TLB flushing on all GPUs and, as a second pass,
wait for them to complete.
This also applies in non-recoverable page-fault scenarios:
starting preemption requests on GPUs and waiting for the GPUs
to preempt so that system pages they access can be reclaimed.
One can do this on a per-driver basis, multiplexing per-driver
notifiers, but that would mean sharing the notifier "user" lock
across all GPUs, which doesn't scale well either, so adding support
for two-pass in the core appears to be the right choice.
This series does that, with patch 1 implementing the core support
and describing the choices made.
The rest of the patches implement a POC with xeKMD userptr
invalidation and potential TLB-flushing. A follow-up series
will extend this to drm_gpusvm.
v2 highlights:
- Refactor the core mm patch to use the struct
mmu_interval_notifier_ops for the invalidate_finish() callback.
- Rebase on xe driver tlb invalidation changes.
- Provide an initial implementation for userptr instead of drm_gpusvm.
The intent is to handle drm_gpusvm in a follow-up series.
v3:
- Address review comments from Matt Brost: Code formatting,
documentation, additional asserts and removal of
unnecessary waits, as specified in each patch.
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <dri-devel@lists.freedesktop.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>
Thomas Hellström (4):
mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
drm/xe/userptr: Convert invalidation to two-pass MMU notifier
drm/xe: Split TLB invalidation into submit and wait steps
drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass
if possible
drivers/gpu/drm/xe/xe_svm.c | 8 +-
drivers/gpu/drm/xe/xe_tlb_inval.c | 84 +++++++++++++
drivers/gpu/drm/xe/xe_tlb_inval.h | 6 +
drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++
drivers/gpu/drm/xe/xe_userptr.c | 155 ++++++++++++++++++++----
drivers/gpu/drm/xe/xe_userptr.h | 31 ++++-
drivers/gpu/drm/xe/xe_vm.c | 99 +++++----------
drivers/gpu/drm/xe/xe_vm.h | 5 +-
drivers/gpu/drm/xe/xe_vm_madvise.c | 10 +-
drivers/gpu/drm/xe/xe_vm_types.h | 1 +
include/linux/mmu_notifier.h | 38 ++++++
mm/mmu_notifier.c | 65 ++++++++--
12 files changed, 412 insertions(+), 104 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
@ 2026-03-03 13:34 ` Thomas Hellström
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Thomas Hellström @ 2026-03-03 13:34 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Jason Gunthorpe, Andrew Morton,
Simona Vetter, Dave Airlie, Alistair Popple, dri-devel, linux-mm,
linux-kernel, Matthew Brost, Christian König
GPU use-cases for mmu_interval_notifiers with hmm often involve
starting a gpu operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing.
With single-pass notifiers per GPU this doesn't scale in
multi-gpu scenarios. There we'd want to first start
preemption or TLB flushing on all GPUs and, as a second pass,
wait for them to complete.
One can do this on a per-driver basis, multiplexing per-driver
notifiers, but that would mean sharing the notifier "user" lock
across all GPUs, which doesn't scale well either, so adding support
for multi-pass in the core appears to be the right choice.
Implement two-pass capability in the mmu_interval_notifier. Use a
linked list for the final passes to minimize the impact for
use-cases that don't need the multi-pass functionality by avoiding
a second interval tree walk, and to be able to easily pass data
between the two passes.
v1:
- Restrict to two passes (Jason Gunthorpe)
- Improve on documentation (Jason Gunthorpe)
- Improve on function naming (Alistair Popple)
v2:
- Include the invalidate_finish() callback in the
struct mmu_interval_notifier_ops.
- Update documentation (GitHub Copilot:claude-sonnet-4.6)
- Use lockless list for list management.
v3:
- Update kerneldoc for the struct mmu_interval_notifier_finish::list member
(Matthew Brost)
- Add a WARN_ON_ONCE() checking for a NULL invalidate_finish() op
if invalidate_start() is non-NULL. (Matthew Brost)
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <dri-devel@lists.freedesktop.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>
Assisted-by: GitHub Copilot:claude-sonnet-4.6 # Documentation only.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
include/linux/mmu_notifier.h | 38 +++++++++++++++++++++
mm/mmu_notifier.c | 65 +++++++++++++++++++++++++++++++-----
2 files changed, 94 insertions(+), 9 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..37b683163235 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,16 +233,54 @@ struct mmu_notifier {
unsigned int users;
};
+/**
+ * struct mmu_interval_notifier_finish - mmu_interval_notifier two-pass abstraction
+ * @link: Lockless list link for the notifiers pending pass list
+ * @notifier: The mmu_interval_notifier for which the finish pass is called.
+ *
+ * Allocate, typically using GFP_NOWAIT in the interval notifier's first pass.
+ * If allocation fails (which is not unlikely under memory pressure), fall back
+ * to single-pass operation. Note that with a large number of notifiers
+ * implementing two passes, allocation with GFP_NOWAIT will become increasingly
+ * likely to fail, so consider implementing a small pool instead of using
+ * kmalloc() allocations.
+ *
+ * If the implementation needs to pass data between the two passes,
+ * the recommended way is to embed struct mmu_interval_notifier_finish into a larger
+ * structure that also contains the data needed to be shared. Keep in mind that
+ * a notifier callback can be invoked in parallel, and each invocation needs its
+ * own struct mmu_interval_notifier_finish.
+ */
+struct mmu_interval_notifier_finish {
+ struct llist_node link;
+ struct mmu_interval_notifier *notifier;
+};
+
/**
* struct mmu_interval_notifier_ops
* @invalidate: Upon return the caller must stop using any SPTEs within this
* range. This function can sleep. Return false only if sleeping
* was required but mmu_notifier_range_blockable(range) is false.
+ * @invalidate_start: Similar to @invalidate, but intended for two-pass notifier
+ * callbacks where the call to @invalidate_start is the first
+ * pass and any struct mmu_interval_notifier_finish pointer
+ * returned in the @finish parameter describes the final pass.
+ * If @finish is %NULL on return, then no final pass will be
+ * called.
+ * @invalidate_finish: Called as the second pass for any notifier that returned
+ * a non-NULL @finish from @invalidate_start. The @finish
+ * pointer passed here is the same one returned by
+ * @invalidate_start.
*/
struct mmu_interval_notifier_ops {
bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
const struct mmu_notifier_range *range,
unsigned long cur_seq);
+ bool (*invalidate_start)(struct mmu_interval_notifier *interval_sub,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq,
+ struct mmu_interval_notifier_finish **finish);
+ void (*invalidate_finish)(struct mmu_interval_notifier_finish *finish);
};
struct mmu_interval_notifier {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..4d8a64ce8eda 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -260,6 +260,15 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
}
EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
+static void mn_itree_finish_pass(struct llist_head *finish_passes)
+{
+ struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
+ struct mmu_interval_notifier_finish *f, *next;
+
+ llist_for_each_entry_safe(f, next, first, link)
+ f->notifier->ops->invalidate_finish(f);
+}
+
static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
struct mm_struct *mm)
{
@@ -271,6 +280,7 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
.end = ULONG_MAX,
};
struct mmu_interval_notifier *interval_sub;
+ LLIST_HEAD(finish_passes);
unsigned long cur_seq;
bool ret;
@@ -278,11 +288,27 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
interval_sub;
interval_sub = mn_itree_inv_next(interval_sub, &range)) {
- ret = interval_sub->ops->invalidate(interval_sub, &range,
- cur_seq);
+ if (interval_sub->ops->invalidate_start) {
+ struct mmu_interval_notifier_finish *finish = NULL;
+
+ ret = interval_sub->ops->invalidate_start(interval_sub,
+ &range,
+ cur_seq,
+ &finish);
+ if (ret && finish) {
+ finish->notifier = interval_sub;
+ __llist_add(&finish->link, &finish_passes);
+ }
+
+ } else {
+ ret = interval_sub->ops->invalidate(interval_sub,
+ &range,
+ cur_seq);
+ }
WARN_ON(!ret);
}
+ mn_itree_finish_pass(&finish_passes);
mn_itree_inv_end(subscriptions);
}
@@ -430,7 +456,9 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
const struct mmu_notifier_range *range)
{
struct mmu_interval_notifier *interval_sub;
+ LLIST_HEAD(finish_passes);
unsigned long cur_seq;
+ int err = 0;
for (interval_sub =
mn_itree_inv_start_range(subscriptions, range, &cur_seq);
@@ -438,23 +466,41 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
interval_sub = mn_itree_inv_next(interval_sub, range)) {
bool ret;
- ret = interval_sub->ops->invalidate(interval_sub, range,
- cur_seq);
+ if (interval_sub->ops->invalidate_start) {
+ struct mmu_interval_notifier_finish *finish = NULL;
+
+ ret = interval_sub->ops->invalidate_start(interval_sub,
+ range,
+ cur_seq,
+ &finish);
+ if (ret && finish) {
+ finish->notifier = interval_sub;
+ __llist_add(&finish->link, &finish_passes);
+ }
+
+ } else {
+ ret = interval_sub->ops->invalidate(interval_sub,
+ range,
+ cur_seq);
+ }
if (!ret) {
if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
- goto out_would_block;
+ err = -EAGAIN;
+ break;
}
}
- return 0;
-out_would_block:
+ mn_itree_finish_pass(&finish_passes);
+
/*
* On -EAGAIN the non-blocking caller is not allowed to call
* invalidate_range_end()
*/
- mn_itree_inv_end(subscriptions);
- return -EAGAIN;
+ if (err)
+ mn_itree_inv_end(subscriptions);
+
+ return err;
}
static int mn_hlist_invalidate_range_start(
@@ -976,6 +1022,7 @@ int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
struct mmu_notifier_subscriptions *subscriptions;
int ret;
+ WARN_ON_ONCE(ops->invalidate_start && !ops->invalidate_finish);
might_lock(&mm->mmap_lock);
subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
2026-03-03 13:34 ` [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
@ 2026-03-03 13:34 ` Thomas Hellström
2026-03-03 18:10 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
` (2 subsequent siblings)
4 siblings, 2 replies; 13+ messages in thread
From: Thomas Hellström @ 2026-03-03 13:34 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
In multi-GPU scenarios, asynchronous GPU job latency is a bottleneck if
each notifier waits for its own GPU before returning. The two-pass
mmu_interval_notifier infrastructure allows deferring the wait to a
second pass, so all GPUs can be signalled in the first pass before
any of them are waited on.
Convert the userptr invalidation to use the two-pass model:
Use invalidate_start as the first pass to mark the VMA for repin and
enable software signalling on the VM reservation fences to start any
gpu work needed for signaling. Fall back to completing the work
synchronously if all fences are already signalled, or if a concurrent
invalidation is already using the embedded finish structure.
Use invalidate_finish as the second pass to wait for the reservation
fences to complete, invalidate the GPU TLB in fault mode, and unmap
the gpusvm pages.
Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
avoid dynamic allocation in the notifier callback. Use a finish_inuse
flag to prevent two concurrent invalidations from using it
simultaneously; fall back to the synchronous path for the second caller.
v3:
- Add locking asserts in notifier components (Matt Brost)
- Clean up newlines (Matt Brost)
- Update the userptr notifier state member locking documentation
(Matt Brost)
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_userptr.c | 108 +++++++++++++++++++++++++-------
drivers/gpu/drm/xe/xe_userptr.h | 14 ++++-
2 files changed, 99 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index e120323c43bc..37032b8125a6 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -10,6 +10,14 @@
#include "xe_trace_bo.h"
+static void xe_userptr_assert_in_notifier(struct xe_vm *vm)
+{
+ lockdep_assert(lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 0) ||
+ (lockdep_is_held(&vm->lock) &&
+ lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 1) &&
+ dma_resv_held(xe_vm_resv(vm))));
+}
+
/**
* xe_vma_userptr_check_repin() - Advisory check for repin needed
* @uvma: The userptr vma
@@ -73,18 +81,46 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
&ctx);
}
-static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
+ bool is_deferred)
{
struct xe_userptr *userptr = &uvma->userptr;
struct xe_vma *vma = &uvma->vma;
- struct dma_resv_iter cursor;
- struct dma_fence *fence;
struct drm_gpusvm_ctx ctx = {
.in_notifier = true,
.read_only = xe_vma_read_only(vma),
};
long err;
+ xe_userptr_assert_in_notifier(vm);
+
+ err = dma_resv_wait_timeout(xe_vm_resv(vm),
+ DMA_RESV_USAGE_BOOKKEEP,
+ false, MAX_SCHEDULE_TIMEOUT);
+ XE_WARN_ON(err <= 0);
+
+ if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+ err = xe_vm_invalidate_vma(vma);
+ XE_WARN_ON(err);
+ }
+
+ if (is_deferred)
+ userptr->finish_inuse = false;
+ drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
+ xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+}
+
+static struct mmu_interval_notifier_finish *
+xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vma *vma = &uvma->vma;
+ struct dma_resv_iter cursor;
+ struct dma_fence *fence;
+ bool signaled = true;
+
+ xe_userptr_assert_in_notifier(vm);
+
/*
* Tell exec and rebind worker they need to repin and rebind this
* userptr.
@@ -105,27 +141,32 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
*/
dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
DMA_RESV_USAGE_BOOKKEEP);
- dma_resv_for_each_fence_unlocked(&cursor, fence)
+ dma_resv_for_each_fence_unlocked(&cursor, fence) {
dma_fence_enable_sw_signaling(fence);
+ if (signaled && !dma_fence_is_signaled(fence))
+ signaled = false;
+ }
dma_resv_iter_end(&cursor);
- err = dma_resv_wait_timeout(xe_vm_resv(vm),
- DMA_RESV_USAGE_BOOKKEEP,
- false, MAX_SCHEDULE_TIMEOUT);
- XE_WARN_ON(err <= 0);
-
- if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
- err = xe_vm_invalidate_vma(vma);
- XE_WARN_ON(err);
+ /*
+ * Only one caller at a time can use the multi-pass state.
+ * If it's already in use, or all fences are already signaled,
+ * proceed directly to invalidation without deferring.
+ */
+ if (signaled || userptr->finish_inuse) {
+ xe_vma_userptr_do_inval(vm, uvma, false);
+ return NULL;
}
- drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
- xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+ userptr->finish_inuse = true;
+
+ return &userptr->finish;
}
-static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
- const struct mmu_notifier_range *range,
- unsigned long cur_seq)
+static bool xe_vma_userptr_invalidate_start(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq,
+ struct mmu_interval_notifier_finish **p_finish)
{
struct xe_userptr_vma *uvma = container_of(mni, typeof(*uvma), userptr.notifier);
struct xe_vma *vma = &uvma->vma;
@@ -138,21 +179,40 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
return false;
vm_dbg(&xe_vma_vm(vma)->xe->drm,
- "NOTIFIER: addr=0x%016llx, range=0x%016llx",
+ "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
xe_vma_start(vma), xe_vma_size(vma));
down_write(&vm->svm.gpusvm.notifier_lock);
mmu_interval_set_seq(mni, cur_seq);
- __vma_userptr_invalidate(vm, uvma);
+ *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
+
up_write(&vm->svm.gpusvm.notifier_lock);
- trace_xe_vma_userptr_invalidate_complete(vma);
+ if (!*p_finish)
+ trace_xe_vma_userptr_invalidate_complete(vma);
return true;
}
+static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
+{
+ struct xe_userptr_vma *uvma = container_of(finish, typeof(*uvma), userptr.finish);
+ struct xe_vma *vma = &uvma->vma;
+ struct xe_vm *vm = xe_vma_vm(vma);
+
+ vm_dbg(&xe_vma_vm(vma)->xe->drm,
+ "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
+ xe_vma_start(vma), xe_vma_size(vma));
+
+ down_write(&vm->svm.gpusvm.notifier_lock);
+ xe_vma_userptr_do_inval(vm, uvma, true);
+ up_write(&vm->svm.gpusvm.notifier_lock);
+ trace_xe_vma_userptr_invalidate_complete(vma);
+}
+
static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
- .invalidate = vma_userptr_invalidate,
+ .invalidate_start = xe_vma_userptr_invalidate_start,
+ .invalidate_finish = xe_vma_userptr_invalidate_finish,
};
#if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
@@ -164,6 +224,7 @@ static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
*/
void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
{
+ static struct mmu_interval_notifier_finish *finish;
struct xe_vm *vm = xe_vma_vm(&uvma->vma);
/* Protect against concurrent userptr pinning */
@@ -179,7 +240,10 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
if (!mmu_interval_read_retry(&uvma->userptr.notifier,
uvma->userptr.pages.notifier_seq))
uvma->userptr.pages.notifier_seq -= 2;
- __vma_userptr_invalidate(vm, uvma);
+
+ finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
+ if (finish)
+ xe_vma_userptr_do_inval(vm, uvma, true);
}
#endif
diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
index ef801234991e..e1830c2f5fd2 100644
--- a/drivers/gpu/drm/xe/xe_userptr.h
+++ b/drivers/gpu/drm/xe/xe_userptr.h
@@ -56,7 +56,19 @@ struct xe_userptr {
* @notifier: MMU notifier for user pointer (invalidation call back)
*/
struct mmu_interval_notifier notifier;
-
+ /**
+ * @finish: MMU notifier finish structure for two-pass invalidation.
+ * Embedded here to avoid allocation in the notifier callback.
+ * Protected by struct xe_vm::svm.gpusvm.notifier_lock in write mode
+ * alternatively by the same lock in read mode *and* the vm resv held.
+ */
+ struct mmu_interval_notifier_finish finish;
+ /**
+ * @finish_inuse: Whether @finish is currently in use by an in-progress
+ * two-pass invalidation.
+ * Protected using the same locking as @finish.
+ */
+ bool finish_inuse;
/**
* @initial_bind: user pointer has been bound at least once.
* write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
2026-03-03 13:34 ` [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
2026-03-03 13:34 ` [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
@ 2026-03-03 13:34 ` Thomas Hellström
2026-03-03 18:13 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
2026-03-03 21:09 ` Claude review: Two-pass MMU interval notifiers Claude Code Review Bot
4 siblings, 2 replies; 13+ messages in thread
From: Thomas Hellström @ 2026-03-03 13:34 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests to
all GTs in a tile mask and then immediately waits for them to complete
before returning. This is fine for the existing callers, but a
subsequent patch will need to defer the wait in order to overlap TLB
invalidations across multiple VMAs.
Introduce xe_tlb_inval_range_tilemask_submit() and
xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
halves respectively. The batch of fences is carried in the new
xe_tlb_inval_batch structure. Remove xe_vm_range_tilemask_tlb_inval()
and convert all three call sites to the new API.
v3:
- Don't wait on TLB invalidation batches if the corresponding batch
submit returns an error. (Matt Brost)
- s/_batch/batch/ (Matt Brost)
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 8 ++-
drivers/gpu/drm/xe/xe_tlb_inval.c | 84 +++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
drivers/gpu/drm/xe/xe_vm.c | 69 +++-----------------
drivers/gpu/drm/xe/xe_vm.h | 3 -
drivers/gpu/drm/xe/xe_vm_madvise.c | 10 ++-
drivers/gpu/drm/xe/xe_vm_types.h | 1 +
8 files changed, 127 insertions(+), 68 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 002b6c22ad3f..a91c84487a67 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -19,6 +19,7 @@
#include "xe_pt.h"
#include "xe_svm.h"
#include "xe_tile.h"
+#include "xe_tlb_inval.h"
#include "xe_ttm_vram_mgr.h"
#include "xe_vm.h"
#include "xe_vm_types.h"
@@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
const struct mmu_notifier_range *mmu_range)
{
struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+ struct xe_tlb_inval_batch batch;
struct xe_device *xe = vm->xe;
struct drm_gpusvm_range *r, *first;
struct xe_tile *tile;
@@ -276,8 +278,10 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
xe_device_wmb(xe);
- err = xe_vm_range_tilemask_tlb_inval(vm, adj_start, adj_end, tile_mask);
- WARN_ON_ONCE(err);
+ err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid, adj_start, adj_end,
+ tile_mask, &batch);
+ if (!WARN_ON_ONCE(err))
+ xe_tlb_inval_batch_wait(&batch);
range_notifier_event_end:
r = first;
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c b/drivers/gpu/drm/xe/xe_tlb_inval.c
index 933f30fb617d..10dcd4abb00f 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval.c
+++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
@@ -486,3 +486,87 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval)
guard(spinlock_irq)(&tlb_inval->pending_lock);
return list_is_singular(&tlb_inval->pending_fences);
}
+
+/**
+ * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB invalidation batch
+ * @batch: Batch of TLB invalidation fences to wait on
+ *
+ * Waits for every fence in @batch to signal, then resets @batch so it can be
+ * reused for a subsequent invalidation.
+ */
+void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
+{
+ struct xe_tlb_inval_fence *fence = &batch->fence[0];
+ unsigned int i;
+
+ for (i = 0; i < batch->num_fences; ++i)
+ xe_tlb_inval_fence_wait(fence++);
+
+ batch->num_fences = 0;
+}
+
+/**
+ * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations for an
+ * address range on a tile mask
+ * @xe: The xe device
+ * @asid: Address space ID
+ * @start: start address
+ * @end: end address
+ * @tile_mask: mask for which gt's issue tlb invalidation
+ * @batch: Batch of tlb invalidate fences
+ *
+ * Issue a range based TLB invalidation for gt's in tilemask
+ * If the function returns an error, there is no need to call
+ * xe_tlb_inval_batch_wait() on @batch.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
+ u64 start, u64 end, u8 tile_mask,
+ struct xe_tlb_inval_batch *batch)
+{
+ struct xe_tlb_inval_fence *fence = &batch->fence[0];
+ struct xe_tile *tile;
+ u32 fence_id = 0;
+ u8 id;
+ int err;
+
+ batch->num_fences = 0;
+ if (!tile_mask)
+ return 0;
+
+ for_each_tile(tile, xe, id) {
+ if (!(tile_mask & BIT(id)))
+ continue;
+
+ xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
+ &fence[fence_id], true);
+
+ err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
+ &fence[fence_id], start, end,
+ asid, NULL);
+ if (err)
+ goto wait;
+ ++fence_id;
+
+ if (!tile->media_gt)
+ continue;
+
+ xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
+ &fence[fence_id], true);
+
+ err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
+ &fence[fence_id], start, end,
+ asid, NULL);
+ if (err)
+ goto wait;
+ ++fence_id;
+ }
+
+wait:
+ batch->num_fences = fence_id;
+ if (err)
+ xe_tlb_inval_batch_wait(batch);
+
+ return err;
+}
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h b/drivers/gpu/drm/xe/xe_tlb_inval.h
index 62089254fa23..a76b7823a5f2 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval.h
+++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
@@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct xe_tlb_inval *tlb_inval, int seqno);
bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
+int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
+ u64 start, u64 end, u8 tile_mask,
+ struct xe_tlb_inval_batch *batch);
+
+void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
+
#endif /* _XE_TLB_INVAL_ */
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
index 3b089f90f002..3d1797d186fd 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
+++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
@@ -9,6 +9,8 @@
#include <linux/workqueue.h>
#include <linux/dma-fence.h>
+#include "xe_device_types.h"
+
struct drm_suballoc;
struct xe_tlb_inval;
@@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
ktime_t inval_time;
};
+/**
+ * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
+ *
+ * Holds one fence per GT covered by a TLB invalidation request.
+ */
+struct xe_tlb_inval_batch {
+ /** @fence: per-GT TLB invalidation fences */
+ struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
+ /** @num_fences: number of valid entries in @fence */
+ unsigned int num_fences;
+};
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 548b0769b3ef..a3c2e8cefec7 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
dma_resv_unlock(xe_vm_resv(vm));
}
-/**
- * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on this tilemask for an
- * address range
- * @vm: The VM
- * @start: start address
- * @end: end address
- * @tile_mask: mask for which gt's issue tlb invalidation
- *
- * Issue a range based TLB invalidation for gt's in tilemask
- *
- * Returns 0 for success, negative error code otherwise.
- */
-int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
- u64 end, u8 tile_mask)
-{
- struct xe_tlb_inval_fence
- fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
- struct xe_tile *tile;
- u32 fence_id = 0;
- u8 id;
- int err;
-
- if (!tile_mask)
- return 0;
-
- for_each_tile(tile, vm->xe, id) {
- if (!(tile_mask & BIT(id)))
- continue;
-
- xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
- &fence[fence_id], true);
-
- err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
- &fence[fence_id], start, end,
- vm->usm.asid, NULL);
- if (err)
- goto wait;
- ++fence_id;
-
- if (!tile->media_gt)
- continue;
-
- xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
- &fence[fence_id], true);
-
- err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
- &fence[fence_id], start, end,
- vm->usm.asid, NULL);
- if (err)
- goto wait;
- ++fence_id;
- }
-
-wait:
- for (id = 0; id < fence_id; ++id)
- xe_tlb_inval_fence_wait(&fence[id]);
-
- return err;
-}
-
/**
* xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
* @vma: VMA to invalidate
@@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
{
struct xe_device *xe = xe_vma_vm(vma)->xe;
struct xe_vm *vm = xe_vma_vm(vma);
+ struct xe_tlb_inval_batch batch;
struct xe_tile *tile;
u8 tile_mask = 0;
int ret = 0;
@@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
xe_device_wmb(xe);
- ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma), xe_vma_start(vma),
- xe_vma_end(vma), tile_mask);
+ ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
+ xe_vma_start(vma), xe_vma_end(vma),
+ tile_mask, &batch);
/* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
+ if (!ret)
+ xe_tlb_inval_batch_wait(&batch);
+
return ret;
}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index f849e369432b..62f4b6fec0bc 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
struct xe_svm_range *range);
-int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
- u64 end, u8 tile_mask);
-
int xe_vm_invalidate_vma(struct xe_vma *vma);
int xe_vm_validate_protected(struct xe_vm *vm);
diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c b/drivers/gpu/drm/xe/xe_vm_madvise.c
index 95bf53cc29e3..02daf8a93044 100644
--- a/drivers/gpu/drm/xe/xe_vm_madvise.c
+++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
@@ -12,6 +12,7 @@
#include "xe_pat.h"
#include "xe_pt.h"
#include "xe_svm.h"
+#include "xe_tlb_inval.h"
struct xe_vmas_in_madvise_range {
u64 addr;
@@ -235,13 +236,20 @@ static u8 xe_zap_ptes_in_madvise_range(struct xe_vm *vm, u64 start, u64 end)
static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64 start, u64 end)
{
u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start, end);
+ struct xe_tlb_inval_batch batch;
+ int err;
if (!tile_mask)
return 0;
xe_device_wmb(vm->xe);
- return xe_vm_range_tilemask_tlb_inval(vm, start, end, tile_mask);
+ err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm->usm.asid, start, end,
+ tile_mask, &batch);
+ if (!err)
+ xe_tlb_inval_batch_wait(&batch);
+
+ return err;
}
static bool madvise_args_are_sane(struct xe_device *xe, const struct drm_xe_madvise *args)
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1f6f7e30e751..de6544165cfa 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -18,6 +18,7 @@
#include "xe_device_types.h"
#include "xe_pt_types.h"
#include "xe_range_fence.h"
+#include "xe_tlb_inval_types.h"
#include "xe_userptr.h"
struct drm_pagemap;
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
` (2 preceding siblings ...)
2026-03-03 13:34 ` [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
@ 2026-03-03 13:34 ` Thomas Hellström
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 21:09 ` Claude review: Two-pass MMU interval notifiers Claude Code Review Bot
4 siblings, 1 reply; 13+ messages in thread
From: Thomas Hellström @ 2026-03-03 13:34 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
Now that the two-pass notifier flow uses xe_vma_userptr_do_inval() for
the fence-wait + TLB-invalidate work, extend it to support a further
deferred TLB wait:
- xe_vma_userptr_do_inval(): when the embedded finish handle is free,
submit the TLB invalidation asynchronously (xe_vm_invalidate_vma_submit)
and return &userptr->finish so the mmu_notifier core schedules a third
pass. When the handle is occupied by a concurrent invalidation, fall
back to the synchronous xe_vm_invalidate_vma() path.
- xe_vma_userptr_complete_tlb_inval(): new helper called from
invalidate_finish when tlb_inval_submitted is set. Waits for the
previously submitted batch and unmaps the gpusvm pages.
xe_vma_userptr_invalidate_finish() dispatches between the two helpers
via tlb_inval_submitted, making the three possible flows explicit:
pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
pass1 (fences done) -> do_inval -> invalidate_finish
-> complete_tlb_inval (deferred TLB)
pass1 (finish occupied) -> do_inval (sync TLB, inline)
In multi-GPU scenarios this allows TLB flushes to be submitted on all
GPUs in one pass before any of them are waited on.
Also adds xe_vm_invalidate_vma_submit() which submits the TLB range
invalidation without blocking, populating a xe_tlb_inval_batch that
the caller waits on separately.
v3:
- Add locking asserts and notifier state asserts (Matt Brost)
- Update the locking documentation of the notifier
state members (Matt Brost)
- Remove unrelated code formatting changes (Matt Brost)
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_userptr.c | 63 ++++++++++++++++++++++++++++-----
drivers/gpu/drm/xe/xe_userptr.h | 17 +++++++++
drivers/gpu/drm/xe/xe_vm.c | 38 +++++++++++++++-----
drivers/gpu/drm/xe/xe_vm.h | 2 ++
4 files changed, 104 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index 37032b8125a6..6761005c0b90 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -8,6 +8,7 @@
#include <linux/mm.h>
+#include "xe_tlb_inval.h"
#include "xe_trace_bo.h"
static void xe_userptr_assert_in_notifier(struct xe_vm *vm)
@@ -81,8 +82,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
&ctx);
}
-static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
- bool is_deferred)
+static struct mmu_interval_notifier_finish *
+xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma, bool is_deferred)
{
struct xe_userptr *userptr = &uvma->userptr;
struct xe_vma *vma = &uvma->vma;
@@ -93,6 +94,8 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
long err;
xe_userptr_assert_in_notifier(vm);
+ if (is_deferred)
+ xe_assert(vm->xe, userptr->finish_inuse && !userptr->tlb_inval_submitted);
err = dma_resv_wait_timeout(xe_vm_resv(vm),
DMA_RESV_USAGE_BOOKKEEP,
@@ -100,6 +103,19 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
XE_WARN_ON(err <= 0);
if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+ if (!userptr->finish_inuse) {
+ /*
+ * Defer the TLB wait to an extra pass so the caller
+ * can pipeline TLB flushes across GPUs before waiting
+ * on any of them.
+ */
+ xe_assert(vm->xe, !userptr->tlb_inval_submitted);
+ userptr->finish_inuse = true;
+ userptr->tlb_inval_submitted = true;
+ err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
+ XE_WARN_ON(err);
+ return &userptr->finish;
+ }
err = xe_vm_invalidate_vma(vma);
XE_WARN_ON(err);
}
@@ -108,6 +124,28 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
userptr->finish_inuse = false;
drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+ return NULL;
+}
+
+static void
+xe_vma_userptr_complete_tlb_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vma *vma = &uvma->vma;
+ struct drm_gpusvm_ctx ctx = {
+ .in_notifier = true,
+ .read_only = xe_vma_read_only(vma),
+ };
+
+ xe_userptr_assert_in_notifier(vm);
+ xe_assert(vm->xe, userptr->finish_inuse);
+ xe_assert(vm->xe, userptr->tlb_inval_submitted);
+
+ xe_tlb_inval_batch_wait(&userptr->inval_batch);
+ userptr->tlb_inval_submitted = false;
+ userptr->finish_inuse = false;
+ drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
+ xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
}
static struct mmu_interval_notifier_finish *
@@ -153,11 +191,10 @@ xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
* If it's already in use, or all fences are already signaled,
* proceed directly to invalidation without deferring.
*/
- if (signaled || userptr->finish_inuse) {
- xe_vma_userptr_do_inval(vm, uvma, false);
- return NULL;
- }
+ if (signaled || userptr->finish_inuse)
+ return xe_vma_userptr_do_inval(vm, uvma, false);
+ /* Defer: the notifier core will call invalidate_finish once done. */
userptr->finish_inuse = true;
return &userptr->finish;
@@ -205,7 +242,15 @@ static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish
xe_vma_start(vma), xe_vma_size(vma));
down_write(&vm->svm.gpusvm.notifier_lock);
- xe_vma_userptr_do_inval(vm, uvma, true);
+ /*
+ * If a TLB invalidation was previously submitted (deferred from the
+ * synchronous pass1 fallback), wait for it and unmap pages.
+ * Otherwise, fences have now completed: invalidate the TLB and unmap.
+ */
+ if (uvma->userptr.tlb_inval_submitted)
+ xe_vma_userptr_complete_tlb_inval(vm, uvma);
+ else
+ xe_vma_userptr_do_inval(vm, uvma, true);
up_write(&vm->svm.gpusvm.notifier_lock);
trace_xe_vma_userptr_invalidate_complete(vma);
}
@@ -243,7 +288,9 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
if (finish)
- xe_vma_userptr_do_inval(vm, uvma, true);
+ finish = xe_vma_userptr_do_inval(vm, uvma, true);
+ if (finish)
+ xe_vma_userptr_complete_tlb_inval(vm, uvma);
}
#endif
diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
index e1830c2f5fd2..2a3cd1b5efbb 100644
--- a/drivers/gpu/drm/xe/xe_userptr.h
+++ b/drivers/gpu/drm/xe/xe_userptr.h
@@ -14,6 +14,8 @@
#include <drm/drm_gpusvm.h>
+#include "xe_tlb_inval_types.h"
+
struct xe_vm;
struct xe_vma;
struct xe_userptr_vma;
@@ -63,12 +65,27 @@ struct xe_userptr {
* alternatively by the same lock in read mode *and* the vm resv held.
*/
struct mmu_interval_notifier_finish finish;
+ /**
+ * @inval_batch: TLB invalidation batch for deferred completion.
+ * Stores an in-flight TLB invalidation submitted during a two-pass
+ * notifier so the wait can be deferred to a subsequent pass, allowing
+ * multiple GPUs to be signalled before any of them are waited on.
+ * Protected using the same locking as @finish.
+ */
+ struct xe_tlb_inval_batch inval_batch;
/**
* @finish_inuse: Whether @finish is currently in use by an in-progress
* two-pass invalidation.
* Protected using the same locking as @finish.
*/
bool finish_inuse;
+ /**
+ * @tlb_inval_submitted: Whether a TLB invalidation has been submitted
+ * via @inval_batch and is pending completion. When set, the next pass
+ * must call xe_tlb_inval_batch_wait() before reusing @inval_batch.
+ * Protected using the same locking as @finish.
+ */
+ bool tlb_inval_submitted;
/**
* @initial_bind: user pointer has been bound at least once.
* write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index a3c2e8cefec7..fdad9329dfb4 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3967,20 +3967,23 @@ void xe_vm_unlock(struct xe_vm *vm)
}
/**
- * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
+ * xe_vm_invalidate_vma_submit - Submit a job to invalidate GPU mappings for
+ * VMA.
* @vma: VMA to invalidate
+ * @batch: TLB invalidation batch to populate; caller must later call
+ * xe_tlb_inval_batch_wait() on it to wait for completion
*
* Walks a list of page tables leaves which it memset the entries owned by this
- * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
- * complete.
+ * VMA to zero, invalidates the TLBs, but doesn't block waiting for TLB flush
+ * to complete, but instead populates @batch which can be waited on using
+ * xe_tlb_inval_batch_wait().
*
* Returns 0 for success, negative error code otherwise.
*/
-int xe_vm_invalidate_vma(struct xe_vma *vma)
+int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch)
{
struct xe_device *xe = xe_vma_vm(vma)->xe;
struct xe_vm *vm = xe_vma_vm(vma);
- struct xe_tlb_inval_batch batch;
struct xe_tile *tile;
u8 tile_mask = 0;
int ret = 0;
@@ -4023,14 +4026,33 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
xe_vma_start(vma), xe_vma_end(vma),
- tile_mask, &batch);
+ tile_mask, batch);
/* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
+ return ret;
+}
+
+/**
+ * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
+ * @vma: VMA to invalidate
+ *
+ * Walks a list of page tables leaves which it memset the entries owned by this
+ * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
+ * complete.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_vm_invalidate_vma(struct xe_vma *vma)
+{
+ struct xe_tlb_inval_batch batch;
+ int ret;
- if (!ret)
- xe_tlb_inval_batch_wait(&batch);
+ ret = xe_vm_invalidate_vma_submit(vma, &batch);
+ if (ret)
+ return ret;
+ xe_tlb_inval_batch_wait(&batch);
return ret;
}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 62f4b6fec0bc..0bc7ed23eeae 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -242,6 +242,8 @@ struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
int xe_vm_invalidate_vma(struct xe_vma *vma);
+int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch);
+
int xe_vm_validate_protected(struct xe_vm *vm);
static inline void xe_vm_queue_rebind_worker(struct xe_vm *vm)
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-03 13:34 ` [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
@ 2026-03-03 18:10 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
1 sibling, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2026-03-03 18:10 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Tue, Mar 03, 2026 at 02:34:07PM +0100, Thomas Hellström wrote:
> In multi-GPU scenarios, asynchronous GPU job latency is a bottleneck if
> each notifier waits for its own GPU before returning. The two-pass
> mmu_interval_notifier infrastructure allows deferring the wait to a
> second pass, so all GPUs can be signalled in the first pass before
> any of them are waited on.
>
> Convert the userptr invalidation to use the two-pass model:
>
> Use invalidate_start as the first pass to mark the VMA for repin and
> enable software signalling on the VM reservation fences to start any
> gpu work needed for signaling. Fall back to completing the work
> synchronously if all fences are already signalled, or if a concurrent
> invalidation is already using the embedded finish structure.
>
> Use invalidate_finish as the second pass to wait for the reservation
> fences to complete, invalidate the GPU TLB in fault mode, and unmap
> the gpusvm pages.
>
> Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
> avoid dynamic allocation in the notifier callback. Use a finish_inuse
> flag to prevent two concurrent invalidations from using it
> simultaneously; fall back to the synchronous path for the second caller.
>
> v3:
> - Add locking asserts in notifier components (Matt Brost)
> - Clean up newlines (Matt Brost)
> - Update the userptr notifier state member locking documentation
> (Matt Brost)
>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_userptr.c | 108 +++++++++++++++++++++++++-------
> drivers/gpu/drm/xe/xe_userptr.h | 14 ++++-
> 2 files changed, 99 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> index e120323c43bc..37032b8125a6 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -10,6 +10,14 @@
>
> #include "xe_trace_bo.h"
>
> +static void xe_userptr_assert_in_notifier(struct xe_vm *vm)
> +{
> + lockdep_assert(lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 0) ||
> + (lockdep_is_held(&vm->lock) &&
> + lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 1) &&
> + dma_resv_held(xe_vm_resv(vm))));
> +}
> +
> /**
> * xe_vma_userptr_check_repin() - Advisory check for repin needed
> * @uvma: The userptr vma
> @@ -73,18 +81,46 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> &ctx);
> }
>
> -static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> +static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
> + bool is_deferred)
> {
> struct xe_userptr *userptr = &uvma->userptr;
> struct xe_vma *vma = &uvma->vma;
> - struct dma_resv_iter cursor;
> - struct dma_fence *fence;
> struct drm_gpusvm_ctx ctx = {
> .in_notifier = true,
> .read_only = xe_vma_read_only(vma),
> };
> long err;
>
> + xe_userptr_assert_in_notifier(vm);
> +
> + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> + DMA_RESV_USAGE_BOOKKEEP,
> + false, MAX_SCHEDULE_TIMEOUT);
> + XE_WARN_ON(err <= 0);
> +
> + if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> + err = xe_vm_invalidate_vma(vma);
> + XE_WARN_ON(err);
> + }
> +
> + if (is_deferred)
> + userptr->finish_inuse = false;
> + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> + xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> +}
> +
> +static struct mmu_interval_notifier_finish *
> +xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> +{
> + struct xe_userptr *userptr = &uvma->userptr;
> + struct xe_vma *vma = &uvma->vma;
> + struct dma_resv_iter cursor;
> + struct dma_fence *fence;
> + bool signaled = true;
> +
> + xe_userptr_assert_in_notifier(vm);
> +
> /*
> * Tell exec and rebind worker they need to repin and rebind this
> * userptr.
> @@ -105,27 +141,32 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> */
> dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
> DMA_RESV_USAGE_BOOKKEEP);
> - dma_resv_for_each_fence_unlocked(&cursor, fence)
> + dma_resv_for_each_fence_unlocked(&cursor, fence) {
> dma_fence_enable_sw_signaling(fence);
> + if (signaled && !dma_fence_is_signaled(fence))
> + signaled = false;
> + }
> dma_resv_iter_end(&cursor);
>
> - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> - DMA_RESV_USAGE_BOOKKEEP,
> - false, MAX_SCHEDULE_TIMEOUT);
> - XE_WARN_ON(err <= 0);
> -
> - if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> - err = xe_vm_invalidate_vma(vma);
> - XE_WARN_ON(err);
> + /*
> + * Only one caller at a time can use the multi-pass state.
> + * If it's already in use, or all fences are already signaled,
> + * proceed directly to invalidation without deferring.
> + */
> + if (signaled || userptr->finish_inuse) {
> + xe_vma_userptr_do_inval(vm, uvma, false);
> + return NULL;
> }
>
> - drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> - xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> + userptr->finish_inuse = true;
> +
> + return &userptr->finish;
> }
>
> -static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
> - const struct mmu_notifier_range *range,
> - unsigned long cur_seq)
> +static bool xe_vma_userptr_invalidate_start(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range *range,
> + unsigned long cur_seq,
> + struct mmu_interval_notifier_finish **p_finish)
> {
> struct xe_userptr_vma *uvma = container_of(mni, typeof(*uvma), userptr.notifier);
> struct xe_vma *vma = &uvma->vma;
> @@ -138,21 +179,40 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
> return false;
>
> vm_dbg(&xe_vma_vm(vma)->xe->drm,
> - "NOTIFIER: addr=0x%016llx, range=0x%016llx",
> + "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
> xe_vma_start(vma), xe_vma_size(vma));
>
> down_write(&vm->svm.gpusvm.notifier_lock);
> mmu_interval_set_seq(mni, cur_seq);
>
> - __vma_userptr_invalidate(vm, uvma);
> + *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> +
> up_write(&vm->svm.gpusvm.notifier_lock);
> - trace_xe_vma_userptr_invalidate_complete(vma);
> + if (!*p_finish)
> + trace_xe_vma_userptr_invalidate_complete(vma);
>
> return true;
> }
>
> +static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
> +{
> + struct xe_userptr_vma *uvma = container_of(finish, typeof(*uvma), userptr.finish);
> + struct xe_vma *vma = &uvma->vma;
> + struct xe_vm *vm = xe_vma_vm(vma);
> +
> + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> + "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
> + xe_vma_start(vma), xe_vma_size(vma));
> +
> + down_write(&vm->svm.gpusvm.notifier_lock);
> + xe_vma_userptr_do_inval(vm, uvma, true);
> + up_write(&vm->svm.gpusvm.notifier_lock);
> + trace_xe_vma_userptr_invalidate_complete(vma);
> +}
> +
> static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
> - .invalidate = vma_userptr_invalidate,
> + .invalidate_start = xe_vma_userptr_invalidate_start,
> + .invalidate_finish = xe_vma_userptr_invalidate_finish,
> };
>
> #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> @@ -164,6 +224,7 @@ static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
> */
> void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> {
> + static struct mmu_interval_notifier_finish *finish;
> struct xe_vm *vm = xe_vma_vm(&uvma->vma);
>
> /* Protect against concurrent userptr pinning */
> @@ -179,7 +240,10 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> if (!mmu_interval_read_retry(&uvma->userptr.notifier,
> uvma->userptr.pages.notifier_seq))
> uvma->userptr.pages.notifier_seq -= 2;
> - __vma_userptr_invalidate(vm, uvma);
> +
> + finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> + if (finish)
> + xe_vma_userptr_do_inval(vm, uvma, true);
> }
> #endif
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
> index ef801234991e..e1830c2f5fd2 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.h
> +++ b/drivers/gpu/drm/xe/xe_userptr.h
> @@ -56,7 +56,19 @@ struct xe_userptr {
> * @notifier: MMU notifier for user pointer (invalidation call back)
> */
> struct mmu_interval_notifier notifier;
> -
> + /**
> + * @finish: MMU notifier finish structure for two-pass invalidation.
> + * Embedded here to avoid allocation in the notifier callback.
> + * Protected by struct xe_vm::svm.gpusvm.notifier_lock in write mode
> + * alternatively by the same lock in read mode *and* the vm resv held.
> + */
> + struct mmu_interval_notifier_finish finish;
> + /**
> + * @finish_inuse: Whether @finish is currently in use by an in-progress
> + * two-pass invalidation.
> + * Protected using the same locking as @finish.
> + */
> + bool finish_inuse;
> /**
> * @initial_bind: user pointer has been bound at least once.
> * write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
> --
> 2.53.0
>
* Re: [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-03 13:34 ` [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
@ 2026-03-03 18:13 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
1 sibling, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2026-03-03 18:13 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Tue, Mar 03, 2026 at 02:34:08PM +0100, Thomas Hellström wrote:
> xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests to
> all GTs in a tile mask and then immediately waits for them to complete
> before returning. This is fine for the existing callers, but a
> subsequent patch will need to defer the wait in order to overlap TLB
> invalidations across multiple VMAs.
>
> Introduce xe_tlb_inval_range_tilemask_submit() and
> xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
> halves respectively. The batch of fences is carried in the new
> xe_tlb_inval_batch structure. Remove xe_vm_range_tilemask_tlb_inval()
> and convert all three call sites to the new API.
>
> v3:
> - Don't wait on TLB invalidation batches if the corresponding batch
> submit returns an error. (Matt Brost)
> - s/_batch/batch/ (Matt Brost)
>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 8 ++-
> drivers/gpu/drm/xe/xe_tlb_inval.c | 84 +++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
> drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
> drivers/gpu/drm/xe/xe_vm.c | 69 +++-----------------
> drivers/gpu/drm/xe/xe_vm.h | 3 -
> drivers/gpu/drm/xe/xe_vm_madvise.c | 10 ++-
> drivers/gpu/drm/xe/xe_vm_types.h | 1 +
> 8 files changed, 127 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 002b6c22ad3f..a91c84487a67 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -19,6 +19,7 @@
> #include "xe_pt.h"
> #include "xe_svm.h"
> #include "xe_tile.h"
> +#include "xe_tlb_inval.h"
> #include "xe_ttm_vram_mgr.h"
> #include "xe_vm.h"
> #include "xe_vm_types.h"
> @@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> const struct mmu_notifier_range *mmu_range)
> {
> struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> + struct xe_tlb_inval_batch batch;
> struct xe_device *xe = vm->xe;
> struct drm_gpusvm_range *r, *first;
> struct xe_tile *tile;
> @@ -276,8 +278,10 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>
> xe_device_wmb(xe);
>
> - err = xe_vm_range_tilemask_tlb_inval(vm, adj_start, adj_end, tile_mask);
> - WARN_ON_ONCE(err);
> + err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid, adj_start, adj_end,
> + tile_mask, &batch);
> + if (!WARN_ON_ONCE(err))
> + xe_tlb_inval_batch_wait(&batch);
>
> range_notifier_event_end:
> r = first;
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c b/drivers/gpu/drm/xe/xe_tlb_inval.c
> index 933f30fb617d..10dcd4abb00f 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval.c
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
> @@ -486,3 +486,87 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval)
> guard(spinlock_irq)(&tlb_inval->pending_lock);
> return list_is_singular(&tlb_inval->pending_fences);
> }
> +
> +/**
> + * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB invalidation batch
> + * @batch: Batch of TLB invalidation fences to wait on
> + *
> + * Waits for every fence in @batch to signal, then resets @batch so it can be
> + * reused for a subsequent invalidation.
> + */
> +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
> +{
> + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> + unsigned int i;
> +
> + for (i = 0; i < batch->num_fences; ++i)
> + xe_tlb_inval_fence_wait(fence++);
> +
> + batch->num_fences = 0;
> +}
> +
> +/**
> + * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations for an
> + * address range on a tile mask
> + * @xe: The xe device
> + * @asid: Address space ID
> + * @start: start address
> + * @end: end address
> + * @tile_mask: mask for which gt's issue tlb invalidation
> + * @batch: Batch of tlb invalidate fences
> + *
> + * Issue a range based TLB invalidation for gt's in tilemask
> + * If the function returns an error, there is no need to call
> + * xe_tlb_inval_batch_wait() on @batch.
> + *
> + * Returns 0 for success, negative error code otherwise.
> + */
> +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
> + u64 start, u64 end, u8 tile_mask,
> + struct xe_tlb_inval_batch *batch)
> +{
> + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> + struct xe_tile *tile;
> + u32 fence_id = 0;
> + u8 id;
> + int err;
> +
> + batch->num_fences = 0;
> + if (!tile_mask)
> + return 0;
> +
> + for_each_tile(tile, xe, id) {
> + if (!(tile_mask & BIT(id)))
> + continue;
> +
> + xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
> + &fence[fence_id], true);
> +
> + err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
> + &fence[fence_id], start, end,
> + asid, NULL);
> + if (err)
> + goto wait;
> + ++fence_id;
> +
> + if (!tile->media_gt)
> + continue;
> +
> + xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
> + &fence[fence_id], true);
> +
> + err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
> + &fence[fence_id], start, end,
> + asid, NULL);
> + if (err)
> + goto wait;
> + ++fence_id;
> + }
> +
> +wait:
> + batch->num_fences = fence_id;
> + if (err)
> + xe_tlb_inval_batch_wait(batch);
> +
> + return err;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h b/drivers/gpu/drm/xe/xe_tlb_inval.h
> index 62089254fa23..a76b7823a5f2 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval.h
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
> @@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct xe_tlb_inval *tlb_inval, int seqno);
>
> bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
>
> +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
> + u64 start, u64 end, u8 tile_mask,
> + struct xe_tlb_inval_batch *batch);
> +
> +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
> +
> #endif /* _XE_TLB_INVAL_ */
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> index 3b089f90f002..3d1797d186fd 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> @@ -9,6 +9,8 @@
> #include <linux/workqueue.h>
> #include <linux/dma-fence.h>
>
> +#include "xe_device_types.h"
> +
> struct drm_suballoc;
> struct xe_tlb_inval;
>
> @@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
> ktime_t inval_time;
> };
>
> +/**
> + * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
> + *
> + * Holds one fence per GT covered by a TLB invalidation request.
> + */
> +struct xe_tlb_inval_batch {
> + /** @fence: per-GT TLB invalidation fences */
> + struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> + /** @num_fences: number of valid entries in @fence */
> + unsigned int num_fences;
> +};
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 548b0769b3ef..a3c2e8cefec7 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
> dma_resv_unlock(xe_vm_resv(vm));
> }
>
> -/**
> - * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on this tilemask for an
> - * address range
> - * @vm: The VM
> - * @start: start address
> - * @end: end address
> - * @tile_mask: mask for which gt's issue tlb invalidation
> - *
> - * Issue a range based TLB invalidation for gt's in tilemask
> - *
> - * Returns 0 for success, negative error code otherwise.
> - */
> -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> - u64 end, u8 tile_mask)
> -{
> - struct xe_tlb_inval_fence
> - fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> - struct xe_tile *tile;
> - u32 fence_id = 0;
> - u8 id;
> - int err;
> -
> - if (!tile_mask)
> - return 0;
> -
> - for_each_tile(tile, vm->xe, id) {
> - if (!(tile_mask & BIT(id)))
> - continue;
> -
> - xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
> - &fence[fence_id], true);
> -
> - err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
> - &fence[fence_id], start, end,
> - vm->usm.asid, NULL);
> - if (err)
> - goto wait;
> - ++fence_id;
> -
> - if (!tile->media_gt)
> - continue;
> -
> - xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
> - &fence[fence_id], true);
> -
> - err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
> - &fence[fence_id], start, end,
> - vm->usm.asid, NULL);
> - if (err)
> - goto wait;
> - ++fence_id;
> - }
> -
> -wait:
> - for (id = 0; id < fence_id; ++id)
> - xe_tlb_inval_fence_wait(&fence[id]);
> -
> - return err;
> -}
> -
> /**
> * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> * @vma: VMA to invalidate
> @@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> {
> struct xe_device *xe = xe_vma_vm(vma)->xe;
> struct xe_vm *vm = xe_vma_vm(vma);
> + struct xe_tlb_inval_batch batch;
> struct xe_tile *tile;
> u8 tile_mask = 0;
> int ret = 0;
> @@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
>
> xe_device_wmb(xe);
>
> - ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma), xe_vma_start(vma),
> - xe_vma_end(vma), tile_mask);
> + ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
> + xe_vma_start(vma), xe_vma_end(vma),
> + tile_mask, &batch);
>
> /* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
> WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
>
> + if (!ret)
> + xe_tlb_inval_batch_wait(&batch);
> +
> return ret;
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index f849e369432b..62f4b6fec0bc 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> struct xe_svm_range *range);
>
> -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> - u64 end, u8 tile_mask);
> -
> int xe_vm_invalidate_vma(struct xe_vma *vma);
>
> int xe_vm_validate_protected(struct xe_vm *vm);
> diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c b/drivers/gpu/drm/xe/xe_vm_madvise.c
> index 95bf53cc29e3..02daf8a93044 100644
> --- a/drivers/gpu/drm/xe/xe_vm_madvise.c
> +++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
> @@ -12,6 +12,7 @@
> #include "xe_pat.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> +#include "xe_tlb_inval.h"
>
> struct xe_vmas_in_madvise_range {
> u64 addr;
> @@ -235,13 +236,20 @@ static u8 xe_zap_ptes_in_madvise_range(struct xe_vm *vm, u64 start, u64 end)
> static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64 start, u64 end)
> {
> u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start, end);
> + struct xe_tlb_inval_batch batch;
> + int err;
>
> if (!tile_mask)
> return 0;
>
> xe_device_wmb(vm->xe);
>
> - return xe_vm_range_tilemask_tlb_inval(vm, start, end, tile_mask);
> + err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm->usm.asid, start, end,
> + tile_mask, &batch);
> + if (!err)
> + xe_tlb_inval_batch_wait(&batch);
> +
> + return err;
> }
>
> static bool madvise_args_are_sane(struct xe_device *xe, const struct drm_xe_madvise *args)
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 1f6f7e30e751..de6544165cfa 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -18,6 +18,7 @@
> #include "xe_device_types.h"
> #include "xe_pt_types.h"
> #include "xe_range_fence.h"
> +#include "xe_tlb_inval_types.h"
> #include "xe_userptr.h"
>
> struct drm_pagemap;
> --
> 2.53.0
>
* Claude review: Two-pass MMU interval notifiers
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
` (3 preceding siblings ...)
2026-03-03 13:34 ` [PATCH v3 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
@ 2026-03-03 21:09 ` Claude Code Review Bot
4 siblings, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 21:09 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: Two-pass MMU interval notifiers
Author: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Patches: 4
Reviewed: 2026-03-04T07:09:16.400260
---
This is a well-structured v3 series from Thomas Hellström that adds two-pass MMU interval notifier support to the mm core and then uses it in the Xe driver to pipeline TLB invalidations across multi-GPU setups. The motivation is clear: in multi-GPU scenarios, the current single-pass model serializes GPU TLB flush submit+wait per notifier, which doesn't scale. The two-pass approach allows submitting flushes to all GPUs first, then waiting for them all in a second pass.
The design is sound overall: patch 1 provides the mm core infrastructure, patch 2 converts xe userptr to the two-pass model with a single embedded finish struct and fallback to synchronous operation, patch 3 refactors TLB invalidation into submit/wait halves, and patch 4 extends the deferred path to also pipeline the TLB flush itself.
I found one clear bug (static local variable in `xe_vma_userptr_force_invalidate`) and have several concerns about the increasing complexity of the state machine in patches 2+4 and the growing per-userptr struct size. The mm core patch (1) looks correct.
---
Generated by Claude Code Patch Reviewer
* Claude review: mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-03 13:34 ` [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
@ 2026-03-03 21:09 ` Claude Code Review Bot
0 siblings, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 21:09 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the core mm infrastructure change. The design uses `invalidate_start` + `invalidate_finish` callbacks in the ops struct, with a lockless list to collect finish entries during the interval tree walk.
**Positive aspects:**
- Clean separation: drivers that don't need two-pass are unaffected (the `invalidate` callback remains)
- Using a lockless list (`llist`) to avoid a second interval tree walk is a good optimization
- The `WARN_ON_ONCE` for `invalidate_start` without `invalidate_finish` is a useful safety check (added per Matt Brost's v3 review)
- Good documentation in the `struct mmu_interval_notifier_finish` kerneldoc about GFP_NOWAIT allocation considerations
**Concern — duplicated dispatching logic:**
The if/else dispatch between `invalidate_start` and the legacy `invalidate` callback is duplicated verbatim in `mn_itree_release` and `mn_itree_invalidate`:
```c
if (interval_sub->ops->invalidate_start) {
struct mmu_interval_notifier_finish *finish = NULL;
ret = interval_sub->ops->invalidate_start(..., &finish);
if (ret && finish) {
finish->notifier = interval_sub;
__llist_add(&finish->link, &finish_passes);
}
} else {
ret = interval_sub->ops->invalidate(interval_sub, ...);
}
```
This is an exact copy-paste across both functions. Consider extracting a helper like `mn_itree_inv_call()` to reduce the duplication. This is a minor style nit, not a blocker.
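For concreteness, a minimal sketch of what that helper might look like, with the parameter list assumed from the quoted call sites (the actual patch may differ):
```c
static bool mn_itree_inv_call(struct mmu_interval_notifier *interval_sub,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq,
			      struct llist_head *finish_passes)
{
	if (interval_sub->ops->invalidate_start) {
		struct mmu_interval_notifier_finish *finish = NULL;
		bool ret;

		/* Two-pass notifier: collect the finish entry, if any. */
		ret = interval_sub->ops->invalidate_start(interval_sub, range,
							  cur_seq, &finish);
		if (ret && finish) {
			finish->notifier = interval_sub;
			__llist_add(&finish->link, finish_passes);
		}
		return ret;
	}

	/* Legacy single-pass notifier. */
	return interval_sub->ops->invalidate(interval_sub, range, cur_seq);
}
```
Both `mn_itree_release` and `mn_itree_invalidate` would then shrink to a single `ret = mn_itree_inv_call(...)` call.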
**Concern — mn_itree_invalidate -EAGAIN path calls finish on partial set:**
In `mn_itree_invalidate`, when a non-blockable callback returns false, the code breaks out of the loop but still calls `mn_itree_finish_pass(&finish_passes)`. This means `invalidate_finish` callbacks will be invoked for notifiers that were already processed, while the iteration was aborted mid-way:
```c
if (!ret) {
if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
err = -EAGAIN;
break;
}
```
followed by:
```c
mn_itree_finish_pass(&finish_passes);
if (err)
mn_itree_inv_end(subscriptions);
```
This is actually correct behavior — the finish pass must still run for all notifiers that *did* successfully start, to complete their work. The comment should perhaps clarify this is intentional.
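If the author agrees, a short comment at the call site would make that explicit; suggested wording (not from the patch):
```c
	/*
	 * Run the finish pass even if the walk was aborted with -EAGAIN:
	 * subscriptions whose invalidate_start already succeeded still
	 * need their deferred work completed.
	 */
	mn_itree_finish_pass(&finish_passes);
```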
**Minor:** The `llist_reverse_order()` call in `mn_itree_finish_pass` ensures FIFO ordering for the finish callbacks (matching the iteration order of the tree walk). This is a nice touch for deterministic behavior.
Overall: **Looks correct.** The WARN_ON_ONCE validation and the v2/v3 refinements are welcome.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-03 13:34 ` [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
2026-03-03 18:10 ` Matthew Brost
@ 2026-03-03 21:09 ` Claude Code Review Bot
1 sibling, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 21:09 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This converts the xe userptr notifier from a single `invalidate` callback to the new `invalidate_start`/`invalidate_finish` two-pass model.
**BUG — static local variable in `xe_vma_userptr_force_invalidate`:**
```c
void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
{
static struct mmu_interval_notifier_finish *finish;
```
This declares `finish` as `static`, meaning it's shared across all calls to this function. This is almost certainly wrong — it should be a plain local variable. Using `static` here would cause data races if this function were ever called concurrently (even though it's under `CONFIG_DRM_XE_USERPTR_INVAL_INJECT` and may currently be single-threaded in testing paths, `static` is still incorrect). Should be:
```c
struct mmu_interval_notifier_finish *finish;
```
**Design observation — `finish_inuse` flag:**
The `finish_inuse` bool is used to prevent two concurrent invalidations from both trying to use the embedded `struct mmu_interval_notifier_finish`. When `finish_inuse` is set, the second caller falls back to the synchronous single-pass path. This is a pragmatic approach that avoids dynamic allocation.
However, the lock protection for `finish_inuse` should be reviewed carefully. It's documented as:
```
Protected by struct xe_vm::svm.gpusvm.notifier_lock in write mode
alternatively by the same lock in read mode *and* the vm resv held.
```
In `xe_vma_userptr_invalidate_start`, the notifier_lock is held in write mode when `xe_vma_userptr_invalidate_pass1` reads and sets `finish_inuse`, so that's correct. In `xe_vma_userptr_invalidate_finish`, the notifier_lock is also taken in write mode, so clearing `finish_inuse` via `xe_vma_userptr_do_inval(vm, uvma, true)` is also correct.
**Locking assertion `xe_userptr_assert_in_notifier`:**
```c
static void xe_userptr_assert_in_notifier(struct xe_vm *vm)
{
lockdep_assert(lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 0) ||
(lockdep_is_held(&vm->lock) &&
lockdep_is_held_type(&vm->svm.gpusvm.notifier_lock, 1) &&
dma_resv_held(xe_vm_resv(vm))));
}
```
This asserts either: notifier_lock in write mode, OR (vm->lock held AND notifier_lock in read mode AND resv held). This matches the documented locking for the state members and looks correct.
**Minor nit:** `xe_vma_userptr_do_inval` unconditionally calls `dma_resv_wait_timeout` with `MAX_SCHEDULE_TIMEOUT` and `intr=false`, so this path always sleeps. This was inherited from the original code. It is still safe because the `mmu_notifier_range_blockable` check happens earlier, in `xe_vma_userptr_invalidate_start`, which bails out with `false` for non-blockable ranges.
Overall: **One bug (static variable), otherwise looks correct.**
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe: Split TLB invalidation into submit and wait steps
2026-03-03 13:34 ` [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
2026-03-03 18:13 ` Matthew Brost
@ 2026-03-03 21:09 ` Claude Code Review Bot
1 sibling, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 21:09 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is a clean refactor that splits `xe_vm_range_tilemask_tlb_inval` into `xe_tlb_inval_range_tilemask_submit` + `xe_tlb_inval_batch_wait`, and converts all callers.
**`xe_tlb_inval_batch` struct:**
```c
struct xe_tlb_inval_batch {
struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
unsigned int num_fences;
};
```
With `XE_MAX_TILES_PER_DEVICE=2` and `XE_MAX_GT_PER_TILE=2`, this is an array of 4 `xe_tlb_inval_fence` structs. Each fence contains a `struct dma_fence` (~72-80 bytes depending on config), a pointer, list_head, int, and ktime_t. The batch struct is roughly ~450-500 bytes. This is fine on the stack (the original code also had a stack-allocated array of the same size). However, in patch 4 this gets **embedded per-userptr**, which adds this to every `struct xe_userptr`. That's a significant per-object overhead worth noting.
**Error handling in `xe_tlb_inval_range_tilemask_submit`:**
```c
wait:
batch->num_fences = fence_id;
if (err)
xe_tlb_inval_batch_wait(batch);
return err;
```
When `xe_tlb_inval_range` returns an error, the function waits for all previously-submitted fences and returns the error. The doc says "If the function returns an error, there is no need to call `xe_tlb_inval_batch_wait()` on @batch." This is consistent — on error, the internal wait is done and `num_fences` is set to 0 by `xe_tlb_inval_batch_wait`.
**Caller conversion — `xe_vm_invalidate_vma`:**
The original code did the `WRITE_ONCE(vma->tile_invalidated, ...)` after the TLB invalidation completed. The new code does it *after submit but before wait*:
```c
ret = xe_tlb_inval_range_tilemask_submit(xe, ..., &batch);
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
if (!ret)
xe_tlb_inval_batch_wait(&batch);
```
This is a behavioral change: `tile_invalidated` is now set before TLB invalidation completes. This is okay if `tile_invalidated` is a "we've initiated invalidation" flag rather than "invalidation is complete". Let me note that this reordering was introduced in this patch and is kept in patch 4 when `xe_vm_invalidate_vma` is refactored into `xe_vm_invalidate_vma_submit` + `xe_vm_invalidate_vma`. The semantics should be documented clearly, since `READ_ONCE` consumers of `tile_invalidated` may now observe the flag being set before the TLB flush is actually done.
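A clarifying comment at the write would pin those semantics down; suggested wording (hypothetical, not in the patch):
```c
	/*
	 * Set before the TLB batch is waited on: tile_invalidated means
	 * "invalidation initiated", not "invalidation complete". The
	 * READ_ONCE in xe_vm_has_valid_gpu_mapping() must tolerate an
	 * in-flight flush.
	 */
	WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
```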
**`xe_tlb_inval_types.h` includes `xe_device_types.h`:**
```c
#include "xe_device_types.h"
```
This is added to get the `XE_MAX_TILES_PER_DEVICE` define for the array size. This is a heavyweight include for a `_types.h` header and could potentially create circular dependency issues. Consider forward-declaring or using a constant directly instead.
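One lighter-weight shape, sketched under the assumption that the array bound is the only thing needed from the device header (`XE_TLB_INVAL_MAX_FENCES` is a hypothetical name):
```c
/* In xe_tlb_inval_types.h: local bound, no xe_device_types.h include. */
#define XE_TLB_INVAL_MAX_FENCES 4

struct xe_tlb_inval_batch {
	struct xe_tlb_inval_fence fence[XE_TLB_INVAL_MAX_FENCES];
	unsigned int num_fences;
};

/* In xe_tlb_inval.c, which already includes xe_device_types.h. */
static_assert(XE_TLB_INVAL_MAX_FENCES >=
	      XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE);
```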
Overall: **Correct refactor, minor concern about `tile_invalidated` reordering semantics and header weight.**
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-03 13:34 ` [PATCH v3 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
@ 2026-03-03 21:09 ` Claude Code Review Bot
0 siblings, 0 replies; 13+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 21:09 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the most complex patch, adding a *third* effective pass: pass1 defers fence wait, finish-pass does fence wait + submits TLB invalidation, then another finish-pass waits for TLB invalidation. The cover letter documents three flows:
```
pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
pass1 (fences done) -> do_inval -> invalidate_finish
-> complete_tlb_inval (deferred TLB)
pass1 (finish occupied) -> do_inval (sync TLB, inline)
```
**Concern — three-pass through a two-pass API:**
The mmu_notifier infrastructure in patch 1 is designed for two passes. This patch achieves a third pass by having `xe_vma_userptr_do_inval` (called from `invalidate_finish`) return *another* `mmu_interval_notifier_finish*`, effectively re-queuing the notifier for another finish pass. However, looking at the mm core code in `mn_itree_finish_pass`:
```c
static void mn_itree_finish_pass(struct llist_head *finish_passes)
{
struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
struct mmu_interval_notifier_finish *f, *next;
llist_for_each_entry_safe(f, next, first, link)
f->notifier->ops->invalidate_finish(f);
}
```
This drains the list and calls `invalidate_finish` for each entry. There's **no mechanism** for `invalidate_finish` to re-queue itself for yet another pass — the list is already drained when finish runs. Looking more carefully at `xe_vma_userptr_do_inval` in this patch:
```c
if (!userptr->finish_inuse) {
userptr->finish_inuse = true;
userptr->tlb_inval_submitted = true;
err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
XE_WARN_ON(err);
return &userptr->finish;
}
```
This returns `&userptr->finish` from within `xe_vma_userptr_do_inval`, which was called from `xe_vma_userptr_invalidate_finish`. But `invalidate_finish` in the mm core doesn't look at any return value — it's `void`:
```c
void (*invalidate_finish)(struct mmu_interval_notifier_finish *finish);
```
So the return value from `xe_vma_userptr_do_inval` in the `invalidate_finish` path goes... where? Let's trace the call:
```c
static void xe_vma_userptr_invalidate_finish(...)
{
down_write(&vm->svm.gpusvm.notifier_lock);
if (uvma->userptr.tlb_inval_submitted)
xe_vma_userptr_complete_tlb_inval(vm, uvma);
else
xe_vma_userptr_do_inval(vm, uvma, true); // return value IGNORED
up_write(&vm->svm.gpusvm.notifier_lock);
}
```
The return value of `xe_vma_userptr_do_inval` is **ignored** in `invalidate_finish`. So when `xe_vma_userptr_do_inval` is called with `is_deferred=true` and `!finish_inuse`, it sets `finish_inuse=true`, `tlb_inval_submitted=true`, submits the TLB invalidation, and returns `&userptr->finish`. But nobody picks up this return value to schedule another finish pass.
**This means the TLB invalidation batch submitted in this path is never waited on.** The `xe_vma_userptr_complete_tlb_inval` function would need to be called, but it's only called when `tlb_inval_submitted` is true at the *entry* of `xe_vma_userptr_invalidate_finish`. Since the finish callback is only invoked once per `__llist_add`, this deferred TLB path appears to leak — `finish_inuse` remains true, `tlb_inval_submitted` remains true, and the TLB batch is never waited on. The next notifier call would see `finish_inuse=true` and fall back to the synchronous path, which works, but `tlb_inval_submitted` being stale would cause `xe_vma_userptr_invalidate_finish` to call `xe_vma_userptr_complete_tlb_inval` on a stale/reused batch.
Wait — let me re-read more carefully. The `is_deferred=true` path enters with `finish_inuse` already set (the assert checks it):
```c
if (is_deferred)
xe_assert(vm->xe, userptr->finish_inuse && !userptr->tlb_inval_submitted);
```
So when `xe_vma_userptr_do_inval` is called with `is_deferred=true`, `finish_inuse` is already true. Then the `if (!userptr->finish_inuse)` check is false, so it falls through to the synchronous `xe_vm_invalidate_vma()` path. So actually the deferred TLB submit via `do_inval` **only happens in the non-deferred `is_deferred=false` path** where `finish_inuse` is initially false. Let me re-trace:
In `xe_vma_userptr_invalidate_pass1`, when `signaled || finish_inuse`:
```c
if (signaled || userptr->finish_inuse)
return xe_vma_userptr_do_inval(vm, uvma, false); // is_deferred=false
```
Here `is_deferred=false`. If `!finish_inuse` (i.e., we entered because `signaled` was true), then `xe_vma_userptr_do_inval` may return `&userptr->finish` after setting `finish_inuse=true` and `tlb_inval_submitted=true`. This return value goes back to `xe_vma_userptr_invalidate_pass1`, then to `xe_vma_userptr_invalidate_start`, which sets `*p_finish` to it. This **does** schedule the finish pass in the mm core via `__llist_add`. Then `xe_vma_userptr_invalidate_finish` is called, sees `tlb_inval_submitted=true`, and calls `xe_vma_userptr_complete_tlb_inval`. This path is **correct**.
The other case: when `finish_inuse` is already true (concurrent invalidation), `xe_vma_userptr_do_inval(vm, uvma, false)` with `finish_inuse=true` falls through to sync `xe_vm_invalidate_vma` and returns NULL. Correct.
And the deferred path from pass1: when `!signaled && !finish_inuse`, pass1 returns `&userptr->finish` directly, finish is called, and `do_inval` with `is_deferred=true` does the sync path (since `finish_inuse` is already true). So the deferred TLB only happens in the "fences already signaled but TLB flush needed" case. That makes sense — if fences are done, you can immediately submit TLB flush and defer the wait.
After re-analysis: **The logic is correct but extremely convoluted.** The three interleaved flows with the `finish_inuse` and `tlb_inval_submitted` booleans create a complex state machine that's hard to reason about. The assertions help, but this would benefit from a state diagram in a comment.
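Something like the following, reconstructed from the flows above (a sketch, not authoritative; the flag clearing in the second flow is assumed to live in `complete_tlb_inval`):
```c
/*
 * Invalidation state machine (finish_inuse / tlb_inval_submitted):
 *
 * pass1, fences pending, finish free:
 *	finish_inuse = true; return &finish
 *	-> invalidate_finish: wait fences, sync TLB inval,
 *	   finish_inuse = false
 *
 * pass1, fences signaled, finish free:
 *	do_inval: finish_inuse = tlb_inval_submitted = true,
 *	submit TLB inval; return &finish
 *	-> invalidate_finish: complete_tlb_inval waits the batch
 *	   and clears both flags
 *
 * pass1, finish already in use (concurrent invalidation):
 *	do_inval: fully synchronous; return NULL
 */
```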
**Per-userptr struct bloat:**
Patch 4 embeds `struct xe_tlb_inval_batch` (~450 bytes) into every `struct xe_userptr`. This is added to the already-present `struct mmu_interval_notifier_finish` (from patch 2). For systems with many userptr VMAs, this is significant memory overhead for a feature that only activates in fault mode with multi-GPU. Consider whether the batch could be dynamically allocated (with fallback to sync) instead of embedded.
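A rough sketch of that idea for `xe_vma_userptr_invalidate_pass1`, assuming `inval_batch` becomes a pointer that `complete_tlb_inval` frees (`GFP_NOWAIT` because this runs in notifier context; names otherwise as in the quoted code):
```c
	/* Hypothetical: only take the deferred path if a batch can be
	 * allocated atomically; otherwise stay fully synchronous. */
	struct xe_tlb_inval_batch *batch = kzalloc(sizeof(*batch), GFP_NOWAIT);

	if (!batch)
		return xe_vma_userptr_do_inval(vm, uvma, false); /* sync */

	userptr->inval_batch = batch; /* freed in complete_tlb_inval() */
```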
**`xe_vma_userptr_force_invalidate` in patch 4:**
```c
finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
if (finish)
finish = xe_vma_userptr_do_inval(vm, uvma, true);
if (finish)
xe_vma_userptr_complete_tlb_inval(vm, uvma);
```
This still uses the `static` variable from patch 2 (a bug, as noted above). Also, the second `if (finish)` can never be true: `do_inval` with `is_deferred=true` always falls through to the sync path, since `finish_inuse` is already set whenever `is_deferred=true`. The final `xe_vma_userptr_complete_tlb_inval()` call is therefore dead code in the current implementation. The `static` bug and this dead code suggest the force_invalidate path hasn't been fully thought through.
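If reworked, the tail could mirror the dispatch in `xe_vma_userptr_invalidate_finish`, which covers both ways pass1 can hand back the finish pointer (a sketch, with the `static` dropped and the dead branch removed):
```c
	struct mmu_interval_notifier_finish *finish; /* plain local */

	/* ... existing locking asserts and notifier_seq fixup ... */

	finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
	if (finish) {
		if (uvma->userptr.tlb_inval_submitted)
			xe_vma_userptr_complete_tlb_inval(vm, uvma);
		else
			xe_vma_userptr_do_inval(vm, uvma, true);
	}
```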
Overall: **Logic is correct after careful analysis, but the state machine is very complex. Bug inherited from patch 2 (static variable). Per-userptr memory overhead is significant. Dead code in force_invalidate.**
---
## Summary of Issues
| Severity | Patch | Issue |
|----------|-------|-------|
| Bug | 2/4 | `static struct mmu_interval_notifier_finish *finish` in `xe_vma_userptr_force_invalidate` — should be a local variable |
| Medium | 4/4 | ~450 bytes of `xe_tlb_inval_batch` embedded per-userptr for a feature only used in fault mode multi-GPU |
| Medium | 4/4 | Three-pass state machine through a two-pass API is hard to follow; would benefit from a state diagram comment |
| Low | 3/4 | `tile_invalidated` is now set before TLB flush completes — semantics should be documented |
| Low | 3/4 | `xe_tlb_inval_types.h` pulls in heavyweight `xe_device_types.h` for a constant |
| Nit | 1/4 | Duplicated dispatch logic between `mn_itree_release` and `mn_itree_invalidate` could use a helper |
| Nit | 4/4 | Third `if (finish)` in `xe_vma_userptr_force_invalidate` is dead code |
---
Generated by Claude Code Patch Reviewer
end of thread
Thread overview: 13+ messages
2026-03-03 13:34 [PATCH v3 0/4] Two-pass MMU interval notifiers Thomas Hellström
2026-03-03 13:34 ` [PATCH v3 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
2026-03-03 18:10 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
2026-03-03 18:13 ` Matthew Brost
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 13:34 ` [PATCH v3 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
2026-03-03 21:09 ` Claude review: " Claude Code Review Bot
2026-03-03 21:09 ` Claude review: Two-pass MMU interval notifiers Claude Code Review Bot
-- strict thread matches above, loose matches on Subject: below --
2026-03-02 16:32 [PATCH v2 0/4] " Thomas Hellström
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot