* [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-02 16:32 [PATCH v2 0/4] Two-pass MMU interval notifiers Thomas Hellström
@ 2026-03-02 16:32 ` Thomas Hellström
2026-03-02 19:48 ` Matthew Brost
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
2026-03-02 16:32 ` [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
` (3 subsequent siblings)
4 siblings, 2 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 16:32 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Jason Gunthorpe, Andrew Morton,
Simona Vetter, Dave Airlie, Alistair Popple, dri-devel, linux-mm,
linux-kernel, Matthew Brost, Christian König
GPU use-cases for mmu_interval_notifiers with HMM often involve
starting a GPU operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing.
With single-pass notifiers per GPU this doesn't scale in
multi-GPU scenarios. In those scenarios we'd want to first start
preemption or TLB flushing on all GPUs and, as a second pass, wait
for them to complete.
One can do this on a per-driver basis, multiplexing per-driver
notifiers, but that would mean sharing the notifier "user" lock
across all GPUs, which doesn't scale well either, so adding support
for multi-pass in the core appears to be the right choice.
Implement two-pass capability in the mmu_interval_notifier. Use a
linked list for the final passes to minimize the impact for
use-cases that don't need the multi-pass functionality by avoiding
a second interval tree walk, and to be able to easily pass data
between the two passes.
v1:
- Restrict to two passes (Jason Gunthorpe)
- Improve on documentation (Jason Gunthorpe)
- Improve on function naming (Alistair Popple)
v2:
- Include the invalidate_finish() callback in the
struct mmu_interval_notifier_ops.
- Update documentation (GitHub Copilot:claude-sonnet-4.6)
- Use lockless list for list management.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <dri-devel@lists.freedesktop.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>
Assisted-by: GitHub Copilot:claude-sonnet-4.6 # Documentation only.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
include/linux/mmu_notifier.h | 38 +++++++++++++++++++++
mm/mmu_notifier.c | 64 +++++++++++++++++++++++++++++++-----
2 files changed, 93 insertions(+), 9 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..de0e742ea808 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,16 +233,54 @@ struct mmu_notifier {
unsigned int users;
};
+/**
+ * struct mmu_interval_notifier_finish - mmu_interval_notifier two-pass abstraction
+ * @link: List link for the notifiers pending pass list
+ * @notifier: The mmu_interval_notifier for which the finish pass is called.
+ *
+ * Allocate, typically using GFP_NOWAIT in the interval notifier's first pass.
+ * If allocation fails (which is not unlikely under memory pressure), fall back
+ * to single-pass operation. Note that with a large number of notifiers
+ * implementing two passes, allocation with GFP_NOWAIT will become increasingly
+ * likely to fail, so consider implementing a small pool instead of using
+ * kmalloc() allocations.
+ *
+ * If the implementation needs to pass data between the two passes,
+ * the recommended way is to embed struct mmu_interval_notifier_finish into a larger
+ * structure that also contains the data needed to be shared. Keep in mind that
+ * a notifier callback can be invoked in parallel, and each invocation needs its
+ * own struct mmu_interval_notifier_finish.
+ */
+struct mmu_interval_notifier_finish {
+ struct llist_node link;
+ struct mmu_interval_notifier *notifier;
+};
+
/**
* struct mmu_interval_notifier_ops
* @invalidate: Upon return the caller must stop using any SPTEs within this
* range. This function can sleep. Return false only if sleeping
* was required but mmu_notifier_range_blockable(range) is false.
+ * @invalidate_start: Similar to @invalidate, but intended for two-pass notifier
+ * callbacks where the call to @invalidate_start is the first
+ * pass and any struct mmu_interval_notifier_finish pointer
+ * returned in the @finish parameter describes the final pass.
+ * If @finish is %NULL on return, then no final pass will be
+ * called.
+ * @invalidate_finish: Called as the second pass for any notifier that returned
+ * a non-NULL @finish from @invalidate_start. The @finish
+ * pointer passed here is the same one returned by
+ * @invalidate_start.
*/
struct mmu_interval_notifier_ops {
bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
const struct mmu_notifier_range *range,
unsigned long cur_seq);
+ bool (*invalidate_start)(struct mmu_interval_notifier *interval_sub,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq,
+ struct mmu_interval_notifier_finish **finish);
+ void (*invalidate_finish)(struct mmu_interval_notifier_finish *finish);
};
struct mmu_interval_notifier {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..38acd5ef8eb0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -260,6 +260,15 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
}
EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
+static void mn_itree_finish_pass(struct llist_head *finish_passes)
+{
+ struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
+ struct mmu_interval_notifier_finish *f, *next;
+
+ llist_for_each_entry_safe(f, next, first, link)
+ f->notifier->ops->invalidate_finish(f);
+}
+
static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
struct mm_struct *mm)
{
@@ -271,6 +280,7 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
.end = ULONG_MAX,
};
struct mmu_interval_notifier *interval_sub;
+ LLIST_HEAD(finish_passes);
unsigned long cur_seq;
bool ret;
@@ -278,11 +288,27 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
interval_sub;
interval_sub = mn_itree_inv_next(interval_sub, &range)) {
- ret = interval_sub->ops->invalidate(interval_sub, &range,
- cur_seq);
+ if (interval_sub->ops->invalidate_start) {
+ struct mmu_interval_notifier_finish *finish = NULL;
+
+ ret = interval_sub->ops->invalidate_start(interval_sub,
+ &range,
+ cur_seq,
+ &finish);
+ if (ret && finish) {
+ finish->notifier = interval_sub;
+ __llist_add(&finish->link, &finish_passes);
+ }
+
+ } else {
+ ret = interval_sub->ops->invalidate(interval_sub,
+ &range,
+ cur_seq);
+ }
WARN_ON(!ret);
}
+ mn_itree_finish_pass(&finish_passes);
mn_itree_inv_end(subscriptions);
}
@@ -430,7 +456,9 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
const struct mmu_notifier_range *range)
{
struct mmu_interval_notifier *interval_sub;
+ LLIST_HEAD(finish_passes);
unsigned long cur_seq;
+ int err = 0;
for (interval_sub =
mn_itree_inv_start_range(subscriptions, range, &cur_seq);
@@ -438,23 +466,41 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
interval_sub = mn_itree_inv_next(interval_sub, range)) {
bool ret;
- ret = interval_sub->ops->invalidate(interval_sub, range,
- cur_seq);
+ if (interval_sub->ops->invalidate_start) {
+ struct mmu_interval_notifier_finish *finish = NULL;
+
+ ret = interval_sub->ops->invalidate_start(interval_sub,
+ range,
+ cur_seq,
+ &finish);
+ if (ret && finish) {
+ finish->notifier = interval_sub;
+ __llist_add(&finish->link, &finish_passes);
+ }
+
+ } else {
+ ret = interval_sub->ops->invalidate(interval_sub,
+ range,
+ cur_seq);
+ }
if (!ret) {
if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
- goto out_would_block;
+ err = -EAGAIN;
+ break;
}
}
- return 0;
-out_would_block:
+ mn_itree_finish_pass(&finish_passes);
+
/*
* On -EAGAIN the non-blocking caller is not allowed to call
* invalidate_range_end()
*/
- mn_itree_inv_end(subscriptions);
- return -EAGAIN;
+ if (err)
+ mn_itree_inv_end(subscriptions);
+
+ return err;
}
static int mn_hlist_invalidate_range_start(
--
2.53.0
^ permalink raw reply related [flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-02 16:32 ` [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
@ 2026-03-02 19:48 ` Matthew Brost
2026-03-02 21:12 ` Thomas Hellström
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 19:48 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, dri-devel, linux-mm, linux-kernel,
Christian König
On Mon, Mar 02, 2026 at 05:32:45PM +0100, Thomas Hellström wrote:
> GPU use-cases for mmu_interval_notifiers with hmm often involve
> starting a gpu operation and then waiting for it to complete.
> These operations are typically context preemption or TLB flushing.
>
> With single-pass notifiers per GPU this doesn't scale in
> multi-gpu scenarios. In those scenarios we'd want to first start
> preemption- or TLB flushing on all GPUs and as a second pass wait
> for them to complete.
>
> One can do this on per-driver basis multiplexing per-driver
> notifiers but that would mean sharing the notifier "user" lock
> across all GPUs and that doesn't scale well either, so adding support
> for multi-pass in the core appears to be the right choice.
>
> Implement two-pass capability in the mmu_interval_notifier. Use a
> linked list for the final passes to minimize the impact for
> use-cases that don't need the multi-pass functionality by avoiding
> a second interval tree walk, and to be able to easily pass data
> between the two passes.
>
> v1:
> - Restrict to two passes (Jason Gunthorpe)
> - Improve on documentation (Jason Gunthorpe)
> - Improve on function naming (Alistair Popple)
> v2:
> - Include the invalidate_finish() callback in the
> struct mmu_interval_notifier_ops.
> - Update documentation (GitHub Copilot:claude-sonnet-4.6)
> - Use lockless list for list management.
>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
I thought Jason had given a RB on previous revs - did you drop because
enough has changed?
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Cc: <linux-mm@kvack.org>
> Cc: <linux-kernel@vger.kernel.org>
>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6 # Documentation only.
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> include/linux/mmu_notifier.h | 38 +++++++++++++++++++++
> mm/mmu_notifier.c | 64 +++++++++++++++++++++++++++++++-----
> 2 files changed, 93 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 07a2bbaf86e9..de0e742ea808 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -233,16 +233,54 @@ struct mmu_notifier {
> unsigned int users;
> };
>
> +/**
> + * struct mmu_interval_notifier_finish - mmu_interval_notifier two-pass abstraction
> + * @link: List link for the notifiers pending pass list
Lockless list?
> + * @notifier: The mmu_interval_notifier for which the finish pass is called.
> + *
> + * Allocate, typically using GFP_NOWAIT in the interval notifier's first pass.
> + * If allocation fails (which is not unlikely under memory pressure), fall back
> + * to single-pass operation. Note that with a large number of notifiers
> + * implementing two passes, allocation with GFP_NOWAIT will become increasingly
> + * likely to fail, so consider implementing a small pool instead of using
> + * kmalloc() allocations.
> + *
> + * If the implementation needs to pass data between the two passes,
> + * the recommended way is to embed struct mmu_interval_notifier_finish into a larger
> + * structure that also contains the data needed to be shared. Keep in mind that
> + * a notifier callback can be invoked in parallel, and each invocation needs its
> + * own struct mmu_interval_notifier_finish.
> + */
> +struct mmu_interval_notifier_finish {
> + struct llist_node link;
> + struct mmu_interval_notifier *notifier;
> +};
> +
> /**
> * struct mmu_interval_notifier_ops
> * @invalidate: Upon return the caller must stop using any SPTEs within this
> * range. This function can sleep. Return false only if sleeping
> * was required but mmu_notifier_range_blockable(range) is false.
> + * @invalidate_start: Similar to @invalidate, but intended for two-pass notifier
> + * callbacks where the call to @invalidate_start is the first
> + * pass and any struct mmu_interval_notifier_finish pointer
> + * returned in the @finish parameter describes the final pass.
> + * If @finish is %NULL on return, then no final pass will be
> + * called.
> + * @invalidate_finish: Called as the second pass for any notifier that returned
> + * a non-NULL @finish from @invalidate_start. The @finish
> + * pointer passed here is the same one returned by
> + * @invalidate_start.
> */
> struct mmu_interval_notifier_ops {
> bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
> const struct mmu_notifier_range *range,
> unsigned long cur_seq);
> + bool (*invalidate_start)(struct mmu_interval_notifier *interval_sub,
> + const struct mmu_notifier_range *range,
> + unsigned long cur_seq,
> + struct mmu_interval_notifier_finish **finish);
> + void (*invalidate_finish)(struct mmu_interval_notifier_finish *finish);
Should we complain somewhere if a caller registers a notifier with
invalidate_start set but not invalidate_finish?
Matt
> };
>
> struct mmu_interval_notifier {
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index a6cdf3674bdc..38acd5ef8eb0 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -260,6 +260,15 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
> }
> EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
>
> +static void mn_itree_finish_pass(struct llist_head *finish_passes)
> +{
> + struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
> + struct mmu_interval_notifier_finish *f, *next;
> +
> + llist_for_each_entry_safe(f, next, first, link)
> + f->notifier->ops->invalidate_finish(f);
> +}
> +
> static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
> struct mm_struct *mm)
> {
> @@ -271,6 +280,7 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
> .end = ULONG_MAX,
> };
> struct mmu_interval_notifier *interval_sub;
> + LLIST_HEAD(finish_passes);
> unsigned long cur_seq;
> bool ret;
>
> @@ -278,11 +288,27 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
> mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
> interval_sub;
> interval_sub = mn_itree_inv_next(interval_sub, &range)) {
> - ret = interval_sub->ops->invalidate(interval_sub, &range,
> - cur_seq);
> + if (interval_sub->ops->invalidate_start) {
> + struct mmu_interval_notifier_finish *finish = NULL;
> +
> + ret = interval_sub->ops->invalidate_start(interval_sub,
> + &range,
> + cur_seq,
> + &finish);
> + if (ret && finish) {
> + finish->notifier = interval_sub;
> + __llist_add(&finish->link, &finish_passes);
> + }
> +
> + } else {
> + ret = interval_sub->ops->invalidate(interval_sub,
> + &range,
> + cur_seq);
> + }
> WARN_ON(!ret);
> }
>
> + mn_itree_finish_pass(&finish_passes);
> mn_itree_inv_end(subscriptions);
> }
>
> @@ -430,7 +456,9 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
> const struct mmu_notifier_range *range)
> {
> struct mmu_interval_notifier *interval_sub;
> + LLIST_HEAD(finish_passes);
> unsigned long cur_seq;
> + int err = 0;
>
> for (interval_sub =
> mn_itree_inv_start_range(subscriptions, range, &cur_seq);
> @@ -438,23 +466,41 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
> interval_sub = mn_itree_inv_next(interval_sub, range)) {
> bool ret;
>
> - ret = interval_sub->ops->invalidate(interval_sub, range,
> - cur_seq);
> + if (interval_sub->ops->invalidate_start) {
> + struct mmu_interval_notifier_finish *finish = NULL;
> +
> + ret = interval_sub->ops->invalidate_start(interval_sub,
> + range,
> + cur_seq,
> + &finish);
> + if (ret && finish) {
> + finish->notifier = interval_sub;
> + __llist_add(&finish->link, &finish_passes);
> + }
> +
> + } else {
> + ret = interval_sub->ops->invalidate(interval_sub,
> + range,
> + cur_seq);
> + }
> if (!ret) {
> if (WARN_ON(mmu_notifier_range_blockable(range)))
> continue;
> - goto out_would_block;
> + err = -EAGAIN;
> + break;
> }
> }
> - return 0;
>
> -out_would_block:
> + mn_itree_finish_pass(&finish_passes);
> +
> /*
> * On -EAGAIN the non-blocking caller is not allowed to call
> * invalidate_range_end()
> */
> - mn_itree_inv_end(subscriptions);
> - return -EAGAIN;
> + if (err)
> + mn_itree_inv_end(subscriptions);
> +
> + return err;
> }
>
> static int mn_hlist_invalidate_range_start(
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-02 19:48 ` Matthew Brost
@ 2026-03-02 21:12 ` Thomas Hellström
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 21:12 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, dri-devel, linux-mm, linux-kernel,
Christian König
Hi, Matt,
On Mon, 2026-03-02 at 11:48 -0800, Matthew Brost wrote:
Thanks for reviewing.
> On Mon, Mar 02, 2026 at 05:32:45PM +0100, Thomas Hellström wrote:
> > GPU use-cases for mmu_interval_notifiers with hmm often involve
> > starting a gpu operation and then waiting for it to complete.
> > These operations are typically context preemption or TLB flushing.
> >
> > With single-pass notifiers per GPU this doesn't scale in
> > multi-gpu scenarios. In those scenarios we'd want to first start
> > preemption- or TLB flushing on all GPUs and as a second pass wait
> > for them to complete.
> >
> > One can do this on per-driver basis multiplexing per-driver
> > notifiers but that would mean sharing the notifier "user" lock
> > across all GPUs and that doesn't scale well either, so adding
> > support
> > for multi-pass in the core appears to be the right choice.
> >
> > Implement two-pass capability in the mmu_interval_notifier. Use a
> > linked list for the final passes to minimize the impact for
> > use-cases that don't need the multi-pass functionality by avoiding
> > a second interval tree walk, and to be able to easily pass data
> > between the two passes.
> >
> > v1:
> > - Restrict to two passes (Jason Gunthorpe)
> > - Improve on documentation (Jason Gunthorpe)
> > - Improve on function naming (Alistair Popple)
> > v2:
> > - Include the invalidate_finish() callback in the
> > struct mmu_interval_notifier_ops.
> > - Update documentation (GitHub Copilot:claude-sonnet-4.6)
> > - Use lockless list for list management.
> >
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
>
> I thought Jason had given a RB on previous revs - did you drop
> because
> enough has changed?
Yes. In particular the inclusion of
invalidate_finish() in the ops, although IIRC that
was actually suggested by Jason.
>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Dave Airlie <airlied@gmail.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Cc: <linux-mm@kvack.org>
> > Cc: <linux-kernel@vger.kernel.org>
> >
> > Assisted-by: GitHub Copilot:claude-sonnet-4.6 # Documentation only.
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > include/linux/mmu_notifier.h | 38 +++++++++++++++++++++
> > mm/mmu_notifier.c | 64 +++++++++++++++++++++++++++++++-
> > ----
> > 2 files changed, 93 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/mmu_notifier.h
> > b/include/linux/mmu_notifier.h
> > index 07a2bbaf86e9..de0e742ea808 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -233,16 +233,54 @@ struct mmu_notifier {
> > unsigned int users;
> > };
> >
> > +/**
> > + * struct mmu_interval_notifier_finish - mmu_interval_notifier
> > two-pass abstraction
> > + * @link: List link for the notifiers pending pass list
>
> Lockless list?
Sure, can add that.
>
> > + * @notifier: The mmu_interval_notifier for which the finish pass
> > is called.
> > + *
> > + * Allocate, typically using GFP_NOWAIT in the interval notifier's
> > first pass.
> > + * If allocation fails (which is not unlikely under memory
> > pressure), fall back
> > + * to single-pass operation. Note that with a large number of
> > notifiers
> > + * implementing two passes, allocation with GFP_NOWAIT will become
> > increasingly
> > + * likely to fail, so consider implementing a small pool instead
> > of using
> > + * kmalloc() allocations.
> > + *
> > + * If the implementation needs to pass data between the two
> > passes,
> > + * the recommended way is to embed struct
> > mmu_interval_notifier_finish into a larger
> > + * structure that also contains the data needed to be shared. Keep
> > in mind that
> > + * a notifier callback can be invoked in parallel, and each
> > invocation needs its
> > + * own struct mmu_interval_notifier_finish.
> > + */
> > +struct mmu_interval_notifier_finish {
> > + struct llist_node link;
> > + struct mmu_interval_notifier *notifier;
> > +};
> > +
> > /**
> > * struct mmu_interval_notifier_ops
> > * @invalidate: Upon return the caller must stop using any SPTEs
> > within this
> > * range. This function can sleep. Return false only
> > if sleeping
> > * was required but
> > mmu_notifier_range_blockable(range) is false.
> > + * @invalidate_start: Similar to @invalidate, but intended for
> > two-pass notifier
> > + * callbacks where the call to
> > @invalidate_start is the first
> > + * pass and any struct
> > mmu_interval_notifier_finish pointer
> > + * returned in the @finish parameter describes
> > the final pass.
> > + * If @finish is %NULL on return, then no final
> > pass will be
> > + * called.
> > + * @invalidate_finish: Called as the second pass for any notifier
> > that returned
> > + * a non-NULL @finish from @invalidate_start.
> > The @finish
> > + * pointer passed here is the same one
> > returned by
> > + * @invalidate_start.
> > */
> > struct mmu_interval_notifier_ops {
> > bool (*invalidate)(struct mmu_interval_notifier
> > *interval_sub,
> > const struct mmu_notifier_range *range,
> > unsigned long cur_seq);
> > + bool (*invalidate_start)(struct mmu_interval_notifier
> > *interval_sub,
> > + const struct mmu_notifier_range
> > *range,
> > + unsigned long cur_seq,
> > + struct
> > mmu_interval_notifier_finish **finish);
> > + void (*invalidate_finish)(struct
> > mmu_interval_notifier_finish *finish);
>
> Should we complain somewhere if a caller registers a notifier with
> invalidate_start set but not invalidate_finish?
Good idea. I'll update.
Thanks,
Thomas
>
> Matt
>
> > };
> >
> > struct mmu_interval_notifier {
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index a6cdf3674bdc..38acd5ef8eb0 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -260,6 +260,15 @@ mmu_interval_read_begin(struct
> > mmu_interval_notifier *interval_sub)
> > }
> > EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
> >
> > +static void mn_itree_finish_pass(struct llist_head *finish_passes)
> > +{
> > + struct llist_node *first =
> > llist_reverse_order(__llist_del_all(finish_passes));
> > + struct mmu_interval_notifier_finish *f, *next;
> > +
> > + llist_for_each_entry_safe(f, next, first, link)
> > + f->notifier->ops->invalidate_finish(f);
> > +}
> > +
> > static void mn_itree_release(struct mmu_notifier_subscriptions
> > *subscriptions,
> > struct mm_struct *mm)
> > {
> > @@ -271,6 +280,7 @@ static void mn_itree_release(struct
> > mmu_notifier_subscriptions *subscriptions,
> > .end = ULONG_MAX,
> > };
> > struct mmu_interval_notifier *interval_sub;
> > + LLIST_HEAD(finish_passes);
> > unsigned long cur_seq;
> > bool ret;
> >
> > @@ -278,11 +288,27 @@ static void mn_itree_release(struct
> > mmu_notifier_subscriptions *subscriptions,
> > mn_itree_inv_start_range(subscriptions,
> > &range, &cur_seq);
> > interval_sub;
> > interval_sub = mn_itree_inv_next(interval_sub,
> > &range)) {
> > - ret = interval_sub->ops->invalidate(interval_sub,
> > &range,
> > - cur_seq);
> > + if (interval_sub->ops->invalidate_start) {
> > + struct mmu_interval_notifier_finish
> > *finish = NULL;
> > +
> > + ret = interval_sub->ops-
> > >invalidate_start(interval_sub,
> > +
> > &range,
> > +
> > cur_seq,
> > +
> > &finish);
> > + if (ret && finish) {
> > + finish->notifier = interval_sub;
> > + __llist_add(&finish->link,
> > &finish_passes);
> > + }
> > +
> > + } else {
> > + ret = interval_sub->ops-
> > >invalidate(interval_sub,
> > +
> > &range,
> > +
> > cur_seq);
> > + }
> > WARN_ON(!ret);
> > }
> >
> > + mn_itree_finish_pass(&finish_passes);
> > mn_itree_inv_end(subscriptions);
> > }
> >
> > @@ -430,7 +456,9 @@ static int mn_itree_invalidate(struct
> > mmu_notifier_subscriptions *subscriptions,
> > const struct mmu_notifier_range
> > *range)
> > {
> > struct mmu_interval_notifier *interval_sub;
> > + LLIST_HEAD(finish_passes);
> > unsigned long cur_seq;
> > + int err = 0;
> >
> > for (interval_sub =
> > mn_itree_inv_start_range(subscriptions,
> > range, &cur_seq);
> > @@ -438,23 +466,41 @@ static int mn_itree_invalidate(struct
> > mmu_notifier_subscriptions *subscriptions,
> > interval_sub = mn_itree_inv_next(interval_sub,
> > range)) {
> > bool ret;
> >
> > - ret = interval_sub->ops->invalidate(interval_sub,
> > range,
> > - cur_seq);
> > + if (interval_sub->ops->invalidate_start) {
> > + struct mmu_interval_notifier_finish
> > *finish = NULL;
> > +
> > + ret = interval_sub->ops-
> > >invalidate_start(interval_sub,
> > +
> > range,
> > +
> > cur_seq,
> > +
> > &finish);
> > + if (ret && finish) {
> > + finish->notifier = interval_sub;
> > + __llist_add(&finish->link,
> > &finish_passes);
> > + }
> > +
> > + } else {
> > + ret = interval_sub->ops-
> > >invalidate(interval_sub,
> > + range,
> > +
> > cur_seq);
> > + }
> > if (!ret) {
> > if
> > (WARN_ON(mmu_notifier_range_blockable(range)))
> > continue;
> > - goto out_would_block;
> > + err = -EAGAIN;
> > + break;
> > }
> > }
> > - return 0;
> >
> > -out_would_block:
> > + mn_itree_finish_pass(&finish_passes);
> > +
> > /*
> > * On -EAGAIN the non-blocking caller is not allowed to
> > call
> > * invalidate_range_end()
> > */
> > - mn_itree_inv_end(subscriptions);
> > - return -EAGAIN;
> > + if (err)
> > + mn_itree_inv_end(subscriptions);
> > +
> > + return err;
> > }
> >
> > static int mn_hlist_invalidate_range_start(
> > --
> > 2.53.0
> >
^ permalink raw reply [flat|nested] 21+ messages in thread
* Claude review: mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
2026-03-02 16:32 ` [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
2026-03-02 19:48 ` Matthew Brost
@ 2026-03-03 3:05 ` Claude Code Review Bot
1 sibling, 0 replies; 21+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the core infrastructure patch. The approach of using a lockless list (`llist`) to collect finish callbacks during the first-pass tree walk and then iterating them for the second pass is clean and avoids a redundant second interval tree traversal.
**Design observations:**
The documentation in `struct mmu_interval_notifier_finish` is good about recommending GFP_NOWAIT allocation and pools. However, the xe driver (in patch 2) embeds the struct instead, which is the superior approach for a single-notifier-at-a-time design.
**Concern: `invalidate_start` vs `invalidate` mutual exclusivity not enforced:**
The ops struct allows both `invalidate` and `invalidate_start` to be set simultaneously. The code checks `invalidate_start` first (lines 364, 404), so `invalidate` would never be called if both are present. This is fine in practice but a `WARN_ON(ops->invalidate && ops->invalidate_start)` during registration would be defensive.
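The suggested registration-time check can be modeled outside the kernel. Below is a hypothetical userspace sketch of the predicate only (the names `fake_ops` and `ops_valid` are illustrative; the real check would be a `WARN_ON()` in `mmu_interval_notifier_insert()`), also folding in Matthew Brost's point about `invalidate_start` without `invalidate_finish`:

```c
#include <stdbool.h>
#include <stddef.h>

/* Trimmed-down mirror of struct mmu_interval_notifier_ops, for illustration. */
struct fake_ops {
	bool (*invalidate)(void);
	bool (*invalidate_start)(void);
	void (*invalidate_finish)(void);
};

/*
 * Hypothetical registration-time predicate: exactly one of the
 * invalidate / invalidate_start entry points must be set, and
 * invalidate_finish must be present exactly when invalidate_start is.
 */
bool ops_valid(const struct fake_ops *ops)
{
	if (!!ops->invalidate == !!ops->invalidate_start)
		return false;	/* both set, or neither */
	if (!!ops->invalidate_start != !!ops->invalidate_finish)
		return false;	/* start without finish, or finish alone */
	return true;
}

/* Dummy callbacks so the configurations above can be exercised. */
bool dummy_inv(void) { return true; }
bool dummy_start(void) { return true; }
void dummy_fin(void) { }
```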
**Correctness of error path in `mn_itree_invalidate`:**
```c
if (!ret) {
if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
err = -EAGAIN;
break;
}
```
When a non-blockable notifier returns false, the loop breaks, and then:
```c
mn_itree_finish_pass(&finish_passes);
if (err)
mn_itree_inv_end(subscriptions);
```
This correctly calls the finish callbacks for any notifiers that already succeeded before the failure. Good.
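That guarantee — finish callbacks still run for notifiers that succeeded before the failure — can be modeled with a toy userspace loop (names like `toy_invalidate` are illustrative; this is a behavioral sketch, not the kernel code):

```c
#include <stdbool.h>

#define NR_NOTIFIERS 3

/*
 * Toy model of mn_itree_invalidate(): notifier `fail_at` refuses its
 * first pass (a non-blockable notifier returning false); all earlier
 * notifiers have queued a finish callback by then.
 */
int toy_invalidate(int fail_at, bool finished[NR_NOTIFIERS])
{
	int queued[NR_NOTIFIERS], nqueued = 0, err = 0;
	int i;

	for (i = 0; i < NR_NOTIFIERS; i++) {
		if (i == fail_at) {
			err = -1;	/* stands in for -EAGAIN */
			break;
		}
		queued[nqueued++] = i;	/* first pass requested a finish */
	}

	/* Like mn_itree_finish_pass(): runs whether or not err is set. */
	for (i = 0; i < nqueued; i++)
		finished[queued[i]] = true;

	return err;
}
```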
**Minor: llist reversal for ordering:**
```c
struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
```
Since `__llist_add` prepends, the list is in reverse iteration order. The reversal here restores original order, which is the right thing to do. However, the comment in the cover letter says this uses "lockless list for list management" - note that `__llist_add` (double-underscore) is actually *not* the lockless/atomic version, it's the non-atomic helper. This is fine since the list is local to each invocation, but the v2 changelog entry "Use lockless list for list management" is slightly misleading. The structure uses `llist_node`/`llist_head` types but doesn't use the atomic operations.
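Why the reversal restores tree-walk order is easy to see with userspace stand-ins for the two llist helpers (hypothetical `toy_*` names; the real `llist_node` embeds no payload and uses `container_of()`):

```c
#include <stddef.h>

/* Userspace stand-ins for llist prepend and llist_reverse_order(). */
struct toy_node {
	int id;
	struct toy_node *next;
};

/* Like __llist_add(): non-atomic prepend, newest entry becomes the head. */
void toy_prepend(struct toy_node **head, struct toy_node *n)
{
	n->next = *head;
	*head = n;
}

/* Like llist_reverse_order(): reverse the list in place, return new head. */
struct toy_node *toy_reverse(struct toy_node *head)
{
	struct toy_node *prev = NULL;

	while (head) {
		struct toy_node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Prepending 1, 2, 3 yields the list 3 → 2 → 1; reversing it gives back 1 → 2 → 3, i.e. the order in which the interval tree walk queued the finish passes.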
**Nit:** The v2 changelog entry says "Update documentation (GitHub Copilot:claude-sonnet-4.6)" - the Assisted-by tag uses "GitHub Copilot:claude-sonnet-4.6" which is an unusual attribution format.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-02 16:32 [PATCH v2 0/4] Two-pass MMU interval notifiers Thomas Hellström
2026-03-02 16:32 ` [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
@ 2026-03-02 16:32 ` Thomas Hellström
2026-03-02 18:57 ` Matthew Brost
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
2026-03-02 16:32 ` [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
` (2 subsequent siblings)
4 siblings, 2 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 16:32 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
In multi-GPU scenarios, asynchronous GPU job latency is a bottleneck if
each notifier waits for its own GPU before returning. The two-pass
mmu_interval_notifier infrastructure allows deferring the wait to a
second pass, so all GPUs can be signalled in the first pass before
any of them are waited on.
Convert the userptr invalidation to use the two-pass model:
Use invalidate_start as the first pass to mark the VMA for repin and
enable software signaling on the VM reservation fences to start any
GPU work needed for signaling. Fall back to completing the work
synchronously if all fences are already signaled, or if a concurrent
invalidation is already using the embedded finish structure.
Use invalidate_finish as the second pass to wait for the reservation
fences to complete, invalidate the GPU TLB in fault mode, and unmap
the gpusvm pages.
Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
avoid dynamic allocation in the notifier callback. Use a finish_inuse
flag to prevent two concurrent invalidations from using it
simultaneously; fall back to the synchronous path for the second caller.
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_userptr.c | 96 +++++++++++++++++++++++++--------
drivers/gpu/drm/xe/xe_userptr.h | 14 +++++
2 files changed, 88 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index e120323c43bc..440b0a79d16f 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -73,18 +73,42 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
&ctx);
}
-static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
+ bool is_deferred)
{
struct xe_userptr *userptr = &uvma->userptr;
struct xe_vma *vma = &uvma->vma;
- struct dma_resv_iter cursor;
- struct dma_fence *fence;
struct drm_gpusvm_ctx ctx = {
.in_notifier = true,
.read_only = xe_vma_read_only(vma),
};
long err;
+ err = dma_resv_wait_timeout(xe_vm_resv(vm),
+ DMA_RESV_USAGE_BOOKKEEP,
+ false, MAX_SCHEDULE_TIMEOUT);
+ XE_WARN_ON(err <= 0);
+
+ if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+ err = xe_vm_invalidate_vma(vma);
+ XE_WARN_ON(err);
+ }
+
+ if (is_deferred)
+ userptr->finish_inuse = false;
+ drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
+ xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+}
+
+static struct mmu_interval_notifier_finish *
+xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vma *vma = &uvma->vma;
+ struct dma_resv_iter cursor;
+ struct dma_fence *fence;
+ bool signaled = true;
+
/*
* Tell exec and rebind worker they need to repin and rebind this
* userptr.
@@ -105,27 +129,32 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
*/
dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
DMA_RESV_USAGE_BOOKKEEP);
- dma_resv_for_each_fence_unlocked(&cursor, fence)
+ dma_resv_for_each_fence_unlocked(&cursor, fence) {
dma_fence_enable_sw_signaling(fence);
+ if (signaled && !dma_fence_is_signaled(fence))
+ signaled = false;
+ }
dma_resv_iter_end(&cursor);
- err = dma_resv_wait_timeout(xe_vm_resv(vm),
- DMA_RESV_USAGE_BOOKKEEP,
- false, MAX_SCHEDULE_TIMEOUT);
- XE_WARN_ON(err <= 0);
-
- if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
- err = xe_vm_invalidate_vma(vma);
- XE_WARN_ON(err);
+ /*
+ * Only one caller at a time can use the multi-pass state.
+ * If it's already in use, or all fences are already signaled,
+ * proceed directly to invalidation without deferring.
+ */
+ if (signaled || userptr->finish_inuse) {
+ xe_vma_userptr_do_inval(vm, uvma, false);
+ return NULL;
}
- drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
- xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+ userptr->finish_inuse = true;
+
+ return &userptr->finish;
}
-static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
- const struct mmu_notifier_range *range,
- unsigned long cur_seq)
+static bool xe_vma_userptr_invalidate_start(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq,
+ struct mmu_interval_notifier_finish **p_finish)
{
struct xe_userptr_vma *uvma = container_of(mni, typeof(*uvma), userptr.notifier);
struct xe_vma *vma = &uvma->vma;
@@ -138,21 +167,40 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
return false;
vm_dbg(&xe_vma_vm(vma)->xe->drm,
- "NOTIFIER: addr=0x%016llx, range=0x%016llx",
+ "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
xe_vma_start(vma), xe_vma_size(vma));
down_write(&vm->svm.gpusvm.notifier_lock);
mmu_interval_set_seq(mni, cur_seq);
- __vma_userptr_invalidate(vm, uvma);
+ *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
+
up_write(&vm->svm.gpusvm.notifier_lock);
- trace_xe_vma_userptr_invalidate_complete(vma);
+ if (!*p_finish)
+ trace_xe_vma_userptr_invalidate_complete(vma);
return true;
}
+static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
+{
+ struct xe_userptr_vma *uvma = container_of(finish, typeof(*uvma), userptr.finish);
+ struct xe_vma *vma = &uvma->vma;
+ struct xe_vm *vm = xe_vma_vm(vma);
+
+ vm_dbg(&xe_vma_vm(vma)->xe->drm,
+ "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
+ xe_vma_start(vma), xe_vma_size(vma));
+
+ down_write(&vm->svm.gpusvm.notifier_lock);
+ xe_vma_userptr_do_inval(vm, uvma, true);
+ up_write(&vm->svm.gpusvm.notifier_lock);
+ trace_xe_vma_userptr_invalidate_complete(vma);
+}
+
static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
- .invalidate = vma_userptr_invalidate,
+ .invalidate_start = xe_vma_userptr_invalidate_start,
+ .invalidate_finish = xe_vma_userptr_invalidate_finish,
};
#if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
@@ -164,6 +212,7 @@ static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
*/
void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
{
+	struct mmu_interval_notifier_finish *finish;
struct xe_vm *vm = xe_vma_vm(&uvma->vma);
/* Protect against concurrent userptr pinning */
@@ -179,7 +228,10 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
if (!mmu_interval_read_retry(&uvma->userptr.notifier,
uvma->userptr.pages.notifier_seq))
uvma->userptr.pages.notifier_seq -= 2;
- __vma_userptr_invalidate(vm, uvma);
+
+ finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
+ if (finish)
+ xe_vma_userptr_do_inval(vm, uvma, true);
}
#endif
diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
index ef801234991e..4f42db61fd62 100644
--- a/drivers/gpu/drm/xe/xe_userptr.h
+++ b/drivers/gpu/drm/xe/xe_userptr.h
@@ -57,12 +57,26 @@ struct xe_userptr {
*/
struct mmu_interval_notifier notifier;
+ /**
+ * @finish: MMU notifier finish structure for two-pass invalidation.
+ * Embedded here to avoid allocation in the notifier callback.
+ * Protected by @vm::svm.gpusvm.notifier_lock.
+ */
+ struct mmu_interval_notifier_finish finish;
+ /**
+ * @finish_inuse: Whether @finish is currently in use by an in-progress
+ * two-pass invalidation.
+ * Protected by @vm::svm.gpusvm.notifier_lock.
+ */
+ bool finish_inuse;
+
/**
* @initial_bind: user pointer has been bound at least once.
* write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
* read: vm->svm.gpusvm.notifier_lock in write mode or vm->resv held.
*/
bool initial_bind;
+
#if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
u32 divisor;
#endif
--
2.53.0
* Re: [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-02 16:32 ` [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
@ 2026-03-02 18:57 ` Matthew Brost
2026-03-02 21:22 ` Thomas Hellström
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 18:57 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 05:32:46PM +0100, Thomas Hellström wrote:
> In multi-GPU scenarios, asynchronous GPU job latency is a bottleneck if
> each notifier waits for its own GPU before returning. The two-pass
> mmu_interval_notifier infrastructure allows deferring the wait to a
> second pass, so all GPUs can be signalled in the first pass before
> any of them are waited on.
>
> Convert the userptr invalidation to use the two-pass model:
>
> Use invalidate_start as the first pass to mark the VMA for repin and
> enable software signalling on the VM reservation fences to start any
> gpu work needed for signaling. Fall back to completing the work
> synchronously if all fences are already signalled, or if a concurrent
> invalidation is already using the embedded finish structure.
>
> Use invalidate_finish as the second pass to wait for the reservation
> fences to complete, invalidate the GPU TLB in fault mode, and unmap
> the gpusvm pages.
>
> Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
> avoid dynamic allocation in the notifier callback. Use a finish_inuse
> flag to prevent two concurrent invalidations from using it
> simultaneously; fall back to the synchronous path for the second caller.
>
A couple of nits below. Also, for clarity, I'd probably rework this series...
- Move patch #3 to be the 2nd patch
- Squash patches #2 and #4 into a single patch, and make this the last patch
Let me know what you think about the reordering. I'm looking at the code
with the series applied, and aside from the nits below everything in
xe_userptr.c looks good to me.
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_userptr.c | 96 +++++++++++++++++++++++++--------
> drivers/gpu/drm/xe/xe_userptr.h | 14 +++++
> 2 files changed, 88 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> index e120323c43bc..440b0a79d16f 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -73,18 +73,42 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> &ctx);
> }
>
> -static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> +static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
> + bool is_deferred)
> {
> struct xe_userptr *userptr = &uvma->userptr;
> struct xe_vma *vma = &uvma->vma;
> - struct dma_resv_iter cursor;
> - struct dma_fence *fence;
> struct drm_gpusvm_ctx ctx = {
> .in_notifier = true,
> .read_only = xe_vma_read_only(vma),
> };
> long err;
>
xe_svm_assert_in_notifier(vm);
> + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> + DMA_RESV_USAGE_BOOKKEEP,
> + false, MAX_SCHEDULE_TIMEOUT);
> + XE_WARN_ON(err <= 0);
> +
> + if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> + err = xe_vm_invalidate_vma(vma);
> + XE_WARN_ON(err);
> + }
> +
> + if (is_deferred)
> + userptr->finish_inuse = false;
> + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> + xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> +}
> +
> +static struct mmu_interval_notifier_finish *
> +xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> +{
> + struct xe_userptr *userptr = &uvma->userptr;
> + struct xe_vma *vma = &uvma->vma;
> + struct dma_resv_iter cursor;
> + struct dma_fence *fence;
> + bool signaled = true;
> +
xe_svm_assert_in_notifier(vm);
> /*
> * Tell exec and rebind worker they need to repin and rebind this
> * userptr.
> @@ -105,27 +129,32 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> */
> dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
> DMA_RESV_USAGE_BOOKKEEP);
> - dma_resv_for_each_fence_unlocked(&cursor, fence)
> + dma_resv_for_each_fence_unlocked(&cursor, fence) {
> dma_fence_enable_sw_signaling(fence);
> + if (signaled && !dma_fence_is_signaled(fence))
> + signaled = false;
> + }
> dma_resv_iter_end(&cursor);
>
> - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> - DMA_RESV_USAGE_BOOKKEEP,
> - false, MAX_SCHEDULE_TIMEOUT);
> - XE_WARN_ON(err <= 0);
> -
> - if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> - err = xe_vm_invalidate_vma(vma);
> - XE_WARN_ON(err);
> + /*
> + * Only one caller at a time can use the multi-pass state.
> + * If it's already in use, or all fences are already signaled,
> + * proceed directly to invalidation without deferring.
> + */
> + if (signaled || userptr->finish_inuse) {
> + xe_vma_userptr_do_inval(vm, uvma, false);
> + return NULL;
> }
>
> - drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> - xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> + userptr->finish_inuse = true;
> +
> + return &userptr->finish;
> }
>
> -static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
> - const struct mmu_notifier_range *range,
> - unsigned long cur_seq)
> +static bool xe_vma_userptr_invalidate_start(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range *range,
> + unsigned long cur_seq,
> + struct mmu_interval_notifier_finish **p_finish)
> {
> struct xe_userptr_vma *uvma = container_of(mni, typeof(*uvma), userptr.notifier);
> struct xe_vma *vma = &uvma->vma;
> @@ -138,21 +167,40 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
> return false;
>
> vm_dbg(&xe_vma_vm(vma)->xe->drm,
> - "NOTIFIER: addr=0x%016llx, range=0x%016llx",
> + "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
> xe_vma_start(vma), xe_vma_size(vma));
>
> down_write(&vm->svm.gpusvm.notifier_lock);
> mmu_interval_set_seq(mni, cur_seq);
>
> - __vma_userptr_invalidate(vm, uvma);
> + *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> +
> up_write(&vm->svm.gpusvm.notifier_lock);
> - trace_xe_vma_userptr_invalidate_complete(vma);
> + if (!*p_finish)
> + trace_xe_vma_userptr_invalidate_complete(vma);
>
> return true;
> }
>
> +static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
> +{
> + struct xe_userptr_vma *uvma = container_of(finish, typeof(*uvma), userptr.finish);
> + struct xe_vma *vma = &uvma->vma;
> + struct xe_vm *vm = xe_vma_vm(vma);
> +
> + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> + "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
> + xe_vma_start(vma), xe_vma_size(vma));
> +
> + down_write(&vm->svm.gpusvm.notifier_lock);
> + xe_vma_userptr_do_inval(vm, uvma, true);
> + up_write(&vm->svm.gpusvm.notifier_lock);
> + trace_xe_vma_userptr_invalidate_complete(vma);
> +}
> +
> static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
> - .invalidate = vma_userptr_invalidate,
> + .invalidate_start = xe_vma_userptr_invalidate_start,
> + .invalidate_finish = xe_vma_userptr_invalidate_finish,
> };
>
> #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> @@ -164,6 +212,7 @@ static const struct mmu_interval_notifier_ops vma_userptr_notifier_ops = {
> */
> void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> {
> +	struct mmu_interval_notifier_finish *finish;
> struct xe_vm *vm = xe_vma_vm(&uvma->vma);
>
> /* Protect against concurrent userptr pinning */
> @@ -179,7 +228,10 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> if (!mmu_interval_read_retry(&uvma->userptr.notifier,
> uvma->userptr.pages.notifier_seq))
> uvma->userptr.pages.notifier_seq -= 2;
> - __vma_userptr_invalidate(vm, uvma);
> +
> + finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> + if (finish)
> + xe_vma_userptr_do_inval(vm, uvma, true);
> }
> #endif
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
> index ef801234991e..4f42db61fd62 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.h
> +++ b/drivers/gpu/drm/xe/xe_userptr.h
> @@ -57,12 +57,26 @@ struct xe_userptr {
> */
> struct mmu_interval_notifier notifier;
>
> + /**
> + * @finish: MMU notifier finish structure for two-pass invalidation.
> + * Embedded here to avoid allocation in the notifier callback.
> + * Protected by @vm::svm.gpusvm.notifier_lock.
> + */
> + struct mmu_interval_notifier_finish finish;
> + /**
> + * @finish_inuse: Whether @finish is currently in use by an in-progress
> + * two-pass invalidation.
> + * Protected by @vm::svm.gpusvm.notifier_lock.
> + */
> + bool finish_inuse;
> +
> /**
> * @initial_bind: user pointer has been bound at least once.
> * write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
> * read: vm->svm.gpusvm.notifier_lock in write mode or vm->resv held.
> */
> bool initial_bind;
> +
Unrelated.
Matt
> #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> u32 divisor;
> #endif
> --
> 2.53.0
>
* Re: [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-02 18:57 ` Matthew Brost
@ 2026-03-02 21:22 ` Thomas Hellström
2026-03-02 21:30 ` Matthew Brost
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 21:22 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
Hi,
On Mon, 2026-03-02 at 10:57 -0800, Matthew Brost wrote:
Thanks for reviewing,
> On Mon, Mar 02, 2026 at 05:32:46PM +0100, Thomas Hellström wrote:
> > In multi-GPU scenarios, asynchronous GPU job latency is a
> > bottleneck if
> > each notifier waits for its own GPU before returning. The two-pass
> > mmu_interval_notifier infrastructure allows deferring the wait to a
> > second pass, so all GPUs can be signalled in the first pass before
> > any of them are waited on.
> >
> > Convert the userptr invalidation to use the two-pass model:
> >
> > Use invalidate_start as the first pass to mark the VMA for repin
> > and
> > enable software signalling on the VM reservation fences to start
> > any
> > gpu work needed for signaling. Fall back to completing the work
> > synchronously if all fences are already signalled, or if a
> > concurrent
> > invalidation is already using the embedded finish structure.
> >
> > Use invalidate_finish as the second pass to wait for the
> > reservation
> > fences to complete, invalidate the GPU TLB in fault mode, and unmap
> > the gpusvm pages.
> >
> > Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
> > avoid dynamic allocation in the notifier callback. Use a
> > finish_inuse
> > flag to prevent two concurrent invalidations from using it
> > simultaneously; fall back to the synchronous path for the second
> > caller.
> >
>
> A couple nits below. Also for clarity, I'd probably rework this
> series...
>
> - Move patch #3 to 2nd to patch
> - Squash patch #2, #4 into a single patch, make thia the last patch
>
> Let me know what you think on the reordering. I'm looking with the
> series applied and aside from nits below everything in xe_userptr.c
> looks good to me.
We could do that, but unless you insist, I'd like to keep the current
ordering, since patch #2 is a very simple example of how to convert,
and also since #4 makes the notifier rather complex, so it'd be good
to be able to bisect between those two.
>
> > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_userptr.c | 96 +++++++++++++++++++++++++--------
> > drivers/gpu/drm/xe/xe_userptr.h | 14 +++++
> > 2 files changed, 88 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > b/drivers/gpu/drm/xe/xe_userptr.c
> > index e120323c43bc..440b0a79d16f 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -73,18 +73,42 @@ int xe_vma_userptr_pin_pages(struct
> > xe_userptr_vma *uvma)
> > &ctx);
> > }
> >
> > -static void __vma_userptr_invalidate(struct xe_vm *vm, struct
> > xe_userptr_vma *uvma)
> > +static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct
> > xe_userptr_vma *uvma,
> > + bool is_deferred)
> > {
> > struct xe_userptr *userptr = &uvma->userptr;
> > struct xe_vma *vma = &uvma->vma;
> > - struct dma_resv_iter cursor;
> > - struct dma_fence *fence;
> > struct drm_gpusvm_ctx ctx = {
> > .in_notifier = true,
> > .read_only = xe_vma_read_only(vma),
> > };
> > long err;
> >
>
> xe_svm_assert_in_notifier(vm);
This actually reveals a pre-existing bug: this code is called with the
notifier lock held in read mode and the vm resv held in the userptr
invalidation injection path, so that assert would hit.
Also, drm_gpusvm_unmap_pages() below asserts the same thing (so it is
also affected by the bug). For clarity I agree we might want an assert
here, but then it would need to include the other mode as well, and I'd
need to update the locking docs for the two-pass state.
>
> > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > + DMA_RESV_USAGE_BOOKKEEP,
> > + false, MAX_SCHEDULE_TIMEOUT);
> > + XE_WARN_ON(err <= 0);
> > +
> > + if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > + err = xe_vm_invalidate_vma(vma);
> > + XE_WARN_ON(err);
> > + }
> > +
> > + if (is_deferred)
> > + userptr->finish_inuse = false;
> > + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > >userptr.pages,
> > + xe_vma_size(vma) >> PAGE_SHIFT,
> > &ctx);
> > +}
> > +
> > +static struct mmu_interval_notifier_finish *
> > +xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct
> > xe_userptr_vma *uvma)
> > +{
> > + struct xe_userptr *userptr = &uvma->userptr;
> > + struct xe_vma *vma = &uvma->vma;
> > + struct dma_resv_iter cursor;
> > + struct dma_fence *fence;
> > + bool signaled = true;
> > +
>
> xe_svm_assert_in_notifier(vm);
Same here.
>
> > /*
> > * Tell exec and rebind worker they need to repin and
> > rebind this
> > * userptr.
> > @@ -105,27 +129,32 @@ static void __vma_userptr_invalidate(struct
> > xe_vm *vm, struct xe_userptr_vma *uv
> > */
> > dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
> > DMA_RESV_USAGE_BOOKKEEP);
> > - dma_resv_for_each_fence_unlocked(&cursor, fence)
> > + dma_resv_for_each_fence_unlocked(&cursor, fence) {
> > dma_fence_enable_sw_signaling(fence);
> > + if (signaled && !dma_fence_is_signaled(fence))
> > + signaled = false;
> > + }
> > dma_resv_iter_end(&cursor);
> >
> > - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > - DMA_RESV_USAGE_BOOKKEEP,
> > - false, MAX_SCHEDULE_TIMEOUT);
> > - XE_WARN_ON(err <= 0);
> > -
> > - if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > - err = xe_vm_invalidate_vma(vma);
> > - XE_WARN_ON(err);
> > + /*
> > + * Only one caller at a time can use the multi-pass state.
> > + * If it's already in use, or all fences are already
> > signaled,
> > + * proceed directly to invalidation without deferring.
> > + */
> > + if (signaled || userptr->finish_inuse) {
> > + xe_vma_userptr_do_inval(vm, uvma, false);
> > + return NULL;
> > }
> >
> > - drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > >userptr.pages,
> > - xe_vma_size(vma) >> PAGE_SHIFT,
> > &ctx);
> > + userptr->finish_inuse = true;
> > +
> > + return &userptr->finish;
> > }
> >
> > -static bool vma_userptr_invalidate(struct mmu_interval_notifier
> > *mni,
> > - const struct mmu_notifier_range
> > *range,
> > - unsigned long cur_seq)
> > +static bool xe_vma_userptr_invalidate_start(struct
> > mmu_interval_notifier *mni,
> > + const struct
> > mmu_notifier_range *range,
> > + unsigned long cur_seq,
> > + struct
> > mmu_interval_notifier_finish **p_finish)
> > {
> > struct xe_userptr_vma *uvma = container_of(mni,
> > typeof(*uvma), userptr.notifier);
> > struct xe_vma *vma = &uvma->vma;
> > @@ -138,21 +167,40 @@ static bool vma_userptr_invalidate(struct
> > mmu_interval_notifier *mni,
> > return false;
> >
> > vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > - "NOTIFIER: addr=0x%016llx, range=0x%016llx",
> > + "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
> > xe_vma_start(vma), xe_vma_size(vma));
> >
> > down_write(&vm->svm.gpusvm.notifier_lock);
> > mmu_interval_set_seq(mni, cur_seq);
> >
> > - __vma_userptr_invalidate(vm, uvma);
> > + *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> > +
> > up_write(&vm->svm.gpusvm.notifier_lock);
> > - trace_xe_vma_userptr_invalidate_complete(vma);
> > + if (!*p_finish)
> > + trace_xe_vma_userptr_invalidate_complete(vma);
> >
> > return true;
> > }
> >
> > +static void xe_vma_userptr_invalidate_finish(struct
> > mmu_interval_notifier_finish *finish)
> > +{
> > + struct xe_userptr_vma *uvma = container_of(finish,
> > typeof(*uvma), userptr.finish);
> > + struct xe_vma *vma = &uvma->vma;
> > + struct xe_vm *vm = xe_vma_vm(vma);
> > +
> > + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > + "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
> > + xe_vma_start(vma), xe_vma_size(vma));
> > +
> > + down_write(&vm->svm.gpusvm.notifier_lock);
> > + xe_vma_userptr_do_inval(vm, uvma, true);
> > + up_write(&vm->svm.gpusvm.notifier_lock);
> > + trace_xe_vma_userptr_invalidate_complete(vma);
> > +}
> > +
> > static const struct mmu_interval_notifier_ops
> > vma_userptr_notifier_ops = {
> > - .invalidate = vma_userptr_invalidate,
> > + .invalidate_start = xe_vma_userptr_invalidate_start,
> > + .invalidate_finish = xe_vma_userptr_invalidate_finish,
> > };
> >
> > #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> > @@ -164,6 +212,7 @@ static const struct mmu_interval_notifier_ops
> > vma_userptr_notifier_ops = {
> > */
> > void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> > {
> > > +	struct mmu_interval_notifier_finish *finish;
> > struct xe_vm *vm = xe_vma_vm(&uvma->vma);
> >
> > /* Protect against concurrent userptr pinning */
> > @@ -179,7 +228,10 @@ void xe_vma_userptr_force_invalidate(struct
> > xe_userptr_vma *uvma)
> > if (!mmu_interval_read_retry(&uvma->userptr.notifier,
> > uvma-
> > >userptr.pages.notifier_seq))
> > uvma->userptr.pages.notifier_seq -= 2;
> > - __vma_userptr_invalidate(vm, uvma);
> > +
> > + finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> > + if (finish)
> > + xe_vma_userptr_do_inval(vm, uvma, true);
> > }
> > #endif
> >
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.h
> > b/drivers/gpu/drm/xe/xe_userptr.h
> > index ef801234991e..4f42db61fd62 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.h
> > +++ b/drivers/gpu/drm/xe/xe_userptr.h
> > @@ -57,12 +57,26 @@ struct xe_userptr {
> > */
> > struct mmu_interval_notifier notifier;
> >
> > + /**
> > + * @finish: MMU notifier finish structure for two-pass
> > invalidation.
> > + * Embedded here to avoid allocation in the notifier
> > callback.
> > + * Protected by @vm::svm.gpusvm.notifier_lock.
> > + */
> > + struct mmu_interval_notifier_finish finish;
> > + /**
> > + * @finish_inuse: Whether @finish is currently in use by
> > an in-progress
> > + * two-pass invalidation.
> > + * Protected by @vm::svm.gpusvm.notifier_lock.
> > + */
> > + bool finish_inuse;
> > +
> > /**
> > * @initial_bind: user pointer has been bound at least
> > once.
> > * write: vm->svm.gpusvm.notifier_lock in read mode and
> > vm->resv held.
> > * read: vm->svm.gpusvm.notifier_lock in write mode or vm-
> > >resv held.
> > */
> > bool initial_bind;
> > +
>
> Unrelated.
Sure. Will fix.
Thanks,
Thomas
>
> Matt
>
> > #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> > u32 divisor;
> > #endif
> > --
> > 2.53.0
> >
* Re: [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-02 21:22 ` Thomas Hellström
@ 2026-03-02 21:30 ` Matthew Brost
0 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 21:30 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 10:22:13PM +0100, Thomas Hellström wrote:
> Hi,
>
> On Mon, 2026-03-02 at 10:57 -0800, Matthew Brost wrote:
>
> Thanks for reviewing,
>
> > On Mon, Mar 02, 2026 at 05:32:46PM +0100, Thomas Hellström wrote:
> > > In multi-GPU scenarios, asynchronous GPU job latency is a
> > > bottleneck if
> > > each notifier waits for its own GPU before returning. The two-pass
> > > mmu_interval_notifier infrastructure allows deferring the wait to a
> > > second pass, so all GPUs can be signalled in the first pass before
> > > any of them are waited on.
> > >
> > > Convert the userptr invalidation to use the two-pass model:
> > >
> > > Use invalidate_start as the first pass to mark the VMA for repin
> > > and
> > > enable software signalling on the VM reservation fences to start
> > > any
> > > gpu work needed for signaling. Fall back to completing the work
> > > synchronously if all fences are already signalled, or if a
> > > concurrent
> > > invalidation is already using the embedded finish structure.
> > >
> > > Use invalidate_finish as the second pass to wait for the
> > > reservation
> > > fences to complete, invalidate the GPU TLB in fault mode, and unmap
> > > the gpusvm pages.
> > >
> > > Embed a struct mmu_interval_notifier_finish in struct xe_userptr to
> > > avoid dynamic allocation in the notifier callback. Use a
> > > finish_inuse
> > > flag to prevent two concurrent invalidations from using it
> > > simultaneously; fall back to the synchronous path for the second
> > > caller.
> > >
> >
> > A couple nits below. Also for clarity, I'd probably rework this
> > series...
> >
> > - Move patch #3 to 2nd to patch
> > - Squash patch #2, #4 into a single patch, make thia the last patch
> >
> > Let me know what you think on the reordering. I'm looking with the
> > series applied and aside from nits below everything in xe_userptr.c
> > looks good to me.
>
> We could do that, but unless you insist, I'd like to keep the current
> ordering since patch #2 is a very simple example on how to convert and
> also since #4 makes the notifier rather complex so it'd be good to
> be able to bisect in between those two.
>
I think it is fine the way you have it - I never have strong opinions on
stuff like this. To review, I just had to apply the series to a branch
to get the full picture and verify all the flows were correct.
> >
> > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_userptr.c | 96 +++++++++++++++++++++++++--------
> > > drivers/gpu/drm/xe/xe_userptr.h | 14 +++++
> > > 2 files changed, 88 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > b/drivers/gpu/drm/xe/xe_userptr.c
> > > index e120323c43bc..440b0a79d16f 100644
> > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > @@ -73,18 +73,42 @@ int xe_vma_userptr_pin_pages(struct
> > > xe_userptr_vma *uvma)
> > > &ctx);
> > > }
> > >
> > > -static void __vma_userptr_invalidate(struct xe_vm *vm, struct
> > > xe_userptr_vma *uvma)
> > > +static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct
> > > xe_userptr_vma *uvma,
> > > + bool is_deferred)
> > > {
> > > struct xe_userptr *userptr = &uvma->userptr;
> > > struct xe_vma *vma = &uvma->vma;
> > > - struct dma_resv_iter cursor;
> > > - struct dma_fence *fence;
> > > struct drm_gpusvm_ctx ctx = {
> > > .in_notifier = true,
> > > .read_only = xe_vma_read_only(vma),
> > > };
> > > long err;
> > >
> >
> > xe_svm_assert_in_notifier(vm);
>
> This actually reveals a pre-existing bug. Since this code
> is called with the notifier lock held in read mode, and
> the vm resv held in the userptr invalidation injection.
> That assert would hit.
>
>
> Also drm_gpusvm_unmap_pages() below will assert the same thing, (also
> affected by the bug)
> but for clarity I agree we might want to have an assert here, but then
> it would need to include the other mode as well, and I'd need to update
> the locking docs for the two-pass state.
>
Right! I have pointed this out to Auld, as this popped up in my testing
of the CPU bind / ULLS series implementation once I enabled
DRM_XE_USERPTR_INVAL_INJECT. We should fix up drm_gpusvm_unmap_pages
soonish, and enable DRM_XE_USERPTR_INVAL_INJECT in some CI runs too.
I think some lockdep assert would be good - even if it is just the
non-write-mode version, with maybe a brief explanation that the error
injection path can do this in read mode while the notifier path does
it in write mode.
Matt
> >
> > > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > + DMA_RESV_USAGE_BOOKKEEP,
> > > + false, MAX_SCHEDULE_TIMEOUT);
> > > + XE_WARN_ON(err <= 0);
> > > +
> > > + if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > + err = xe_vm_invalidate_vma(vma);
> > > + XE_WARN_ON(err);
> > > + }
> > > +
> > > + if (is_deferred)
> > > + userptr->finish_inuse = false;
> > > + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > > >userptr.pages,
> > > + xe_vma_size(vma) >> PAGE_SHIFT,
> > > &ctx);
> > > +}
> > > +
> > > +static struct mmu_interval_notifier_finish *
> > > +xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct
> > > xe_userptr_vma *uvma)
> > > +{
> > > + struct xe_userptr *userptr = &uvma->userptr;
> > > + struct xe_vma *vma = &uvma->vma;
> > > + struct dma_resv_iter cursor;
> > > + struct dma_fence *fence;
> > > + bool signaled = true;
> > > +
> >
> > xe_svm_assert_in_notifier(vm);
>
> Same here.
>
> >
> > > /*
> > > * Tell exec and rebind worker they need to repin and
> > > rebind this
> > > * userptr.
> > > @@ -105,27 +129,32 @@ static void __vma_userptr_invalidate(struct
> > > xe_vm *vm, struct xe_userptr_vma *uv
> > > */
> > > dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
> > > DMA_RESV_USAGE_BOOKKEEP);
> > > - dma_resv_for_each_fence_unlocked(&cursor, fence)
> > > + dma_resv_for_each_fence_unlocked(&cursor, fence) {
> > > dma_fence_enable_sw_signaling(fence);
> > > + if (signaled && !dma_fence_is_signaled(fence))
> > > + signaled = false;
> > > + }
> > > dma_resv_iter_end(&cursor);
> > >
> > > - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > - DMA_RESV_USAGE_BOOKKEEP,
> > > - false, MAX_SCHEDULE_TIMEOUT);
> > > - XE_WARN_ON(err <= 0);
> > > -
> > > - if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > - err = xe_vm_invalidate_vma(vma);
> > > - XE_WARN_ON(err);
> > > + /*
> > > + * Only one caller at a time can use the multi-pass state.
> > > + * If it's already in use, or all fences are already
> > > signaled,
> > > + * proceed directly to invalidation without deferring.
> > > + */
> > > + if (signaled || userptr->finish_inuse) {
> > > + xe_vma_userptr_do_inval(vm, uvma, false);
> > > + return NULL;
> > > }
> > >
> > > - drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > > >userptr.pages,
> > > - xe_vma_size(vma) >> PAGE_SHIFT,
> > > &ctx);
> > > + userptr->finish_inuse = true;
> > > +
> > > + return &userptr->finish;
> > > }
> > >
> > > -static bool vma_userptr_invalidate(struct mmu_interval_notifier
> > > *mni,
> > > - const struct mmu_notifier_range
> > > *range,
> > > - unsigned long cur_seq)
> > > +static bool xe_vma_userptr_invalidate_start(struct
> > > mmu_interval_notifier *mni,
> > > + const struct
> > > mmu_notifier_range *range,
> > > + unsigned long cur_seq,
> > > + struct
> > > mmu_interval_notifier_finish **p_finish)
> > > {
> > > struct xe_userptr_vma *uvma = container_of(mni,
> > > typeof(*uvma), userptr.notifier);
> > > struct xe_vma *vma = &uvma->vma;
> > > @@ -138,21 +167,40 @@ static bool vma_userptr_invalidate(struct
> > > mmu_interval_notifier *mni,
> > > return false;
> > >
> > > vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > > - "NOTIFIER: addr=0x%016llx, range=0x%016llx",
> > > + "NOTIFIER PASS1: addr=0x%016llx, range=0x%016llx",
> > > xe_vma_start(vma), xe_vma_size(vma));
> > >
> > > down_write(&vm->svm.gpusvm.notifier_lock);
> > > mmu_interval_set_seq(mni, cur_seq);
> > >
> > > - __vma_userptr_invalidate(vm, uvma);
> > > + *p_finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> > > +
> > > up_write(&vm->svm.gpusvm.notifier_lock);
> > > - trace_xe_vma_userptr_invalidate_complete(vma);
> > > + if (!*p_finish)
> > > + trace_xe_vma_userptr_invalidate_complete(vma);
> > >
> > > return true;
> > > }
> > >
> > > +static void xe_vma_userptr_invalidate_finish(struct
> > > mmu_interval_notifier_finish *finish)
> > > +{
> > > + struct xe_userptr_vma *uvma = container_of(finish,
> > > typeof(*uvma), userptr.finish);
> > > + struct xe_vma *vma = &uvma->vma;
> > > + struct xe_vm *vm = xe_vma_vm(vma);
> > > +
> > > + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > > + "NOTIFIER PASS2: addr=0x%016llx, range=0x%016llx",
> > > + xe_vma_start(vma), xe_vma_size(vma));
> > > +
> > > + down_write(&vm->svm.gpusvm.notifier_lock);
> > > + xe_vma_userptr_do_inval(vm, uvma, true);
> > > + up_write(&vm->svm.gpusvm.notifier_lock);
> > > + trace_xe_vma_userptr_invalidate_complete(vma);
> > > +}
> > > +
> > > static const struct mmu_interval_notifier_ops
> > > vma_userptr_notifier_ops = {
> > > - .invalidate = vma_userptr_invalidate,
> > > + .invalidate_start = xe_vma_userptr_invalidate_start,
> > > + .invalidate_finish = xe_vma_userptr_invalidate_finish,
> > > };
> > >
> > > #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> > > @@ -164,6 +212,7 @@ static const struct mmu_interval_notifier_ops
> > > vma_userptr_notifier_ops = {
> > > */
> > > void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
> > > {
> > > + static struct mmu_interval_notifier_finish *finish;
> > > struct xe_vm *vm = xe_vma_vm(&uvma->vma);
> > >
> > > /* Protect against concurrent userptr pinning */
> > > @@ -179,7 +228,10 @@ void xe_vma_userptr_force_invalidate(struct
> > > xe_userptr_vma *uvma)
> > > if (!mmu_interval_read_retry(&uvma->userptr.notifier,
> > > uvma-
> > > >userptr.pages.notifier_seq))
> > > uvma->userptr.pages.notifier_seq -= 2;
> > > - __vma_userptr_invalidate(vm, uvma);
> > > +
> > > + finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> > > + if (finish)
> > > + xe_vma_userptr_do_inval(vm, uvma, true);
> > > }
> > > #endif
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_userptr.h
> > > b/drivers/gpu/drm/xe/xe_userptr.h
> > > index ef801234991e..4f42db61fd62 100644
> > > --- a/drivers/gpu/drm/xe/xe_userptr.h
> > > +++ b/drivers/gpu/drm/xe/xe_userptr.h
> > > @@ -57,12 +57,26 @@ struct xe_userptr {
> > > */
> > > struct mmu_interval_notifier notifier;
> > >
> > > + /**
> > > + * @finish: MMU notifier finish structure for two-pass
> > > invalidation.
> > > + * Embedded here to avoid allocation in the notifier
> > > callback.
> > > + * Protected by @vm::svm.gpusvm.notifier_lock.
> > > + */
> > > + struct mmu_interval_notifier_finish finish;
> > > + /**
> > > + * @finish_inuse: Whether @finish is currently in use by
> > > an in-progress
> > > + * two-pass invalidation.
> > > + * Protected by @vm::svm.gpusvm.notifier_lock.
> > > + */
> > > + bool finish_inuse;
> > > +
> > > /**
> > > * @initial_bind: user pointer has been bound at least
> > > once.
> > > * write: vm->svm.gpusvm.notifier_lock in read mode and
> > > vm->resv held.
> > > * read: vm->svm.gpusvm.notifier_lock in write mode or vm-
> > > >resv held.
> > > */
> > > bool initial_bind;
> > > +
> >
> > Unrelated.
>
> Sure. Will fix.
>
> Thanks,
> Thomas
>
>
>
> >
> > Matt
> >
> > > #if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
> > > u32 divisor;
> > > #endif
> > > --
> > > 2.53.0
> > >
^ permalink raw reply [flat|nested] 21+ messages in thread
* Claude review: drm/xe/userptr: Convert invalidation to two-pass MMU notifier
2026-03-02 16:32 ` [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
2026-03-02 18:57 ` Matthew Brost
@ 2026-03-03 3:05 ` Claude Code Review Bot
1 sibling, 0 replies; 21+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This patch converts the xe userptr invalidation to use the two-pass model from patch 1. The split into `xe_vma_userptr_invalidate_pass1` (enable SW signaling, check if fences are already signaled) and `xe_vma_userptr_do_inval` (wait for fences, invalidate TLB, unmap pages) is clean.
**Bug: `static` local variable in `xe_vma_userptr_force_invalidate`:**
```c
void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
{
static struct mmu_interval_notifier_finish *finish;
```
This `static` is almost certainly a bug. A local pointer to the finish struct should not be `static` - this serves no purpose and would be shared across all calls. If two threads call `xe_vma_userptr_force_invalidate` concurrently, the `static` variable creates a data race. This should be a plain local variable. (It appears this may have been introduced by the AI assistant.)
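The fix is simply to drop the qualifier, e.g. (sketch):

```diff
-	static struct mmu_interval_notifier_finish *finish;
+	struct mmu_interval_notifier_finish *finish;
```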
**`finish_inuse` flag race consideration:**
The `finish_inuse` flag is documented as protected by `vm::svm.gpusvm.notifier_lock`. In `xe_vma_userptr_invalidate_pass1`, it's checked and set under this lock. In `xe_vma_userptr_do_inval`, it's cleared under the lock in `xe_vma_userptr_invalidate_finish`. In the `xe_vma_userptr_invalidate_start` callback, the lock is held for the duration of the pass1 call. This looks correct.
**Locking in `xe_vma_userptr_invalidate_finish`:**
```c
down_write(&vm->svm.gpusvm.notifier_lock);
xe_vma_userptr_do_inval(vm, uvma, true);
up_write(&vm->svm.gpusvm.notifier_lock);
```
The finish callback takes the notifier_lock to protect the shared state. This is called from `mn_itree_finish_pass` which runs after the interval tree walk but before `mn_itree_inv_end`. The `mn_itree_inv_end` call releases the write side of `subscriptions->invalidate_seq`. So this should be safe - the finish pass runs in the same context as the original invalidation.
**Question: What happens when `invalidate_start` returns `true` but `*p_finish` is NULL?**
Looking at the `mn_itree_invalidate` code, when `ret` is true but `finish` is NULL, the notifier simply isn't added to the finish list. This is the correct "completed synchronously" path.
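Paraphrased, the relevant core logic is roughly the following (a
sketch of the control flow, not the literal code; the list-member name
is assumed):

```c
/* sketch: inside mn_itree_invalidate(), per matching notifier */
finish = NULL;
ret = ops->invalidate_start(mni, range, cur_seq, &finish);
if (ret && finish)
	llist_add(&finish->node, &finish_list);	/* deferred to pass 2 */
/* finish == NULL: the notifier completed synchronously in pass 1 */
```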
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-02 16:32 [PATCH v2 0/4] Two-pass MMU interval notifiers Thomas Hellström
2026-03-02 16:32 ` [PATCH v2 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers Thomas Hellström
2026-03-02 16:32 ` [PATCH v2 2/4] drm/xe/userptr: Convert invalidation to two-pass MMU notifier Thomas Hellström
@ 2026-03-02 16:32 ` Thomas Hellström
2026-03-02 19:06 ` Matthew Brost
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
2026-03-03 3:05 ` Claude review: Two-pass MMU interval notifiers Claude Code Review Bot
4 siblings, 2 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 16:32 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests to
all GTs in a tile mask and then immediately waits for them to complete
before returning. This is fine for the existing callers, but a
subsequent patch will need to defer the wait in order to overlap TLB
invalidations across multiple VMAs.
Introduce xe_tlb_inval_range_tilemask_submit() and
xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
halves respectively. The batch of fences is carried in the new
xe_tlb_inval_batch structure. Remove xe_vm_range_tilemask_tlb_inval()
and convert all three call sites to the new API.
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 6 +-
drivers/gpu/drm/xe/xe_tlb_inval.c | 82 +++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
drivers/gpu/drm/xe/xe_vm.c | 69 +++------------------
drivers/gpu/drm/xe/xe_vm.h | 3 -
drivers/gpu/drm/xe/xe_vm_madvise.c | 9 ++-
drivers/gpu/drm/xe/xe_vm_types.h | 1 +
8 files changed, 123 insertions(+), 67 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 002b6c22ad3f..6ea4972c2791 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -19,6 +19,7 @@
#include "xe_pt.h"
#include "xe_svm.h"
#include "xe_tile.h"
+#include "xe_tlb_inval.h"
#include "xe_ttm_vram_mgr.h"
#include "xe_vm.h"
#include "xe_vm_types.h"
@@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
const struct mmu_notifier_range *mmu_range)
{
struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+ struct xe_tlb_inval_batch _batch;
struct xe_device *xe = vm->xe;
struct drm_gpusvm_range *r, *first;
struct xe_tile *tile;
@@ -276,7 +278,9 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
xe_device_wmb(xe);
- err = xe_vm_range_tilemask_tlb_inval(vm, adj_start, adj_end, tile_mask);
+ err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid, adj_start, adj_end,
+ tile_mask, &_batch);
+ xe_tlb_inval_batch_wait(&_batch);
WARN_ON_ONCE(err);
range_notifier_event_end:
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c b/drivers/gpu/drm/xe/xe_tlb_inval.c
index 933f30fb617d..343e37cfe715 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval.c
+++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
@@ -486,3 +486,85 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval)
guard(spinlock_irq)(&tlb_inval->pending_lock);
return list_is_singular(&tlb_inval->pending_fences);
}
+
+/**
+ * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB invalidation batch
+ * @batch: Batch of TLB invalidation fences to wait on
+ *
+ * Waits for every fence in @batch to signal, then resets @batch so it can be
+ * reused for a subsequent invalidation.
+ */
+void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
+{
+ struct xe_tlb_inval_fence *fence = &batch->fence[0];
+ unsigned int i;
+
+ for (i = 0; i < batch->num_fences; ++i)
+ xe_tlb_inval_fence_wait(fence++);
+
+ batch->num_fences = 0;
+}
+
+/**
+ * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations for an
+ * address range on a tile mask
+ * @xe: The xe device
+ * @asid: Address space ID
+ * @start: start address
+ * @end: end address
+ * @tile_mask: mask for which gt's issue tlb invalidation
+ * @batch: Batch of tlb invalidate fences
+ *
+ * Issue a range based TLB invalidation for gt's in tilemask
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
+ u64 start, u64 end, u8 tile_mask,
+ struct xe_tlb_inval_batch *batch)
+{
+ struct xe_tlb_inval_fence *fence = &batch->fence[0];
+ struct xe_tile *tile;
+ u32 fence_id = 0;
+ u8 id;
+ int err;
+
+ batch->num_fences = 0;
+ if (!tile_mask)
+ return 0;
+
+ for_each_tile(tile, xe, id) {
+ if (!(tile_mask & BIT(id)))
+ continue;
+
+ xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
+ &fence[fence_id], true);
+
+ err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
+ &fence[fence_id], start, end,
+ asid, NULL);
+ if (err)
+ goto wait;
+ ++fence_id;
+
+ if (!tile->media_gt)
+ continue;
+
+ xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
+ &fence[fence_id], true);
+
+ err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
+ &fence[fence_id], start, end,
+ asid, NULL);
+ if (err)
+ goto wait;
+ ++fence_id;
+ }
+
+wait:
+ batch->num_fences = fence_id;
+ if (err)
+ xe_tlb_inval_batch_wait(batch);
+
+ return err;
+}
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h b/drivers/gpu/drm/xe/xe_tlb_inval.h
index 62089254fa23..a76b7823a5f2 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval.h
+++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
@@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct xe_tlb_inval *tlb_inval, int seqno);
bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
+int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
+ u64 start, u64 end, u8 tile_mask,
+ struct xe_tlb_inval_batch *batch);
+
+void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
+
#endif /* _XE_TLB_INVAL_ */
diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
index 3b089f90f002..3d1797d186fd 100644
--- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
+++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
@@ -9,6 +9,8 @@
#include <linux/workqueue.h>
#include <linux/dma-fence.h>
+#include "xe_device_types.h"
+
struct drm_suballoc;
struct xe_tlb_inval;
@@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
ktime_t inval_time;
};
+/**
+ * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
+ *
+ * Holds one fence per GT covered by a TLB invalidation request.
+ */
+struct xe_tlb_inval_batch {
+ /** @fence: per-GT TLB invalidation fences */
+ struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
+ /** @num_fences: number of valid entries in @fence */
+ unsigned int num_fences;
+};
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 548b0769b3ef..7f29d2b2972d 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
dma_resv_unlock(xe_vm_resv(vm));
}
-/**
- * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on this tilemask for an
- * address range
- * @vm: The VM
- * @start: start address
- * @end: end address
- * @tile_mask: mask for which gt's issue tlb invalidation
- *
- * Issue a range based TLB invalidation for gt's in tilemask
- *
- * Returns 0 for success, negative error code otherwise.
- */
-int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
- u64 end, u8 tile_mask)
-{
- struct xe_tlb_inval_fence
- fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
- struct xe_tile *tile;
- u32 fence_id = 0;
- u8 id;
- int err;
-
- if (!tile_mask)
- return 0;
-
- for_each_tile(tile, vm->xe, id) {
- if (!(tile_mask & BIT(id)))
- continue;
-
- xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
- &fence[fence_id], true);
-
- err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
- &fence[fence_id], start, end,
- vm->usm.asid, NULL);
- if (err)
- goto wait;
- ++fence_id;
-
- if (!tile->media_gt)
- continue;
-
- xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
- &fence[fence_id], true);
-
- err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
- &fence[fence_id], start, end,
- vm->usm.asid, NULL);
- if (err)
- goto wait;
- ++fence_id;
- }
-
-wait:
- for (id = 0; id < fence_id; ++id)
- xe_tlb_inval_fence_wait(&fence[id]);
-
- return err;
-}
-
/**
* xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
* @vma: VMA to invalidate
@@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
{
struct xe_device *xe = xe_vma_vm(vma)->xe;
struct xe_vm *vm = xe_vma_vm(vma);
+ struct xe_tlb_inval_batch _batch;
struct xe_tile *tile;
u8 tile_mask = 0;
int ret = 0;
@@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
xe_device_wmb(xe);
- ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma), xe_vma_start(vma),
- xe_vma_end(vma), tile_mask);
+ ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
+ xe_vma_start(vma), xe_vma_end(vma),
+ tile_mask, &_batch);
/* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
+ if (!ret)
+ xe_tlb_inval_batch_wait(&_batch);
+
return ret;
}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index f849e369432b..62f4b6fec0bc 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
struct xe_svm_range *range);
-int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
- u64 end, u8 tile_mask);
-
int xe_vm_invalidate_vma(struct xe_vma *vma);
int xe_vm_validate_protected(struct xe_vm *vm);
diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c b/drivers/gpu/drm/xe/xe_vm_madvise.c
index 95bf53cc29e3..39717026e84f 100644
--- a/drivers/gpu/drm/xe/xe_vm_madvise.c
+++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
@@ -12,6 +12,7 @@
#include "xe_pat.h"
#include "xe_pt.h"
#include "xe_svm.h"
+#include "xe_tlb_inval.h"
struct xe_vmas_in_madvise_range {
u64 addr;
@@ -235,13 +236,19 @@ static u8 xe_zap_ptes_in_madvise_range(struct xe_vm *vm, u64 start, u64 end)
static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64 start, u64 end)
{
u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start, end);
+ struct xe_tlb_inval_batch batch;
+ int err;
if (!tile_mask)
return 0;
xe_device_wmb(vm->xe);
- return xe_vm_range_tilemask_tlb_inval(vm, start, end, tile_mask);
+ err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm->usm.asid, start, end,
+ tile_mask, &batch);
+ xe_tlb_inval_batch_wait(&batch);
+
+ return err;
}
static bool madvise_args_are_sane(struct xe_device *xe, const struct drm_xe_madvise *args)
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1f6f7e30e751..de6544165cfa 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -18,6 +18,7 @@
#include "xe_device_types.h"
#include "xe_pt_types.h"
#include "xe_range_fence.h"
+#include "xe_tlb_inval_types.h"
#include "xe_userptr.h"
struct drm_pagemap;
--
2.53.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-02 16:32 ` [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
@ 2026-03-02 19:06 ` Matthew Brost
2026-03-02 21:29 ` Thomas Hellström
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 19:06 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 05:32:47PM +0100, Thomas Hellström wrote:
> xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests to
> all GTs in a tile mask and then immediately waits for them to complete
> before returning. This is fine for the existing callers, but a
> subsequent patch will need to defer the wait in order to overlap TLB
> invalidations across multiple VMAs.
>
> Introduce xe_tlb_inval_range_tilemask_submit() and
> xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
> halves respectively. The batch of fences is carried in the new
> xe_tlb_inval_batch structure. Remove xe_vm_range_tilemask_tlb_inval()
> and convert all three call sites to the new API.
>
Mostly nits...
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 6 +-
> drivers/gpu/drm/xe/xe_tlb_inval.c | 82 +++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
> drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
> drivers/gpu/drm/xe/xe_vm.c | 69 +++------------------
> drivers/gpu/drm/xe/xe_vm.h | 3 -
> drivers/gpu/drm/xe/xe_vm_madvise.c | 9 ++-
> drivers/gpu/drm/xe/xe_vm_types.h | 1 +
> 8 files changed, 123 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 002b6c22ad3f..6ea4972c2791 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -19,6 +19,7 @@
> #include "xe_pt.h"
> #include "xe_svm.h"
> #include "xe_tile.h"
> +#include "xe_tlb_inval.h"
> #include "xe_ttm_vram_mgr.h"
> #include "xe_vm.h"
> #include "xe_vm_types.h"
> @@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> const struct mmu_notifier_range *mmu_range)
> {
> struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> + struct xe_tlb_inval_batch _batch;
> struct xe_device *xe = vm->xe;
> struct drm_gpusvm_range *r, *first;
> struct xe_tile *tile;
> @@ -276,7 +278,9 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>
> xe_device_wmb(xe);
>
> - err = xe_vm_range_tilemask_tlb_inval(vm, adj_start, adj_end, tile_mask);
> + err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid, adj_start, adj_end,
> + tile_mask, &_batch);
> + xe_tlb_inval_batch_wait(&_batch);
No need to call wait on an error but it is harmless.
So you could write it like this:
if (!WARN_ON_ONCE(err))
xe_tlb_inval_batch_wait(&_batch);
> WARN_ON_ONCE(err);
>
> range_notifier_event_end:
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c b/drivers/gpu/drm/xe/xe_tlb_inval.c
> index 933f30fb617d..343e37cfe715 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval.c
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
> @@ -486,3 +486,85 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval)
> guard(spinlock_irq)(&tlb_inval->pending_lock);
> return list_is_singular(&tlb_inval->pending_fences);
> }
> +
> +/**
> + * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB invalidation batch
> + * @batch: Batch of TLB invalidation fences to wait on
> + *
> + * Waits for every fence in @batch to signal, then resets @batch so it can be
> + * reused for a subsequent invalidation.
> + */
> +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
> +{
> + struct xe_tlb_inval_fence *fence = &batch->fence[0];
Would this be better:
s/&batch->fence[0]/batch->fence
Personal preference I guess.
> + unsigned int i;
> +
> + for (i = 0; i < batch->num_fences; ++i)
> + xe_tlb_inval_fence_wait(fence++);
> +
> + batch->num_fences = 0;
> +}
> +
> +/**
> + * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations for an
> + * address range on a tile mask
> + * @xe: The xe device
> + * @asid: Address space ID
> + * @start: start address
> + * @end: end address
> + * @tile_mask: mask for which gt's issue tlb invalidation
> + * @batch: Batch of tlb invalidate fences
> + *
> + * Issue a range based TLB invalidation for gt's in tilemask
> + *
Mention no need to wait on batch if this function returns an error?
> + * Returns 0 for success, negative error code otherwise.
> + */
> +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
> + u64 start, u64 end, u8 tile_mask,
> + struct xe_tlb_inval_batch *batch)
> +{
> + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> + struct xe_tile *tile;
> + u32 fence_id = 0;
> + u8 id;
> + int err;
> +
> + batch->num_fences = 0;
> + if (!tile_mask)
> + return 0;
> +
> + for_each_tile(tile, xe, id) {
> + if (!(tile_mask & BIT(id)))
> + continue;
> +
> + xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
> + &fence[fence_id], true);
> +
> + err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
> + &fence[fence_id], start, end,
> + asid, NULL);
> + if (err)
> + goto wait;
> + ++fence_id;
> +
> + if (!tile->media_gt)
> + continue;
> +
> + xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
> + &fence[fence_id], true);
> +
> + err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
> + &fence[fence_id], start, end,
> + asid, NULL);
> + if (err)
> + goto wait;
> + ++fence_id;
> + }
> +
> +wait:
> + batch->num_fences = fence_id;
Should 'batch->num_fences' only get set on success?
> + if (err)
> + xe_tlb_inval_batch_wait(batch);
> +
> + return err;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h b/drivers/gpu/drm/xe/xe_tlb_inval.h
> index 62089254fa23..a76b7823a5f2 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval.h
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
> @@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct xe_tlb_inval *tlb_inval, int seqno);
>
> bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
>
> +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32 asid,
> + u64 start, u64 end, u8 tile_mask,
> + struct xe_tlb_inval_batch *batch);
> +
> +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
> +
> #endif /* _XE_TLB_INVAL_ */
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> index 3b089f90f002..3d1797d186fd 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> @@ -9,6 +9,8 @@
> #include <linux/workqueue.h>
> #include <linux/dma-fence.h>
>
> +#include "xe_device_types.h"
> +
> struct drm_suballoc;
> struct xe_tlb_inval;
>
> @@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
> ktime_t inval_time;
> };
>
> +/**
> + * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
> + *
> + * Holds one fence per GT covered by a TLB invalidation request.
> + */
> +struct xe_tlb_inval_batch {
> + /** @fence: per-GT TLB invalidation fences */
> + struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> + /** @num_fences: number of valid entries in @fence */
> + unsigned int num_fences;
> +};
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 548b0769b3ef..7f29d2b2972d 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
> dma_resv_unlock(xe_vm_resv(vm));
> }
>
> -/**
> - * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on this tilemask for an
> - * address range
> - * @vm: The VM
> - * @start: start address
> - * @end: end address
> - * @tile_mask: mask for which gt's issue tlb invalidation
> - *
> - * Issue a range based TLB invalidation for gt's in tilemask
> - *
> - * Returns 0 for success, negative error code otherwise.
> - */
> -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> - u64 end, u8 tile_mask)
> -{
> - struct xe_tlb_inval_fence
> - fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> - struct xe_tile *tile;
> - u32 fence_id = 0;
> - u8 id;
> - int err;
> -
> - if (!tile_mask)
> - return 0;
> -
> - for_each_tile(tile, vm->xe, id) {
> - if (!(tile_mask & BIT(id)))
> - continue;
> -
> - xe_tlb_inval_fence_init(&tile->primary_gt->tlb_inval,
> - &fence[fence_id], true);
> -
> - err = xe_tlb_inval_range(&tile->primary_gt->tlb_inval,
> - &fence[fence_id], start, end,
> - vm->usm.asid, NULL);
> - if (err)
> - goto wait;
> - ++fence_id;
> -
> - if (!tile->media_gt)
> - continue;
> -
> - xe_tlb_inval_fence_init(&tile->media_gt->tlb_inval,
> - &fence[fence_id], true);
> -
> - err = xe_tlb_inval_range(&tile->media_gt->tlb_inval,
> - &fence[fence_id], start, end,
> - vm->usm.asid, NULL);
> - if (err)
> - goto wait;
> - ++fence_id;
> - }
> -
> -wait:
> - for (id = 0; id < fence_id; ++id)
> - xe_tlb_inval_fence_wait(&fence[id]);
> -
> - return err;
> -}
> -
> /**
> * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> * @vma: VMA to invalidate
> @@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> {
> struct xe_device *xe = xe_vma_vm(vma)->xe;
> struct xe_vm *vm = xe_vma_vm(vma);
> + struct xe_tlb_inval_batch _batch;
Why not just 'batch'?
> struct xe_tile *tile;
> u8 tile_mask = 0;
> int ret = 0;
> @@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
>
> xe_device_wmb(xe);
>
> - ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma), xe_vma_start(vma),
> - xe_vma_end(vma), tile_mask);
> + ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
> + xe_vma_start(vma), xe_vma_end(vma),
> + tile_mask, &_batch);
>
> /* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
> WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
>
> + if (!ret)
> + xe_tlb_inval_batch_wait(&_batch);
> +
Here we skip the wait on error, hence my suggestion to skip waits in
the other code paths, or at a minimum make the call semantics
consistent.
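I.e., all three call sites could then follow the same pattern used
here (sketch):

```c
err = xe_tlb_inval_range_tilemask_submit(xe, asid, start, end,
					 tile_mask, &batch);
if (!err)
	xe_tlb_inval_batch_wait(&batch);
```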
> return ret;
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index f849e369432b..62f4b6fec0bc 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> struct xe_svm_range *range);
>
> -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> - u64 end, u8 tile_mask);
> -
> int xe_vm_invalidate_vma(struct xe_vma *vma);
>
> int xe_vm_validate_protected(struct xe_vm *vm);
> diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c b/drivers/gpu/drm/xe/xe_vm_madvise.c
> index 95bf53cc29e3..39717026e84f 100644
> --- a/drivers/gpu/drm/xe/xe_vm_madvise.c
> +++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
> @@ -12,6 +12,7 @@
> #include "xe_pat.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> +#include "xe_tlb_inval.h"
>
> struct xe_vmas_in_madvise_range {
> u64 addr;
> @@ -235,13 +236,19 @@ static u8 xe_zap_ptes_in_madvise_range(struct xe_vm *vm, u64 start, u64 end)
> static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64 start, u64 end)
> {
> u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start, end);
> + struct xe_tlb_inval_batch batch;
> + int err;
>
> if (!tile_mask)
> return 0;
>
> xe_device_wmb(vm->xe);
>
> - return xe_vm_range_tilemask_tlb_inval(vm, start, end, tile_mask);
> + err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm->usm.asid, start, end,
> + tile_mask, &batch);
> + xe_tlb_inval_batch_wait(&batch);
No need to wait on error.
> +
> + return err;
> }
>
> static bool madvise_args_are_sane(struct xe_device *xe, const struct drm_xe_madvise *args)
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 1f6f7e30e751..de6544165cfa 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -18,6 +18,7 @@
> #include "xe_device_types.h"
> #include "xe_pt_types.h"
> #include "xe_range_fence.h"
> +#include "xe_tlb_inval_types.h"
> #include "xe_userptr.h"
>
> struct drm_pagemap;
> --
> 2.53.0
>
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-02 19:06 ` Matthew Brost
@ 2026-03-02 21:29 ` Thomas Hellström
2026-03-02 21:31 ` Matthew Brost
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 21:29 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, 2026-03-02 at 11:06 -0800, Matthew Brost wrote:
> On Mon, Mar 02, 2026 at 05:32:47PM +0100, Thomas Hellström wrote:
> > xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests
> > to
> > all GTs in a tile mask and then immediately waits for them to
> > complete
> > before returning. This is fine for the existing callers, but a
> > subsequent patch will need to defer the wait in order to overlap
> > TLB
> > invalidations across multiple VMAs.
> >
> > Introduce xe_tlb_inval_range_tilemask_submit() and
> > xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
> > halves respectively. The batch of fences is carried in the new
> > xe_tlb_inval_batch structure. Remove
> > xe_vm_range_tilemask_tlb_inval()
> > and convert all three call sites to the new API.
> >
>
> Mostly nits...
>
> > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 6 +-
> > drivers/gpu/drm/xe/xe_tlb_inval.c | 82
> > +++++++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
> > drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
> > drivers/gpu/drm/xe/xe_vm.c | 69 +++------------------
> > drivers/gpu/drm/xe/xe_vm.h | 3 -
> > drivers/gpu/drm/xe/xe_vm_madvise.c | 9 ++-
> > drivers/gpu/drm/xe/xe_vm_types.h | 1 +
> > 8 files changed, 123 insertions(+), 67 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 002b6c22ad3f..6ea4972c2791 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -19,6 +19,7 @@
> > #include "xe_pt.h"
> > #include "xe_svm.h"
> > #include "xe_tile.h"
> > +#include "xe_tlb_inval.h"
> > #include "xe_ttm_vram_mgr.h"
> > #include "xe_vm.h"
> > #include "xe_vm_types.h"
> > @@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > *gpusvm,
> > const struct mmu_notifier_range
> > *mmu_range)
> > {
> > struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > + struct xe_tlb_inval_batch _batch;
> > struct xe_device *xe = vm->xe;
> > struct drm_gpusvm_range *r, *first;
> > struct xe_tile *tile;
> > @@ -276,7 +278,9 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > *gpusvm,
> >
> > xe_device_wmb(xe);
> >
> > - err = xe_vm_range_tilemask_tlb_inval(vm, adj_start,
> > adj_end, tile_mask);
> > + err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid,
> > adj_start, adj_end,
> > + tile_mask,
> > &_batch);
> > + xe_tlb_inval_batch_wait(&_batch);
>
> No need to call wait on an error but it is harmless.
>
> So you could write it like this:
>
> if (!WARN_ON_ONCE(err))
> xe_tlb_inval_batch_wait(&_batch);
Sure.
>
> > WARN_ON_ONCE(err);
> >
> > range_notifier_event_end:
> > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c
> > b/drivers/gpu/drm/xe/xe_tlb_inval.c
> > index 933f30fb617d..343e37cfe715 100644
> > --- a/drivers/gpu/drm/xe/xe_tlb_inval.c
> > +++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
> > @@ -486,3 +486,85 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval
> > *tlb_inval)
> > guard(spinlock_irq)(&tlb_inval->pending_lock);
> > return list_is_singular(&tlb_inval->pending_fences);
> > }
> > +
> > +/**
> > + * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB
> > invalidation batch
> > + * @batch: Batch of TLB invalidation fences to wait on
> > + *
> > + * Waits for every fence in @batch to signal, then resets @batch
> > so it can be
> > + * reused for a subsequent invalidation.
> > + */
> > +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
> > +{
> > + struct xe_tlb_inval_fence *fence = &batch->fence[0];
>
> Would this be better:
>
> s/&batch->fence[0]/batch->fence
>
> Personal preference I guess.
Yeah, I typically use the former to make it easier for
the reader to remember we're pointing to the first element of an array.
>
> > + unsigned int i;
> > +
> > + for (i = 0; i < batch->num_fences; ++i)
> > + xe_tlb_inval_fence_wait(fence++);
> > +
> > + batch->num_fences = 0;
> > +}
> > +
> > +/**
> > + * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations
> > for an
> > + * address range on a tile mask
> > + * @xe: The xe device
> > + * @asid: Address space ID
> > + * @start: start address
> > + * @end: end address
> > + * @tile_mask: mask for which gt's issue tlb invalidation
> > + * @batch: Batch of tlb invalidate fences
> > + *
> > + * Issue a range based TLB invalidation for gt's in tilemask
> > + *
>
> Mention no need to wait on batch if this function returns an error?
Sure.
>
> > + * Returns 0 for success, negative error code otherwise.
> > + */
> > +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32
> > asid,
> > + u64 start, u64 end, u8
> > tile_mask,
> > + struct xe_tlb_inval_batch
> > *batch)
> > +{
> > + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> > + struct xe_tile *tile;
> > + u32 fence_id = 0;
> > + u8 id;
> > + int err;
> > +
> > + batch->num_fences = 0;
> > + if (!tile_mask)
> > + return 0;
> > +
> > + for_each_tile(tile, xe, id) {
> > + if (!(tile_mask & BIT(id)))
> > + continue;
> > +
> > + xe_tlb_inval_fence_init(&tile->primary_gt-
> > >tlb_inval,
> > + &fence[fence_id], true);
> > +
> > + err = xe_tlb_inval_range(&tile->primary_gt-
> > >tlb_inval,
> > + &fence[fence_id], start,
> > end,
> > + asid, NULL);
> > + if (err)
> > + goto wait;
> > + ++fence_id;
> > +
> > + if (!tile->media_gt)
> > + continue;
> > +
> > + xe_tlb_inval_fence_init(&tile->media_gt-
> > >tlb_inval,
> > + &fence[fence_id], true);
> > +
> > + err = xe_tlb_inval_range(&tile->media_gt-
> > >tlb_inval,
> > + &fence[fence_id], start,
> > end,
> > + asid, NULL);
> > + if (err)
> > + goto wait;
> > + ++fence_id;
> > + }
> > +
> > +wait:
> > + batch->num_fences = fence_id;
>
> Should 'batch->num_fences' only get set on success?
We need it for the error wait below, after which it gets cleared.
>
> > + if (err)
> > + xe_tlb_inval_batch_wait(batch);
> > +
> > + return err;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h
> > b/drivers/gpu/drm/xe/xe_tlb_inval.h
> > index 62089254fa23..a76b7823a5f2 100644
> > --- a/drivers/gpu/drm/xe/xe_tlb_inval.h
> > +++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
> > @@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct
> > xe_tlb_inval *tlb_inval, int seqno);
> >
> > bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
> >
> > +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32
> > asid,
> > + u64 start, u64 end, u8
> > tile_mask,
> > + struct xe_tlb_inval_batch
> > *batch);
> > +
> > +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
> > +
> > #endif /* _XE_TLB_INVAL_ */
> > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > index 3b089f90f002..3d1797d186fd 100644
> > --- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > @@ -9,6 +9,8 @@
> > #include <linux/workqueue.h>
> > #include <linux/dma-fence.h>
> >
> > +#include "xe_device_types.h"
> > +
> > struct drm_suballoc;
> > struct xe_tlb_inval;
> >
> > @@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
> > ktime_t inval_time;
> > };
> >
> > +/**
> > + * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
> > + *
> > + * Holds one fence per GT covered by a TLB invalidation request.
> > + */
> > +struct xe_tlb_inval_batch {
> > + /** @fence: per-GT TLB invalidation fences */
> > + struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE *
> > XE_MAX_GT_PER_TILE];
> > + /** @num_fences: number of valid entries in @fence */
> > + unsigned int num_fences;
> > +};
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > b/drivers/gpu/drm/xe/xe_vm.c
> > index 548b0769b3ef..7f29d2b2972d 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
> > dma_resv_unlock(xe_vm_resv(vm));
> > }
> >
> > -/**
> > - * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on
> > this tilemask for an
> > - * address range
> > - * @vm: The VM
> > - * @start: start address
> > - * @end: end address
> > - * @tile_mask: mask for which gt's issue tlb invalidation
> > - *
> > - * Issue a range based TLB invalidation for gt's in tilemask
> > - *
> > - * Returns 0 for success, negative error code otherwise.
> > - */
> > -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> > - u64 end, u8 tile_mask)
> > -{
> > - struct xe_tlb_inval_fence
> > - fence[XE_MAX_TILES_PER_DEVICE *
> > XE_MAX_GT_PER_TILE];
> > - struct xe_tile *tile;
> > - u32 fence_id = 0;
> > - u8 id;
> > - int err;
> > -
> > - if (!tile_mask)
> > - return 0;
> > -
> > - for_each_tile(tile, vm->xe, id) {
> > - if (!(tile_mask & BIT(id)))
> > - continue;
> > -
> > - xe_tlb_inval_fence_init(&tile->primary_gt-
> > >tlb_inval,
> > - &fence[fence_id], true);
> > -
> > - err = xe_tlb_inval_range(&tile->primary_gt-
> > >tlb_inval,
> > - &fence[fence_id], start,
> > end,
> > - vm->usm.asid, NULL);
> > - if (err)
> > - goto wait;
> > - ++fence_id;
> > -
> > - if (!tile->media_gt)
> > - continue;
> > -
> > - xe_tlb_inval_fence_init(&tile->media_gt-
> > >tlb_inval,
> > - &fence[fence_id], true);
> > -
> > - err = xe_tlb_inval_range(&tile->media_gt-
> > >tlb_inval,
> > - &fence[fence_id], start,
> > end,
> > - vm->usm.asid, NULL);
> > - if (err)
> > - goto wait;
> > - ++fence_id;
> > - }
> > -
> > -wait:
> > - for (id = 0; id < fence_id; ++id)
> > - xe_tlb_inval_fence_wait(&fence[id]);
> > -
> > - return err;
> > -}
> > -
> > /**
> > * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without
> > a lock
> > * @vma: VMA to invalidate
> > @@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> > {
> > struct xe_device *xe = xe_vma_vm(vma)->xe;
> > struct xe_vm *vm = xe_vma_vm(vma);
> > + struct xe_tlb_inval_batch _batch;
>
> Why not just 'batch'?
>
> > struct xe_tile *tile;
> > u8 tile_mask = 0;
> > int ret = 0;
> > @@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma
> > *vma)
> >
> > xe_device_wmb(xe);
> >
> > - ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma),
> > xe_vma_start(vma),
> > - xe_vma_end(vma),
> > tile_mask);
> > + ret = xe_tlb_inval_range_tilemask_submit(xe,
> > xe_vma_vm(vma)->usm.asid,
> > +
> > xe_vma_start(vma), xe_vma_end(vma),
> > + tile_mask,
> > &_batch);
> >
> > /* WRITE_ONCE pairs with READ_ONCE in
> > xe_vm_has_valid_gpu_mapping() */
> > WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
> >
> > + if (!ret)
> > + xe_tlb_inval_batch_wait(&_batch);
> > +
>
> Here we skip the wait on error, hence my suggestion to skip waits in
> other code paths or at a minimum make call semantics consistent.
Makes sense.
>
> > return ret;
> > }
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.h
> > b/drivers/gpu/drm/xe/xe_vm.h
> > index f849e369432b..62f4b6fec0bc 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.h
> > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > @@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct
> > xe_vm *vm,
> > struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> > struct xe_svm_range *range);
> >
> > -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> > - u64 end, u8 tile_mask);
> > -
> > int xe_vm_invalidate_vma(struct xe_vma *vma);
> >
> > int xe_vm_validate_protected(struct xe_vm *vm);
> > diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c
> > b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > index 95bf53cc29e3..39717026e84f 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_madvise.c
> > +++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > @@ -12,6 +12,7 @@
> > #include "xe_pat.h"
> > #include "xe_pt.h"
> > #include "xe_svm.h"
> > +#include "xe_tlb_inval.h"
> >
> > struct xe_vmas_in_madvise_range {
> > u64 addr;
> > @@ -235,13 +236,19 @@ static u8 xe_zap_ptes_in_madvise_range(struct
> > xe_vm *vm, u64 start, u64 end)
> > static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64
> > start, u64 end)
> > {
> > u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start,
> > end);
> > + struct xe_tlb_inval_batch batch;
> > + int err;
> >
> > if (!tile_mask)
> > return 0;
> >
> > xe_device_wmb(vm->xe);
> >
> > - return xe_vm_range_tilemask_tlb_inval(vm, start, end,
> > tile_mask);
> > + err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm-
> > >usm.asid, start, end,
> > + tile_mask,
> > &batch);
> > + xe_tlb_inval_batch_wait(&batch);
>
> No need to wait on error.
Will fix
Thanks,
Thomas
>
> > +
> > + return err;
> > }
> >
> > static bool madvise_args_are_sane(struct xe_device *xe, const
> > struct drm_xe_madvise *args)
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index 1f6f7e30e751..de6544165cfa 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -18,6 +18,7 @@
> > #include "xe_device_types.h"
> > #include "xe_pt_types.h"
> > #include "xe_range_fence.h"
> > +#include "xe_tlb_inval_types.h"
> > #include "xe_userptr.h"
> >
> > struct drm_pagemap;
> > --
> > 2.53.0
> >
* Re: [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps
2026-03-02 21:29 ` Thomas Hellström
@ 2026-03-02 21:31 ` Matthew Brost
0 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 21:31 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 10:29:22PM +0100, Thomas Hellström wrote:
> On Mon, 2026-03-02 at 11:06 -0800, Matthew Brost wrote:
> > On Mon, Mar 02, 2026 at 05:32:47PM +0100, Thomas Hellström wrote:
> > > xe_vm_range_tilemask_tlb_inval() submits TLB invalidation requests
> > > to
> > > all GTs in a tile mask and then immediately waits for them to
> > > complete
> > > before returning. This is fine for the existing callers, but a
> > > subsequent patch will need to defer the wait in order to overlap
> > > TLB
> > > invalidations across multiple VMAs.
> > >
> > > Introduce xe_tlb_inval_range_tilemask_submit() and
> > > xe_tlb_inval_batch_wait() in xe_tlb_inval.c as the submit and wait
> > > halves respectively. The batch of fences is carried in the new
> > > xe_tlb_inval_batch structure. Remove
> > > xe_vm_range_tilemask_tlb_inval()
> > > and convert all three call sites to the new API.
> > >
> >
> > Mostly nits...
> >
> > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_svm.c | 6 +-
> > > drivers/gpu/drm/xe/xe_tlb_inval.c | 82
> > > +++++++++++++++++++++++++
> > > drivers/gpu/drm/xe/xe_tlb_inval.h | 6 ++
> > > drivers/gpu/drm/xe/xe_tlb_inval_types.h | 14 +++++
> > > drivers/gpu/drm/xe/xe_vm.c | 69 +++------------------
> > > drivers/gpu/drm/xe/xe_vm.h | 3 -
> > > drivers/gpu/drm/xe/xe_vm_madvise.c | 9 ++-
> > > drivers/gpu/drm/xe/xe_vm_types.h | 1 +
> > > 8 files changed, 123 insertions(+), 67 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 002b6c22ad3f..6ea4972c2791 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -19,6 +19,7 @@
> > > #include "xe_pt.h"
> > > #include "xe_svm.h"
> > > #include "xe_tile.h"
> > > +#include "xe_tlb_inval.h"
> > > #include "xe_ttm_vram_mgr.h"
> > > #include "xe_vm.h"
> > > #include "xe_vm_types.h"
> > > @@ -225,6 +226,7 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > > *gpusvm,
> > > const struct mmu_notifier_range
> > > *mmu_range)
> > > {
> > > struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > > + struct xe_tlb_inval_batch _batch;
> > > struct xe_device *xe = vm->xe;
> > > struct drm_gpusvm_range *r, *first;
> > > struct xe_tile *tile;
> > > @@ -276,7 +278,9 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > > *gpusvm,
> > >
> > > xe_device_wmb(xe);
> > >
> > > - err = xe_vm_range_tilemask_tlb_inval(vm, adj_start,
> > > adj_end, tile_mask);
> > > + err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid,
> > > adj_start, adj_end,
> > > + tile_mask,
> > > &_batch);
> > > + xe_tlb_inval_batch_wait(&_batch);
> >
> > No need to call wait on an error but it is harmless.
> >
> > So you could write it like this:
> >
> > if (!WARN_ON_ONCE(err))
> > xe_tlb_inval_batch_wait(&_batch);
>
> Sure.
>
> >
> > > WARN_ON_ONCE(err);
> > >
> > > range_notifier_event_end:
> > > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.c
> > > b/drivers/gpu/drm/xe/xe_tlb_inval.c
> > > index 933f30fb617d..343e37cfe715 100644
> > > --- a/drivers/gpu/drm/xe/xe_tlb_inval.c
> > > +++ b/drivers/gpu/drm/xe/xe_tlb_inval.c
> > > @@ -486,3 +486,85 @@ bool xe_tlb_inval_idle(struct xe_tlb_inval
> > > *tlb_inval)
> > > guard(spinlock_irq)(&tlb_inval->pending_lock);
> > > return list_is_singular(&tlb_inval->pending_fences);
> > > }
> > > +
> > > +/**
> > > + * xe_tlb_inval_batch_wait() - Wait for all fences in a TLB
> > > invalidation batch
> > > + * @batch: Batch of TLB invalidation fences to wait on
> > > + *
> > > + * Waits for every fence in @batch to signal, then resets @batch
> > > so it can be
> > > + * reused for a subsequent invalidation.
> > > + */
> > > +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch)
> > > +{
> > > + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> >
> > Would this be better:
> >
> > s/&batch->fence[0]/batch->fence
> >
> > Personal preference I guess.
>
> Yeah, I typically use the former to make it easier for
> the reader to remember we're pointing to the first element of an array.
>
Ok, fine with this. I know I have done it both ways more than once.
> >
> > > + unsigned int i;
> > > +
> > > + for (i = 0; i < batch->num_fences; ++i)
> > > + xe_tlb_inval_fence_wait(fence++);
> > > +
> > > + batch->num_fences = 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_tlb_inval_range_tilemask_submit() - Submit TLB invalidations
> > > for an
> > > + * address range on a tile mask
> > > + * @xe: The xe device
> > > + * @asid: Address space ID
> > > + * @start: start address
> > > + * @end: end address
> > > + * @tile_mask: mask for which gt's issue tlb invalidation
> > > + * @batch: Batch of tlb invalidate fences
> > > + *
> > > + * Issue a range based TLB invalidation for gt's in tilemask
> > > + *
> >
> > Mention no need to wait on batch if this function returns an error?
>
> Sure.
>
> >
> > > + * Returns 0 for success, negative error code otherwise.
> > > + */
> > > +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32
> > > asid,
> > > + u64 start, u64 end, u8
> > > tile_mask,
> > > + struct xe_tlb_inval_batch
> > > *batch)
> > > +{
> > > + struct xe_tlb_inval_fence *fence = &batch->fence[0];
> > > + struct xe_tile *tile;
> > > + u32 fence_id = 0;
> > > + u8 id;
> > > + int err;
> > > +
> > > + batch->num_fences = 0;
> > > + if (!tile_mask)
> > > + return 0;
> > > +
> > > + for_each_tile(tile, xe, id) {
> > > + if (!(tile_mask & BIT(id)))
> > > + continue;
> > > +
> > > + xe_tlb_inval_fence_init(&tile->primary_gt-
> > > >tlb_inval,
> > > + &fence[fence_id], true);
> > > +
> > > + err = xe_tlb_inval_range(&tile->primary_gt-
> > > >tlb_inval,
> > > + &fence[fence_id], start,
> > > end,
> > > + asid, NULL);
> > > + if (err)
> > > + goto wait;
> > > + ++fence_id;
> > > +
> > > + if (!tile->media_gt)
> > > + continue;
> > > +
> > > + xe_tlb_inval_fence_init(&tile->media_gt-
> > > >tlb_inval,
> > > + &fence[fence_id], true);
> > > +
> > > + err = xe_tlb_inval_range(&tile->media_gt-
> > > >tlb_inval,
> > > + &fence[fence_id], start,
> > > end,
> > > + asid, NULL);
> > > + if (err)
> > > + goto wait;
> > > + ++fence_id;
> > > + }
> > > +
> > > +wait:
> > > + batch->num_fences = fence_id;
> >
> > Should 'batch->num_fences' only get set on success?
>
> We need it for the error wait below, after which it gets cleared.
>
Right, bad suggestion.
Matt
> >
> > > + if (err)
> > > + xe_tlb_inval_batch_wait(batch);
> > > +
> > > + return err;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval.h
> > > b/drivers/gpu/drm/xe/xe_tlb_inval.h
> > > index 62089254fa23..a76b7823a5f2 100644
> > > --- a/drivers/gpu/drm/xe/xe_tlb_inval.h
> > > +++ b/drivers/gpu/drm/xe/xe_tlb_inval.h
> > > @@ -45,4 +45,10 @@ void xe_tlb_inval_done_handler(struct
> > > xe_tlb_inval *tlb_inval, int seqno);
> > >
> > > bool xe_tlb_inval_idle(struct xe_tlb_inval *tlb_inval);
> > >
> > > +int xe_tlb_inval_range_tilemask_submit(struct xe_device *xe, u32
> > > asid,
> > > + u64 start, u64 end, u8
> > > tile_mask,
> > > + struct xe_tlb_inval_batch
> > > *batch);
> > > +
> > > +void xe_tlb_inval_batch_wait(struct xe_tlb_inval_batch *batch);
> > > +
> > > #endif /* _XE_TLB_INVAL_ */
> > > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > > b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > > index 3b089f90f002..3d1797d186fd 100644
> > > --- a/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_types.h
> > > @@ -9,6 +9,8 @@
> > > #include <linux/workqueue.h>
> > > #include <linux/dma-fence.h>
> > >
> > > +#include "xe_device_types.h"
> > > +
> > > struct drm_suballoc;
> > > struct xe_tlb_inval;
> > >
> > > @@ -132,4 +134,16 @@ struct xe_tlb_inval_fence {
> > > ktime_t inval_time;
> > > };
> > >
> > > +/**
> > > + * struct xe_tlb_inval_batch - Batch of TLB invalidation fences
> > > + *
> > > + * Holds one fence per GT covered by a TLB invalidation request.
> > > + */
> > > +struct xe_tlb_inval_batch {
> > > + /** @fence: per-GT TLB invalidation fences */
> > > + struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE *
> > > XE_MAX_GT_PER_TILE];
> > > + /** @num_fences: number of valid entries in @fence */
> > > + unsigned int num_fences;
> > > +};
> > > +
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > b/drivers/gpu/drm/xe/xe_vm.c
> > > index 548b0769b3ef..7f29d2b2972d 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -3966,66 +3966,6 @@ void xe_vm_unlock(struct xe_vm *vm)
> > > dma_resv_unlock(xe_vm_resv(vm));
> > > }
> > >
> > > -/**
> > > - * xe_vm_range_tilemask_tlb_inval - Issue a TLB invalidation on
> > > this tilemask for an
> > > - * address range
> > > - * @vm: The VM
> > > - * @start: start address
> > > - * @end: end address
> > > - * @tile_mask: mask for which gt's issue tlb invalidation
> > > - *
> > > - * Issue a range based TLB invalidation for gt's in tilemask
> > > - *
> > > - * Returns 0 for success, negative error code otherwise.
> > > - */
> > > -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> > > - u64 end, u8 tile_mask)
> > > -{
> > > - struct xe_tlb_inval_fence
> > > - fence[XE_MAX_TILES_PER_DEVICE *
> > > XE_MAX_GT_PER_TILE];
> > > - struct xe_tile *tile;
> > > - u32 fence_id = 0;
> > > - u8 id;
> > > - int err;
> > > -
> > > - if (!tile_mask)
> > > - return 0;
> > > -
> > > - for_each_tile(tile, vm->xe, id) {
> > > - if (!(tile_mask & BIT(id)))
> > > - continue;
> > > -
> > > - xe_tlb_inval_fence_init(&tile->primary_gt-
> > > >tlb_inval,
> > > - &fence[fence_id], true);
> > > -
> > > - err = xe_tlb_inval_range(&tile->primary_gt-
> > > >tlb_inval,
> > > - &fence[fence_id], start,
> > > end,
> > > - vm->usm.asid, NULL);
> > > - if (err)
> > > - goto wait;
> > > - ++fence_id;
> > > -
> > > - if (!tile->media_gt)
> > > - continue;
> > > -
> > > - xe_tlb_inval_fence_init(&tile->media_gt-
> > > >tlb_inval,
> > > - &fence[fence_id], true);
> > > -
> > > - err = xe_tlb_inval_range(&tile->media_gt-
> > > >tlb_inval,
> > > - &fence[fence_id], start,
> > > end,
> > > - vm->usm.asid, NULL);
> > > - if (err)
> > > - goto wait;
> > > - ++fence_id;
> > > - }
> > > -
> > > -wait:
> > > - for (id = 0; id < fence_id; ++id)
> > > - xe_tlb_inval_fence_wait(&fence[id]);
> > > -
> > > - return err;
> > > -}
> > > -
> > > /**
> > > * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without
> > > a lock
> > > * @vma: VMA to invalidate
> > > @@ -4040,6 +3980,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> > > {
> > > struct xe_device *xe = xe_vma_vm(vma)->xe;
> > > struct xe_vm *vm = xe_vma_vm(vma);
> > > + struct xe_tlb_inval_batch _batch;
> >
> > Why not just 'batch'?
> >
> > > struct xe_tile *tile;
> > > u8 tile_mask = 0;
> > > int ret = 0;
> > > @@ -4080,12 +4021,16 @@ int xe_vm_invalidate_vma(struct xe_vma
> > > *vma)
> > >
> > > xe_device_wmb(xe);
> > >
> > > - ret = xe_vm_range_tilemask_tlb_inval(xe_vma_vm(vma),
> > > xe_vma_start(vma),
> > > - xe_vma_end(vma),
> > > tile_mask);
> > > + ret = xe_tlb_inval_range_tilemask_submit(xe,
> > > xe_vma_vm(vma)->usm.asid,
> > > +
> > > xe_vma_start(vma), xe_vma_end(vma),
> > > + tile_mask,
> > > &_batch);
> > >
> > > /* WRITE_ONCE pairs with READ_ONCE in
> > > xe_vm_has_valid_gpu_mapping() */
> > > WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
> > >
> > > + if (!ret)
> > > + xe_tlb_inval_batch_wait(&_batch);
> > > +
> >
> > Here we skip the wait on error, hence my suggestion to skip waits in
> > other code paths or at a minimum make call semantics consistent.
>
> Makes sense.
>
> >
> > > return ret;
> > > }
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.h
> > > b/drivers/gpu/drm/xe/xe_vm.h
> > > index f849e369432b..62f4b6fec0bc 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > > @@ -240,9 +240,6 @@ struct dma_fence *xe_vm_range_rebind(struct
> > > xe_vm *vm,
> > > struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> > > struct xe_svm_range *range);
> > >
> > > -int xe_vm_range_tilemask_tlb_inval(struct xe_vm *vm, u64 start,
> > > - u64 end, u8 tile_mask);
> > > -
> > > int xe_vm_invalidate_vma(struct xe_vma *vma);
> > >
> > > int xe_vm_validate_protected(struct xe_vm *vm);
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > index 95bf53cc29e3..39717026e84f 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > @@ -12,6 +12,7 @@
> > > #include "xe_pat.h"
> > > #include "xe_pt.h"
> > > #include "xe_svm.h"
> > > +#include "xe_tlb_inval.h"
> > >
> > > struct xe_vmas_in_madvise_range {
> > > u64 addr;
> > > @@ -235,13 +236,19 @@ static u8 xe_zap_ptes_in_madvise_range(struct
> > > xe_vm *vm, u64 start, u64 end)
> > > static int xe_vm_invalidate_madvise_range(struct xe_vm *vm, u64
> > > start, u64 end)
> > > {
> > > u8 tile_mask = xe_zap_ptes_in_madvise_range(vm, start,
> > > end);
> > > + struct xe_tlb_inval_batch batch;
> > > + int err;
> > >
> > > if (!tile_mask)
> > > return 0;
> > >
> > > xe_device_wmb(vm->xe);
> > >
> > > - return xe_vm_range_tilemask_tlb_inval(vm, start, end,
> > > tile_mask);
> > > + err = xe_tlb_inval_range_tilemask_submit(vm->xe, vm-
> > > >usm.asid, start, end,
> > > + tile_mask,
> > > &batch);
> > > + xe_tlb_inval_batch_wait(&batch);
> >
> > No need to wait on error.
>
> Will fix
>
> Thanks,
> Thomas
>
>
>
> >
> > > +
> > > + return err;
> > > }
> > >
> > > static bool madvise_args_are_sane(struct xe_device *xe, const
> > > struct drm_xe_madvise *args)
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > > b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 1f6f7e30e751..de6544165cfa 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -18,6 +18,7 @@
> > > #include "xe_device_types.h"
> > > #include "xe_pt_types.h"
> > > #include "xe_range_fence.h"
> > > +#include "xe_tlb_inval_types.h"
> > > #include "xe_userptr.h"
> > >
> > > struct drm_pagemap;
> > > --
> > > 2.53.0
> > >
* Claude review: drm/xe: Split TLB invalidation into submit and wait steps
2026-03-02 16:32 ` [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
2026-03-02 19:06 ` Matthew Brost
@ 2026-03-03 3:05 ` Claude Code Review Bot
1 sibling, 0 replies; 21+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is a clean refactor that extracts the submit-and-wait logic from `xe_vm_range_tilemask_tlb_inval` into two separate functions in `xe_tlb_inval.c`.
**`xe_tlb_inval_batch` struct size concern:**
```c
struct xe_tlb_inval_batch {
struct xe_tlb_inval_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
unsigned int num_fences;
};
```
With `XE_MAX_TILES_PER_DEVICE=2` and `XE_MAX_GT_PER_TILE=2`, this is 4 `xe_tlb_inval_fence` structs. Each contains a `struct dma_fence` (~80 bytes) plus overhead, so the batch is roughly ~400-500 bytes. This was already the case in the original `xe_vm_range_tilemask_tlb_inval` (which had `fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE]` on the stack), so it's not a regression. However, in patch 4 this struct gets embedded in `xe_userptr`, which means every userptr VMA now carries this overhead permanently. Worth noting but probably acceptable.
**Header include in `xe_tlb_inval_types.h`:**
```c
+#include "xe_device_types.h"
```
This pulls in the full device types header just for `XE_MAX_TILES_PER_DEVICE` and `XE_MAX_GT_PER_TILE`. Consider whether these constants could be defined in a smaller header or if `xe_device.h` (which has the `#define XE_MAX_GT_PER_TILE 2`) would be lighter-weight. Types headers generally try to minimize includes to avoid circular dependency issues.
**Error handling in `xe_svm_invalidate` caller:**
```c
err = xe_tlb_inval_range_tilemask_submit(xe, vm->usm.asid, adj_start, adj_end,
tile_mask, &_batch);
xe_tlb_inval_batch_wait(&_batch);
WARN_ON_ONCE(err);
```
On error, `xe_tlb_inval_range_tilemask_submit` already calls `xe_tlb_inval_batch_wait` internally (the `goto wait` path sets `num_fences` and waits). Then the caller calls `xe_tlb_inval_batch_wait` again, but `num_fences` is 0 after the internal wait, so it's a no-op. This is harmless but slightly confusing - consider documenting that on error the batch is already waited/cleaned up.
**`xe_vm_invalidate_vma` behavior change:**
The original code called `xe_vm_range_tilemask_tlb_inval` which waited on error. The new code:
```c
ret = xe_tlb_inval_range_tilemask_submit(xe, ..., &_batch);
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
if (!ret)
xe_tlb_inval_batch_wait(&_batch);
```
The `WRITE_ONCE(vma->tile_invalidated, vma->tile_mask)` now happens before the wait completes. In the original code it happened after the wait (since `xe_vm_range_tilemask_tlb_inval` blocked). This reordering seems intentional and is maintained in patch 4, but it means `tile_invalidated` is set before the TLB flush completes. Is this semantically correct? The pairing with `READ_ONCE` in `xe_vm_has_valid_gpu_mapping()` suggests this flag is read to check if invalidation was *initiated*, not *completed*, so it's likely fine.
**Naming: `_batch` with underscore prefix:**
The local variables in `xe_svm_invalidate` and `xe_vm_invalidate_vma` use `_batch` with a leading underscore. In kernel style, leading underscores on local variables are unusual and typically reserved for function/macro names. Consider just `batch` (as used in `xe_vm_invalidate_madvise_range`).
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-02 16:32 [PATCH v2 0/4] Two-pass MMU interval notifiers Thomas Hellström
` (2 preceding siblings ...)
2026-03-02 16:32 ` [PATCH v2 3/4] drm/xe: Split TLB invalidation into submit and wait steps Thomas Hellström
@ 2026-03-02 16:32 ` Thomas Hellström
2026-03-02 19:14 ` Matthew Brost
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
2026-03-03 3:05 ` Claude review: Two-pass MMU interval notifiers Claude Code Review Bot
4 siblings, 2 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 16:32 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Matthew Brost, Christian König,
dri-devel, Jason Gunthorpe, Andrew Morton, Simona Vetter,
Dave Airlie, Alistair Popple, linux-mm, linux-kernel
Now that the two-pass notifier flow uses xe_vma_userptr_do_inval() for
the fence-wait + TLB-invalidate work, extend it to support a further
deferred TLB wait:
- xe_vma_userptr_do_inval(): when the embedded finish handle is free,
submit the TLB invalidation asynchronously (xe_vm_invalidate_vma_submit)
and return &userptr->finish so the mmu_notifier core schedules a third
pass. When the handle is occupied by a concurrent invalidation, fall
back to the synchronous xe_vm_invalidate_vma() path.
- xe_vma_userptr_complete_tlb_inval(): new helper called from
invalidate_finish when tlb_inval_submitted is set. Waits for the
previously submitted batch and unmaps the gpusvm pages.
xe_vma_userptr_invalidate_finish() dispatches between the two helpers
via tlb_inval_submitted, making the three possible flows explicit:
pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
pass1 (fences done) -> do_inval -> invalidate_finish
-> complete_tlb_inval (deferred TLB)
pass1 (finish occupied) -> do_inval (sync TLB, inline)
In multi-GPU scenarios this allows TLB flushes to be submitted on all
GPUs in one pass before any of them are waited on.
Also adds xe_vm_invalidate_vma_submit() which submits the TLB range
invalidation without blocking, populating a xe_tlb_inval_batch that
the caller waits on separately.
Assisted-by: GitHub Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_userptr.c | 60 +++++++++++++++++++++++++++------
drivers/gpu/drm/xe/xe_userptr.h | 18 ++++++++++
drivers/gpu/drm/xe/xe_vm.c | 38 ++++++++++++++++-----
drivers/gpu/drm/xe/xe_vm.h | 2 ++
4 files changed, 99 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index 440b0a79d16f..a62b796afb93 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -8,6 +8,7 @@
#include <linux/mm.h>
+#include "xe_tlb_inval.h"
#include "xe_trace_bo.h"
/**
@@ -73,8 +74,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
&ctx);
}
-static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
- bool is_deferred)
+static struct mmu_interval_notifier_finish *
+xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma, bool is_deferred)
{
struct xe_userptr *userptr = &uvma->userptr;
struct xe_vma *vma = &uvma->vma;
@@ -84,12 +85,23 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
};
long err;
- err = dma_resv_wait_timeout(xe_vm_resv(vm),
- DMA_RESV_USAGE_BOOKKEEP,
+ err = dma_resv_wait_timeout(xe_vm_resv(vm), DMA_RESV_USAGE_BOOKKEEP,
false, MAX_SCHEDULE_TIMEOUT);
XE_WARN_ON(err <= 0);
if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+ if (!userptr->finish_inuse) {
+ /*
+ * Defer the TLB wait to an extra pass so the caller
+ * can pipeline TLB flushes across GPUs before waiting
+ * on any of them.
+ */
+ userptr->finish_inuse = true;
+ userptr->tlb_inval_submitted = true;
+ err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
+ XE_WARN_ON(err);
+ return &userptr->finish;
+ }
err = xe_vm_invalidate_vma(vma);
XE_WARN_ON(err);
}
@@ -98,6 +110,24 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
userptr->finish_inuse = false;
drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
+ return NULL;
+}
+
+static void
+xe_vma_userptr_complete_tlb_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vma *vma = &uvma->vma;
+ struct drm_gpusvm_ctx ctx = {
+ .in_notifier = true,
+ .read_only = xe_vma_read_only(vma),
+ };
+
+ xe_tlb_inval_batch_wait(&userptr->inval_batch);
+ userptr->tlb_inval_submitted = false;
+ userptr->finish_inuse = false;
+ drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
+ xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
}
static struct mmu_interval_notifier_finish *
@@ -141,13 +171,11 @@ xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
* If it's already in use, or all fences are already signaled,
* proceed directly to invalidation without deferring.
*/
- if (signaled || userptr->finish_inuse) {
- xe_vma_userptr_do_inval(vm, uvma, false);
- return NULL;
- }
+ if (signaled || userptr->finish_inuse)
+ return xe_vma_userptr_do_inval(vm, uvma, false);
+ /* Defer: the notifier core will call invalidate_finish once done. */
userptr->finish_inuse = true;
-
return &userptr->finish;
}
@@ -193,7 +221,15 @@ static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish
xe_vma_start(vma), xe_vma_size(vma));
down_write(&vm->svm.gpusvm.notifier_lock);
- xe_vma_userptr_do_inval(vm, uvma, true);
+ /*
+ * If a TLB invalidation was previously submitted (deferred from the
+ * synchronous pass1 fallback), wait for it and unmap pages.
+ * Otherwise, fences have now completed: invalidate the TLB and unmap.
+ */
+ if (uvma->userptr.tlb_inval_submitted)
+ xe_vma_userptr_complete_tlb_inval(vm, uvma);
+ else
+ xe_vma_userptr_do_inval(vm, uvma, true);
up_write(&vm->svm.gpusvm.notifier_lock);
trace_xe_vma_userptr_invalidate_complete(vma);
}
@@ -231,7 +267,9 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
if (finish)
- xe_vma_userptr_do_inval(vm, uvma, true);
+ finish = xe_vma_userptr_do_inval(vm, uvma, true);
+ if (finish)
+ xe_vma_userptr_complete_tlb_inval(vm, uvma);
}
#endif
diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
index 4f42db61fd62..7477009651c2 100644
--- a/drivers/gpu/drm/xe/xe_userptr.h
+++ b/drivers/gpu/drm/xe/xe_userptr.h
@@ -14,6 +14,8 @@
#include <drm/drm_gpusvm.h>
+#include "xe_tlb_inval_types.h"
+
struct xe_vm;
struct xe_vma;
struct xe_userptr_vma;
@@ -63,6 +65,15 @@ struct xe_userptr {
* Protected by @vm::svm.gpusvm.notifier_lock.
*/
struct mmu_interval_notifier_finish finish;
+
+ /**
+ * @inval_batch: TLB invalidation batch for deferred completion.
+ * Stores an in-flight TLB invalidation submitted during a two-pass
+ * notifier so the wait can be deferred to a subsequent pass, allowing
+ * multiple GPUs to be signalled before any of them are waited on.
+ * Protected by @vm::svm.gpusvm.notifier_lock.
+ */
+ struct xe_tlb_inval_batch inval_batch;
/**
* @finish_inuse: Whether @finish is currently in use by an in-progress
* two-pass invalidation.
@@ -70,6 +81,13 @@ struct xe_userptr {
*/
bool finish_inuse;
+ /**
+ * @tlb_inval_submitted: Whether a TLB invalidation has been submitted
+ * via @inval_batch and is pending completion. When set, the next pass
+ * must call xe_tlb_inval_batch_wait() before reusing @inval_batch.
+ * Protected by @vm::svm.gpusvm.notifier_lock.
+ */
+ bool tlb_inval_submitted;
/**
* @initial_bind: user pointer has been bound at least once.
* write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 7f29d2b2972d..fdad9329dfb4 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3967,20 +3967,23 @@ void xe_vm_unlock(struct xe_vm *vm)
}
/**
- * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
+ * xe_vm_invalidate_vma_submit - Submit a job to invalidate GPU mappings for
+ * VMA.
* @vma: VMA to invalidate
+ * @batch: TLB invalidation batch to populate; caller must later call
+ * xe_tlb_inval_batch_wait() on it to wait for completion
*
* Walks a list of page tables leaves which it memset the entries owned by this
- * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
- * complete.
+ * VMA to zero, invalidates the TLBs, but doesn't block waiting for TLB flush
+ * to complete, but instead populates @batch which can be waited on using
+ * xe_tlb_inval_batch_wait().
*
* Returns 0 for success, negative error code otherwise.
*/
-int xe_vm_invalidate_vma(struct xe_vma *vma)
+int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch)
{
struct xe_device *xe = xe_vma_vm(vma)->xe;
struct xe_vm *vm = xe_vma_vm(vma);
- struct xe_tlb_inval_batch _batch;
struct xe_tile *tile;
u8 tile_mask = 0;
int ret = 0;
@@ -4023,14 +4026,33 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
xe_vma_start(vma), xe_vma_end(vma),
- tile_mask, &_batch);
+ tile_mask, batch);
/* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
+ return ret;
+}
+
+/**
+ * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
+ * @vma: VMA to invalidate
+ *
+ * Walks a list of page tables leaves which it memset the entries owned by this
+ * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
+ * complete.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_vm_invalidate_vma(struct xe_vma *vma)
+{
+ struct xe_tlb_inval_batch batch;
+ int ret;
- if (!ret)
- xe_tlb_inval_batch_wait(&_batch);
+ ret = xe_vm_invalidate_vma_submit(vma, &batch);
+ if (ret)
+ return ret;
+ xe_tlb_inval_batch_wait(&batch);
return ret;
}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 62f4b6fec0bc..0bc7ed23eeae 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -242,6 +242,8 @@ struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
int xe_vm_invalidate_vma(struct xe_vma *vma);
+int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch);
+
int xe_vm_validate_protected(struct xe_vm *vm);
static inline void xe_vm_queue_rebind_worker(struct xe_vm *vm)
--
2.53.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
@ 2026-03-02 19:14 ` Matthew Brost
2026-03-02 21:33 ` Thomas Hellström
2026-03-03 3:05 ` Claude review: " Claude Code Review Bot
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Brost @ 2026-03-02 19:14 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 05:32:48PM +0100, Thomas Hellström wrote:
> Now that the two-pass notifier flow uses xe_vma_userptr_do_inval() for
> the fence-wait + TLB-invalidate work, extend it to support a further
> deferred TLB wait:
>
> - xe_vma_userptr_do_inval(): when the embedded finish handle is free,
> submit the TLB invalidation asynchronously (xe_vm_invalidate_vma_submit)
> and return &userptr->finish so the mmu_notifier core schedules a third
> pass. When the handle is occupied by a concurrent invalidation, fall
> back to the synchronous xe_vm_invalidate_vma() path.
>
> - xe_vma_userptr_complete_tlb_inval(): new helper called from
> invalidate_finish when tlb_inval_submitted is set. Waits for the
> previously submitted batch and unmaps the gpusvm pages.
>
> xe_vma_userptr_invalidate_finish() dispatches between the two helpers
> via tlb_inval_submitted, making the three possible flows explicit:
>
> pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
> pass1 (fences done) -> do_inval -> invalidate_finish
> -> complete_tlb_inval (deferred TLB)
> pass1 (finish occupied) -> do_inval (sync TLB, inline)
>
> In multi-GPU scenarios this allows TLB flushes to be submitted on all
> GPUs in one pass before any of them are waited on.
>
> Also adds xe_vm_invalidate_vma_submit() which submits the TLB range
> invalidation without blocking, populating a xe_tlb_inval_batch that
> the caller waits on separately.
>
As suggested in patch #2, maybe squash this into patch #2 as some of
patch #2 is immediately tweaked / rewritten here.
A couple nits.
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_userptr.c | 60 +++++++++++++++++++++++++++------
> drivers/gpu/drm/xe/xe_userptr.h | 18 ++++++++++
> drivers/gpu/drm/xe/xe_vm.c | 38 ++++++++++++++++-----
> drivers/gpu/drm/xe/xe_vm.h | 2 ++
> 4 files changed, 99 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> index 440b0a79d16f..a62b796afb93 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -8,6 +8,7 @@
>
> #include <linux/mm.h>
>
> +#include "xe_tlb_inval.h"
> #include "xe_trace_bo.h"
>
> /**
> @@ -73,8 +74,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> &ctx);
> }
>
> -static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma,
> - bool is_deferred)
> +static struct mmu_interval_notifier_finish *
> +xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma, bool is_deferred)
> {
> struct xe_userptr *userptr = &uvma->userptr;
> struct xe_vma *vma = &uvma->vma;
> @@ -84,12 +85,23 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
> };
> long err;
>
> - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> - DMA_RESV_USAGE_BOOKKEEP,
> + err = dma_resv_wait_timeout(xe_vm_resv(vm), DMA_RESV_USAGE_BOOKKEEP,
> false, MAX_SCHEDULE_TIMEOUT);
Unrelated.
> XE_WARN_ON(err <= 0);
>
> if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> + if (!userptr->finish_inuse) {
Since this is state machiney - should we have asserts on state? That's
typically the approach I take when I write state machiney code. Self
documenting plus immediately catches misuse.
So here an example would be:
xe_assert(.., !userptr->tlb_inval_submitted);
> + /*
> + * Defer the TLB wait to an extra pass so the caller
> + * can pipeline TLB flushes across GPUs before waiting
> + * on any of them.
> + */
> + userptr->finish_inuse = true;
> + userptr->tlb_inval_submitted = true;
> + err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
> + XE_WARN_ON(err);
> + return &userptr->finish;
> + }
> err = xe_vm_invalidate_vma(vma);
> XE_WARN_ON(err);
> }
> @@ -98,6 +110,24 @@ static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma *uvm
> userptr->finish_inuse = false;
> drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> + return NULL;
> +}
> +
> +static void
> +xe_vma_userptr_complete_tlb_inval(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> +{
> + struct xe_userptr *userptr = &uvma->userptr;
> + struct xe_vma *vma = &uvma->vma;
> + struct drm_gpusvm_ctx ctx = {
> + .in_notifier = true,
> + .read_only = xe_vma_read_only(vma),
> + };
> +
xe_svm_assert_in_notifier();
State machine asserts could be:
xe_assert(..., userptr->tlb_inval_submitted);
xe_assert(..., userptr->finish_inuse);
> + xe_tlb_inval_batch_wait(&userptr->inval_batch);
> + userptr->tlb_inval_submitted = false;
> + userptr->finish_inuse = false;
> + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma->userptr.pages,
> + xe_vma_size(vma) >> PAGE_SHIFT, &ctx);
> }
>
> static struct mmu_interval_notifier_finish *
> @@ -141,13 +171,11 @@ xe_vma_userptr_invalidate_pass1(struct xe_vm *vm, struct xe_userptr_vma *uvma)
> * If it's already in use, or all fences are already signaled,
> * proceed directly to invalidation without deferring.
> */
> - if (signaled || userptr->finish_inuse) {
> - xe_vma_userptr_do_inval(vm, uvma, false);
> - return NULL;
> - }
> + if (signaled || userptr->finish_inuse)
> + return xe_vma_userptr_do_inval(vm, uvma, false);
>
> + /* Defer: the notifier core will call invalidate_finish once done. */
> userptr->finish_inuse = true;
> -
Unrelated.
> return &userptr->finish;
> }
>
> @@ -193,7 +221,15 @@ static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish
> xe_vma_start(vma), xe_vma_size(vma));
>
> down_write(&vm->svm.gpusvm.notifier_lock);
> - xe_vma_userptr_do_inval(vm, uvma, true);
> + /*
> + * If a TLB invalidation was previously submitted (deferred from the
> + * synchronous pass1 fallback), wait for it and unmap pages.
> + * Otherwise, fences have now completed: invalidate the TLB and unmap.
> + */
> + if (uvma->userptr.tlb_inval_submitted)
> + xe_vma_userptr_complete_tlb_inval(vm, uvma);
> + else
> + xe_vma_userptr_do_inval(vm, uvma, true);
> up_write(&vm->svm.gpusvm.notifier_lock);
> trace_xe_vma_userptr_invalidate_complete(vma);
> }
> @@ -231,7 +267,9 @@ void xe_vma_userptr_force_invalidate(struct xe_userptr_vma *uvma)
>
> finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> if (finish)
> - xe_vma_userptr_do_inval(vm, uvma, true);
> + finish = xe_vma_userptr_do_inval(vm, uvma, true);
> + if (finish)
> + xe_vma_userptr_complete_tlb_inval(vm, uvma);
> }
> #endif
>
> diff --git a/drivers/gpu/drm/xe/xe_userptr.h b/drivers/gpu/drm/xe/xe_userptr.h
> index 4f42db61fd62..7477009651c2 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.h
> +++ b/drivers/gpu/drm/xe/xe_userptr.h
> @@ -14,6 +14,8 @@
>
> #include <drm/drm_gpusvm.h>
>
> +#include "xe_tlb_inval_types.h"
> +
> struct xe_vm;
> struct xe_vma;
> struct xe_userptr_vma;
> @@ -63,6 +65,15 @@ struct xe_userptr {
> * Protected by @vm::svm.gpusvm.notifier_lock.
> */
> struct mmu_interval_notifier_finish finish;
> +
> + /**
> + * @inval_batch: TLB invalidation batch for deferred completion.
> + * Stores an in-flight TLB invalidation submitted during a two-pass
> + * notifier so the wait can be deferred to a subsequent pass, allowing
> + * multiple GPUs to be signalled before any of them are waited on.
> + * Protected by @vm::svm.gpusvm.notifier_lock.
In write mode?
> + */
> + struct xe_tlb_inval_batch inval_batch;
> /**
> * @finish_inuse: Whether @finish is currently in use by an in-progress
> * two-pass invalidation.
> @@ -70,6 +81,13 @@ struct xe_userptr {
> */
> bool finish_inuse;
>
> + /**
> + * @tlb_inval_submitted: Whether a TLB invalidation has been submitted
> + * via @inval_batch and is pending completion. When set, the next pass
> + * must call xe_tlb_inval_batch_wait() before reusing @inval_batch.
> + * Protected by @vm::svm.gpusvm.notifier_lock.
In write mode?
Matt
> + */
> + bool tlb_inval_submitted;
> /**
> * @initial_bind: user pointer has been bound at least once.
> * write: vm->svm.gpusvm.notifier_lock in read mode and vm->resv held.
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7f29d2b2972d..fdad9329dfb4 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -3967,20 +3967,23 @@ void xe_vm_unlock(struct xe_vm *vm)
> }
>
> /**
> - * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> + * xe_vm_invalidate_vma_submit - Submit a job to invalidate GPU mappings for
> + * VMA.
> * @vma: VMA to invalidate
> + * @batch: TLB invalidation batch to populate; caller must later call
> + * xe_tlb_inval_batch_wait() on it to wait for completion
> *
> * Walks a list of page tables leaves which it memset the entries owned by this
> - * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
> - * complete.
> + * VMA to zero, invalidates the TLBs, but doesn't block waiting for TLB flush
> + * to complete, but instead populates @batch which can be waited on using
> + * xe_tlb_inval_batch_wait().
> *
> * Returns 0 for success, negative error code otherwise.
> */
> -int xe_vm_invalidate_vma(struct xe_vma *vma)
> +int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch)
> {
> struct xe_device *xe = xe_vma_vm(vma)->xe;
> struct xe_vm *vm = xe_vma_vm(vma);
> - struct xe_tlb_inval_batch _batch;
> struct xe_tile *tile;
> u8 tile_mask = 0;
> int ret = 0;
> @@ -4023,14 +4026,33 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
>
> ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
> xe_vma_start(vma), xe_vma_end(vma),
> - tile_mask, &_batch);
> + tile_mask, batch);
>
> /* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
> WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
> + return ret;
> +}
> +
> +/**
> + * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> + * @vma: VMA to invalidate
> + *
> + * Walks a list of page tables leaves which it memset the entries owned by this
> + * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
> + * complete.
> + *
> + * Returns 0 for success, negative error code otherwise.
> + */
> +int xe_vm_invalidate_vma(struct xe_vma *vma)
> +{
> + struct xe_tlb_inval_batch batch;
> + int ret;
>
> - if (!ret)
> - xe_tlb_inval_batch_wait(&_batch);
> + ret = xe_vm_invalidate_vma_submit(vma, &batch);
> + if (ret)
> + return ret;
>
> + xe_tlb_inval_batch_wait(&batch);
> return ret;
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 62f4b6fec0bc..0bc7ed23eeae 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -242,6 +242,8 @@ struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
>
> int xe_vm_invalidate_vma(struct xe_vma *vma);
>
> +int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch);
> +
> int xe_vm_validate_protected(struct xe_vm *vm);
>
> static inline void xe_vm_queue_rebind_worker(struct xe_vm *vm)
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-02 19:14 ` Matthew Brost
@ 2026-03-02 21:33 ` Thomas Hellström
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Hellström @ 2026-03-02 21:33 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
Andrew Morton, Simona Vetter, Dave Airlie, Alistair Popple,
linux-mm, linux-kernel
On Mon, 2026-03-02 at 11:14 -0800, Matthew Brost wrote:
> On Mon, Mar 02, 2026 at 05:32:48PM +0100, Thomas Hellström wrote:
> > Now that the two-pass notifier flow uses xe_vma_userptr_do_inval()
> > for
> > the fence-wait + TLB-invalidate work, extend it to support a
> > further
> > deferred TLB wait:
> >
> > - xe_vma_userptr_do_inval(): when the embedded finish handle is
> > free,
> > submit the TLB invalidation asynchronously
> > (xe_vm_invalidate_vma_submit)
> > and return &userptr->finish so the mmu_notifier core schedules a
> > third
> > pass. When the handle is occupied by a concurrent invalidation,
> > fall
> > back to the synchronous xe_vm_invalidate_vma() path.
> >
> > - xe_vma_userptr_complete_tlb_inval(): new helper called from
> > invalidate_finish when tlb_inval_submitted is set. Waits for the
> > previously submitted batch and unmaps the gpusvm pages.
> >
> > xe_vma_userptr_invalidate_finish() dispatches between the two
> > helpers
> > via tlb_inval_submitted, making the three possible flows explicit:
> >
> > pass1 (fences pending) -> invalidate_finish -> do_inval (sync
> > TLB)
> > pass1 (fences done) -> do_inval -> invalidate_finish
> > -> complete_tlb_inval (deferred TLB)
> > pass1 (finish occupied) -> do_inval (sync TLB, inline)
> >
> > In multi-GPU scenarios this allows TLB flushes to be submitted on
> > all
> > GPUs in one pass before any of them are waited on.
> >
> > Also adds xe_vm_invalidate_vma_submit() which submits the TLB range
> > invalidation without blocking, populating a xe_tlb_inval_batch that
> > the caller waits on separately.
> >
>
> As suggested in patch #2, maybe squash this into patch #2 as some of
> patch #2 is immediately tweaked / rewritten here.
>
> A couple nits.
>
> > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_userptr.c | 60 +++++++++++++++++++++++++++--
> > ----
> > drivers/gpu/drm/xe/xe_userptr.h | 18 ++++++++++
> > drivers/gpu/drm/xe/xe_vm.c | 38 ++++++++++++++++-----
> > drivers/gpu/drm/xe/xe_vm.h | 2 ++
> > 4 files changed, 99 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > b/drivers/gpu/drm/xe/xe_userptr.c
> > index 440b0a79d16f..a62b796afb93 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -8,6 +8,7 @@
> >
> > #include <linux/mm.h>
> >
> > +#include "xe_tlb_inval.h"
> > #include "xe_trace_bo.h"
> >
> > /**
> > @@ -73,8 +74,8 @@ int xe_vma_userptr_pin_pages(struct
> > xe_userptr_vma *uvma)
> > &ctx);
> > }
> >
> > -static void xe_vma_userptr_do_inval(struct xe_vm *vm, struct
> > xe_userptr_vma *uvma,
> > - bool is_deferred)
> > +static struct mmu_interval_notifier_finish *
> > +xe_vma_userptr_do_inval(struct xe_vm *vm, struct xe_userptr_vma
> > *uvma, bool is_deferred)
> > {
> > struct xe_userptr *userptr = &uvma->userptr;
> > struct xe_vma *vma = &uvma->vma;
> > @@ -84,12 +85,23 @@ static void xe_vma_userptr_do_inval(struct
> > xe_vm *vm, struct xe_userptr_vma *uvm
> > };
> > long err;
> >
> > - err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > - DMA_RESV_USAGE_BOOKKEEP,
> > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > DMA_RESV_USAGE_BOOKKEEP,
> > false, MAX_SCHEDULE_TIMEOUT);
>
> Unrelated.
Right, will fix.
>
> > XE_WARN_ON(err <= 0);
> >
> > if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > + if (!userptr->finish_inuse) {
>
> Since this is state machiney - should we have asserts on state? That's
> typically the approach I take when I write state machiney code. Self
> documenting plus immediately catches misuse.
>
> So here an example would be:
>
> xe_assert(.., !userptr->tlb_inval_submitted);
Sure. Can take a look at that to see how it turns out.
>
> > + /*
> > + * Defer the TLB wait to an extra pass so
> > the caller
> > + * can pipeline TLB flushes across GPUs
> > before waiting
> > + * on any of them.
> > + */
> > + userptr->finish_inuse = true;
> > + userptr->tlb_inval_submitted = true;
> > + err = xe_vm_invalidate_vma_submit(vma,
> > &userptr->inval_batch);
> > + XE_WARN_ON(err);
> > + return &userptr->finish;
> > + }
> > err = xe_vm_invalidate_vma(vma);
> > XE_WARN_ON(err);
> > }
> > @@ -98,6 +110,24 @@ static void xe_vma_userptr_do_inval(struct
> > xe_vm *vm, struct xe_userptr_vma *uvm
> > userptr->finish_inuse = false;
> > drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > >userptr.pages,
> > xe_vma_size(vma) >> PAGE_SHIFT,
> > &ctx);
> > + return NULL;
> > +}
> > +
> > +static void
> > +xe_vma_userptr_complete_tlb_inval(struct xe_vm *vm, struct
> > xe_userptr_vma *uvma)
> > +{
> > + struct xe_userptr *userptr = &uvma->userptr;
> > + struct xe_vma *vma = &uvma->vma;
> > + struct drm_gpusvm_ctx ctx = {
> > + .in_notifier = true,
> > + .read_only = xe_vma_read_only(vma),
> > + };
> > +
>
> xe_svm_assert_in_notifier();
See previous comment on this.
>
> State machine asserts could be:
>
> xe_assert(..., userptr->tlb_inval_submitted);
> xe_assert(..., userptr->finish_inuse);
Will take a look at this as well.
>
> > + xe_tlb_inval_batch_wait(&userptr->inval_batch);
> > + userptr->tlb_inval_submitted = false;
> > + userptr->finish_inuse = false;
> > + drm_gpusvm_unmap_pages(&vm->svm.gpusvm, &uvma-
> > >userptr.pages,
> > + xe_vma_size(vma) >> PAGE_SHIFT,
> > &ctx);
> > }
> >
> > static struct mmu_interval_notifier_finish *
> > @@ -141,13 +171,11 @@ xe_vma_userptr_invalidate_pass1(struct xe_vm
> > *vm, struct xe_userptr_vma *uvma)
> > * If it's already in use, or all fences are already
> > signaled,
> > * proceed directly to invalidation without deferring.
> > */
> > - if (signaled || userptr->finish_inuse) {
> > - xe_vma_userptr_do_inval(vm, uvma, false);
> > - return NULL;
> > - }
> > + if (signaled || userptr->finish_inuse)
> > + return xe_vma_userptr_do_inval(vm, uvma, false);
> >
> > + /* Defer: the notifier core will call invalidate_finish
> > once done. */
> > userptr->finish_inuse = true;
> > -
>
> Unrelated.
Will fix.
>
> > return &userptr->finish;
> > }
> >
> > @@ -193,7 +221,15 @@ static void
> > xe_vma_userptr_invalidate_finish(struct
> > mmu_interval_notifier_finish
> > xe_vma_start(vma), xe_vma_size(vma));
> >
> > down_write(&vm->svm.gpusvm.notifier_lock);
> > - xe_vma_userptr_do_inval(vm, uvma, true);
> > + /*
> > + * If a TLB invalidation was previously submitted
> > (deferred from the
> > + * synchronous pass1 fallback), wait for it and unmap
> > pages.
> > + * Otherwise, fences have now completed: invalidate the
> > TLB and unmap.
> > + */
> > + if (uvma->userptr.tlb_inval_submitted)
> > + xe_vma_userptr_complete_tlb_inval(vm, uvma);
> > + else
> > + xe_vma_userptr_do_inval(vm, uvma, true);
> > up_write(&vm->svm.gpusvm.notifier_lock);
> > trace_xe_vma_userptr_invalidate_complete(vma);
> > }
> > @@ -231,7 +267,9 @@ void xe_vma_userptr_force_invalidate(struct
> > xe_userptr_vma *uvma)
> >
> > finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
> > if (finish)
> > - xe_vma_userptr_do_inval(vm, uvma, true);
> > + finish = xe_vma_userptr_do_inval(vm, uvma, true);
> > + if (finish)
> > + xe_vma_userptr_complete_tlb_inval(vm, uvma);
> > }
> > #endif
> >
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.h
> > b/drivers/gpu/drm/xe/xe_userptr.h
> > index 4f42db61fd62..7477009651c2 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.h
> > +++ b/drivers/gpu/drm/xe/xe_userptr.h
> > @@ -14,6 +14,8 @@
> >
> > #include <drm/drm_gpusvm.h>
> >
> > +#include "xe_tlb_inval_types.h"
> > +
> > struct xe_vm;
> > struct xe_vma;
> > struct xe_userptr_vma;
> > @@ -63,6 +65,15 @@ struct xe_userptr {
> > * Protected by @vm::svm.gpusvm.notifier_lock.
> > */
> > struct mmu_interval_notifier_finish finish;
> > +
> > + /**
> > + * @inval_batch: TLB invalidation batch for deferred
> > completion.
> > + * Stores an in-flight TLB invalidation submitted during a
> > two-pass
> > + * notifier so the wait can be deferred to a subsequent
> > pass, allowing
> > + * multiple GPUs to be signalled before any of them are
> > waited on.
> > + * Protected by @vm::svm.gpusvm.notifier_lock.
>
> In write mode?
Yeah, this one and the one below will look a bit more complicated if we
want to keep the invalidation injection. Will update.
>
> > + */
> > + struct xe_tlb_inval_batch inval_batch;
> > /**
> > * @finish_inuse: Whether @finish is currently in use by
> > an in-progress
> > * two-pass invalidation.
> > @@ -70,6 +81,13 @@ struct xe_userptr {
> > */
> > bool finish_inuse;
> >
> > + /**
> > + * @tlb_inval_submitted: Whether a TLB invalidation has
> > been submitted
> > + * via @inval_batch and is pending completion. When set,
> > the next pass
> > + * must call xe_tlb_inval_batch_wait() before reusing
> > @inval_batch.
> > + * Protected by @vm::svm.gpusvm.notifier_lock.
>
> In write mode?
>
> Matt
Thanks,
Thomas
>
> > + */
> > + bool tlb_inval_submitted;
> > /**
> > * @initial_bind: user pointer has been bound at least
> > once.
> > * write: vm->svm.gpusvm.notifier_lock in read mode and
> > vm->resv held.
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > b/drivers/gpu/drm/xe/xe_vm.c
> > index 7f29d2b2972d..fdad9329dfb4 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -3967,20 +3967,23 @@ void xe_vm_unlock(struct xe_vm *vm)
> > }
> >
> > /**
> > - * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> > + * xe_vm_invalidate_vma_submit - Submit a job to invalidate GPU mappings for
> > + * VMA.
> > * @vma: VMA to invalidate
> > + * @batch: TLB invalidation batch to populate; caller must later call
> > + * xe_tlb_inval_batch_wait() on it to wait for completion
> > *
> > * Walks a list of page tables leaves which it memset the entries owned by this
> > - * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
> > - * complete.
> > + * VMA to zero, invalidates the TLBs, but doesn't block waiting for TLB flush
> > + * to complete, but instead populates @batch which can be waited on using
> > + * xe_tlb_inval_batch_wait().
> > *
> > * Returns 0 for success, negative error code otherwise.
> > */
> > -int xe_vm_invalidate_vma(struct xe_vma *vma)
> > +int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch)
> > {
> > struct xe_device *xe = xe_vma_vm(vma)->xe;
> > struct xe_vm *vm = xe_vma_vm(vma);
> > - struct xe_tlb_inval_batch _batch;
> > struct xe_tile *tile;
> > u8 tile_mask = 0;
> > int ret = 0;
> > @@ -4023,14 +4026,33 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> >
> > ret = xe_tlb_inval_range_tilemask_submit(xe, xe_vma_vm(vma)->usm.asid,
> > xe_vma_start(vma), xe_vma_end(vma),
> > - tile_mask, &_batch);
> > + tile_mask, batch);
> >
> > /* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
> > WRITE_ONCE(vma->tile_invalidated, vma->tile_mask);
> > + return ret;
> > +}
> > +
> > +/**
> > + * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
> > + * @vma: VMA to invalidate
> > + *
> > + * Walks a list of page tables leaves which it memset the entries owned by this
> > + * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
> > + * complete.
> > + *
> > + * Returns 0 for success, negative error code otherwise.
> > + */
> > +int xe_vm_invalidate_vma(struct xe_vma *vma)
> > +{
> > + struct xe_tlb_inval_batch batch;
> > + int ret;
> >
> > - if (!ret)
> > - xe_tlb_inval_batch_wait(&_batch);
> > + ret = xe_vm_invalidate_vma_submit(vma, &batch);
> > + if (ret)
> > + return ret;
> >
> > + xe_tlb_inval_batch_wait(&batch);
> > return ret;
> > }
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.h
> > b/drivers/gpu/drm/xe/xe_vm.h
> > index 62f4b6fec0bc..0bc7ed23eeae 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.h
> > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > @@ -242,6 +242,8 @@ struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> >
> > int xe_vm_invalidate_vma(struct xe_vma *vma);
> >
> > +int xe_vm_invalidate_vma_submit(struct xe_vma *vma, struct xe_tlb_inval_batch *batch);
> > +
> > int xe_vm_validate_protected(struct xe_vm *vm);
> >
> > static inline void xe_vm_queue_rebind_worker(struct xe_vm *vm)
> > --
> > 2.53.0
> >
^ permalink raw reply [flat|nested] 21+ messages in thread
* Claude review: drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
2026-03-02 19:14 ` Matthew Brost
@ 2026-03-03 3:05 ` Claude Code Review Bot
1 sibling, 0 replies; 21+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the payoff patch. It extends the two-pass model to also defer TLB invalidation waits. The control flow is subtle: the commit message reads as if a **three-pass** flow were being layered on the **two-pass** infrastructure, so it is worth tracing exactly when each flow triggers.
**Three-pass flow via two-pass API:**
The commit message describes three flows:
```
pass1 (fences pending) -> invalidate_finish -> do_inval (sync TLB)
pass1 (fences done) -> do_inval -> invalidate_finish -> complete_tlb_inval (deferred TLB)
pass1 (finish occupied) -> do_inval (sync TLB, inline)
```
In the second flow, `xe_vma_userptr_do_inval` returns `&userptr->finish`, requesting a finish callback. That finish pointer is dispatched through `xe_vma_userptr_invalidate_finish`, which checks `tlb_inval_submitted` and routes to `xe_vma_userptr_complete_tlb_inval` when it is set.
A genuine third pass, however, is impossible by construction. Looking at `mn_itree_finish_pass`:
```c
static void mn_itree_finish_pass(struct llist_head *finish_passes)
{
	struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
	struct mmu_interval_notifier_finish *f, *next;

	llist_for_each_entry_safe(f, next, first, link)
		f->notifier->ops->invalidate_finish(f);
}
```
The `invalidate_finish` callback is `void`, so if `xe_vma_userptr_do_inval` produced a new finish pointer from within the finish pass, **nobody would collect it**: the return value would propagate to `xe_vma_userptr_invalidate_finish` and be dropped.
The question is therefore whether `xe_vma_userptr_do_inval` can ever return a non-NULL finish from that context. Re-reading `xe_vma_userptr_invalidate_finish`:
```c
static void xe_vma_userptr_invalidate_finish(struct mmu_interval_notifier_finish *finish)
{
	...
	down_write(&vm->svm.gpusvm.notifier_lock);
	if (uvma->userptr.tlb_inval_submitted)
		xe_vma_userptr_complete_tlb_inval(vm, uvma);
	else
		xe_vma_userptr_do_inval(vm, uvma, true);
	up_write(&vm->svm.gpusvm.notifier_lock);
	...
}
```
When `tlb_inval_submitted` is false (the common case for the "fences pending" path), it calls `xe_vma_userptr_do_inval(vm, uvma, true)` and ignores the return value, even though that function now returns a `struct mmu_interval_notifier_finish *`. If `do_inval` could return a non-NULL finish here, the submitted TLB invalidation would never be waited on and `finish_inuse`/`tlb_inval_submitted` would never be cleared. It cannot, though, as the function body shows:
```c
if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
	if (!userptr->finish_inuse) {
		userptr->finish_inuse = true;
		userptr->tlb_inval_submitted = true;
		err = xe_vm_invalidate_vma_submit(vma, &userptr->inval_batch);
		XE_WARN_ON(err);
		return &userptr->finish;
	}
	err = xe_vm_invalidate_vma(vma);
	XE_WARN_ON(err);
}
if (is_deferred)
	userptr->finish_inuse = false;
```
When called from `xe_vma_userptr_invalidate_finish` (i.e., `is_deferred = true`), `finish_inuse` is still `true` (it was set in pass1), so the `!userptr->finish_inuse` check fails, the function falls through to the synchronous `xe_vm_invalidate_vma` path, clears `finish_inuse`, and returns NULL. The deferred TLB path therefore only triggers when `signaled == true` in pass1 (all fences already done, so `do_inval` is called inline in pass1).
The actual flows are:
1. **Fences pending**: pass1 defers -> finish calls `do_inval` -> `finish_inuse` is true -> falls through to sync TLB -> returns NULL (ignored, correctly)
2. **Fences done, `finish_inuse == false`**: pass1 calls `do_inval` inline -> `do_inval` sees `!finish_inuse` -> submits TLB async -> returns finish ptr -> pass1 returns this to the notifier core -> finish calls `complete_tlb_inval`
3. **Fences done, `finish_inuse == true`**: pass1 calls `do_inval` inline -> falls through to sync TLB -> returns NULL
All three flows are correct, so there is no third pass: in flow #2 the two passes seen by the MMU notifier core are pass1 (enable sw signaling + submit TLB) -> pass2 (wait for TLB), and the dma_resv wait never happens because the fences are already signaled. It is still tricky, though: the `is_deferred` parameter controls whether `finish_inuse` is cleared, and in flow #2 (`is_deferred = false`, called from pass1) the flag is not cleared at the bottom of `do_inval` but later, in `xe_vma_userptr_complete_tlb_inval`.
**`xe_vma_userptr_force_invalidate` chain in patch 4:**
```c
finish = xe_vma_userptr_invalidate_pass1(vm, uvma);
if (finish)
	finish = xe_vma_userptr_do_inval(vm, uvma, true);
if (finish)
	xe_vma_userptr_complete_tlb_inval(vm, uvma);
```
Note that `finish` here is the `static` variable from patch 2 (still a bug). But the logic is: if pass1 deferred, do the inval; if that deferred too, complete TLB. This correctly handles all three steps synchronously for the force-invalidate testing path.
**Embedding `xe_tlb_inval_batch` in `xe_userptr`:**
```c
struct xe_tlb_inval_batch inval_batch;
```
This embeds 4 `xe_tlb_inval_fence` structs (~400-500 bytes) in every `xe_userptr`. This is a significant memory overhead per-VMA. Consider whether this should be dynamically allocated when needed, or whether the cost is justified by avoiding allocation in the notifier callback path.
**Summary of issues in patch 4:**
1. The `static` variable from patch 2 is still present (bug, should be fixed in patch 2)
2. The code flow is complex and the commit message's three-flow description is somewhat misleading about when each flow actually triggers
3. The return value of `xe_vma_userptr_do_inval` is ignored in `xe_vma_userptr_invalidate_finish` - this is actually correct behavior (it will always return NULL in that context because `finish_inuse` is already set), but deserves a comment
4. Per-VMA memory overhead from embedding `xe_tlb_inval_batch`
---
**Summary of actionable items:**
1. **Bug (Patch 2):** Remove `static` from `finish` variable in `xe_vma_userptr_force_invalidate`
2. **Minor (Patch 1):** Consider adding `WARN_ON` if both `invalidate` and `invalidate_start` are set in ops during registration
3. **Minor (Patch 3):** Inconsistent local variable naming (`_batch` vs `batch`)
4. **Minor (Patch 3):** `xe_tlb_inval_types.h` including `xe_device_types.h` is heavy for a types header
5. **Documentation (Patch 4):** Add a comment in `xe_vma_userptr_invalidate_finish` explaining that `xe_vma_userptr_do_inval` always returns NULL when called from the finish path (because `finish_inuse` is already set)
6. **Design consideration (Patch 4):** Memory cost of embedding `xe_tlb_inval_batch` in every `xe_userptr`
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 21+ messages in thread
* Claude review: Two-pass MMU interval notifiers
2026-03-02 16:32 [PATCH v2 0/4] Two-pass MMU interval notifiers Thomas Hellström
` (3 preceding siblings ...)
2026-03-02 16:32 ` [PATCH v2 4/4] drm/xe/userptr: Defer Waiting for TLB invalidation to the second pass if possible Thomas Hellström
@ 2026-03-03 3:05 ` Claude Code Review Bot
4 siblings, 0 replies; 21+ messages in thread
From: Claude Code Review Bot @ 2026-03-03 3:05 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: Two-pass MMU interval notifiers
Author: =?UTF-8?q?Thomas=20Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>
Patches: 4
Reviewed: 2026-03-03T13:05:43.927349
---
This series adds two-pass MMU interval notifier support to enable overlapping GPU operations (TLB flushes, context preemption) across multiple GPUs in multi-GPU scenarios. The motivation is sound: instead of serializing notify-then-wait per-GPU, the first pass kicks off operations on all GPUs, and the second pass waits for them all. The series is well-structured, with patch 1 providing the core mm infrastructure, patch 2 converting xe userptr to the two-pass model, patch 3 refactoring TLB invalidation into submit/wait halves, and patch 4 wiring up deferred TLB waits through the two-pass mechanism.
Overall the design is reasonable and the code is generally clean. There are a few issues to address, ranging from a real bug (the static local variable in patch 2) to concerns about the complexity of the control flow that patch 4 layers on the two-pass infrastructure.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 21+ messages in thread