* [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-04-30 19:18 Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
` (8 more replies)
0 siblings, 9 replies; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner,
Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström,
Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:
kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.
This issue was first reported in [1] and independently observed
internally and by Google.
A simple reproducer is:
- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish
Under this workload, ftrace shows a continuous loop of:
xe_shrinker_scan (kswapd)
xe_vma_rebind_exec
Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).
At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:
Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.
This series addresses the issue in two ways:
TTM: Issue direct reclaim only at beneficial_order. Allocations at
other non-zero orders avoid reclaim so they fail quickly rather
than looping in the allocator.
Xe: Introduce a heuristic in the shrinker to avoid eviction when
running under kswapd and the system appears memory-rich but
fragmented.
With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.
Buddyinfo after applying this series shows restored higher-order
availability:
Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1
Matt
v2:
- Layer with core MM / TTM helpers (Thomas)
v4:
- Fix build (CI)
[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Matthew Brost (6):
mm: Wire up order in shrink_control
mm: Introduce zone_maybe_fragmented_in_shrinker()
drm/ttm: Issue direct reclaim at beneficial_order
drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
drm/xe: Set TTM device beneficial_order to 9 (2M)
drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
drivers/gpu/drm/xe/xe_device.c | 3 ++-
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
include/drm/ttm/ttm_bo.h | 2 ++
include/linux/shrinker.h | 3 +++
include/linux/vmstat.h | 12 ++++++++++
mm/internal.h | 4 ++--
mm/shrinker.c | 13 +++++++----
mm/vmscan.c | 7 +++---
10 files changed, 76 insertions(+), 13 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 1/6] mm: Wire up order in shrink_control
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
` (7 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin,
Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm, linux-kernel,
Thomas Hellström
Pass the allocation order through shrink_control so shrinkers have
visibility into the order that triggered reclaim.
This allows shrinkers to implement better heuristics, such as detecting
high-order allocation pressure or fragmentation and avoiding eviction
of working sets when reclaim is invoked from kswapd.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
v4: Fix build without CONFIG_MEMCG (CI)
---
include/linux/shrinker.h | 3 +++
mm/internal.h | 4 ++--
mm/shrinker.c | 13 ++++++++-----
mm/vmscan.c | 7 ++++---
4 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..7072f693b9be 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -37,6 +37,9 @@ struct shrink_control {
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
+ /* Allocation order we are currently trying to fulfil. */
+ s8 order;
+
/*
* How many objects scan_objects should scan and try to reclaim.
* This is reset before every call, so it is safe for callees
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..ff8671dccf7b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1759,8 +1759,8 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
void __meminit __init_page_from_nid(unsigned long pfn, int nid);
/* shrinker related functions */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
- int priority);
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+ struct mem_cgroup *memcg, int priority);
int shmem_add_to_page_cache(struct folio *folio,
struct address_space *mapping,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 76b3f750cf65..c83f3b3daa08 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -466,7 +466,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
}
#ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
struct mem_cgroup *memcg, int priority)
{
struct shrinker_info *info;
@@ -528,6 +528,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .order = order,
.memcg = memcg,
};
struct shrinker *shrinker;
@@ -587,7 +588,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
return freed;
}
#else /* !CONFIG_MEMCG */
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
struct mem_cgroup *memcg, int priority)
{
return 0;
@@ -598,6 +599,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
* shrink_slab - shrink slab caches
* @gfp_mask: allocation context
* @nid: node whose slab caches to target
+ * @order: order of allocation
* @memcg: memory cgroup whose slab caches to target
* @priority: the reclaim priority
*
@@ -614,8 +616,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
*
* Returns the number of reclaimed slab objects.
*/
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
- int priority)
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+ struct mem_cgroup *memcg, int priority)
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
@@ -628,7 +630,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
* oom.
*/
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
- return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+ return shrink_slab_memcg(gfp_mask, nid, order, memcg, priority);
/*
* lockless algorithm of global shrink.
@@ -656,6 +658,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .order = order,
.memcg = memcg,
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..a54d14ecad25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -412,7 +412,7 @@ static unsigned long drop_slab_node(int nid)
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+ freed += shrink_slab(GFP_KERNEL, nid, 0, memcg, 0);
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
return freed;
@@ -5068,7 +5068,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
success = try_to_shrink_lruvec(lruvec, sc);
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+ shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
+ sc->priority);
if (!sc->proactive)
vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
@@ -6170,7 +6171,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
shrink_lruvec(lruvec, sc);
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+ shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
sc->priority);
/* Record the group's reclaim efficiency */
--
2.34.1
* [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-01 0:50 ` Santa, Carlos
` (2 more replies)
2026-04-30 19:18 ` [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order Matthew Brost
` (6 subsequent siblings)
8 siblings, 3 replies; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
allow subsystems to make coarse decisions about reclaim behavior in the
presence of likely fragmentation.
The helper implements a simple heuristic: if the number of free pages
in a zone exceeds twice the high watermark, the zone is considered to
have ample free memory and allocation failures are more likely due to
fragmentation than overall memory pressure.
This is intentionally imprecise and is not meant to replace the core
MM compaction or fragmentation accounting logic. Instead, it provides
a cheap signal for callers (e.g., shrinkers) that wish to avoid
overly aggressive reclaim when sufficient free memory exists but
high-order allocations may still fail.
No functional changes; this is a preparatory helper for future users.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
Hildenbrand)
---
include/linux/vmstat.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..1ad48f70c9d9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
return vmstat_text[item];
}
+static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
+{
+ /*
+ * Simple heuristic: if the number of free pages is more than twice the
+ * high watermark, this may suggest that the zone is heavily fragmented.
+ * When called from a shrinker, aggressively evicting memory in this case
+ * may do more harm to overall system performance than good.
+ */
+ return zone_page_state(zone, NR_FREE_PAGES) >
+ high_wmark_pages(zone) * 2;
+}
+
#ifdef CONFIG_NUMA
static inline const char *numa_stat_name(enum numa_stat_item item)
{
--
2.34.1
* [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-04 7:16 ` Christian König
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented() Matthew Brost
` (5 subsequent siblings)
8 siblings, 2 replies; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Tvrtko Ursulin, Thomas Hellström, Carlos Santa,
Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Daniel Colascione, Andi Shyti
Triggering kswapd at an order higher than beneficial_order makes little
sense, as the driver has already indicated the optimal order at which
reclaim is effective. Similarly, issuing direct reclaim or triggering
kswapd at an order lower than beneficial_order is ineffective, since the
driver does not benefit from reclaiming lower-order pages.
As a result, direct reclaim should only be issued with __GFP_NORETRY at
exactly beneficial_order, or, as a fallback, without __GFP_NORETRY at
order 0 when failure is not an option.
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 26a3689e5fd9..8425dbcc6c68 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -165,8 +165,8 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
* Do not add latency to the allocation path for allocation orders the
* device told us do not bring additional performance gains.
*/
- if (beneficial_order && order > beneficial_order)
- gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+ if (order && beneficial_order && order != beneficial_order)
+ gfp_flags &= ~__GFP_RECLAIM;
if (!ttm_pool_uses_dma_alloc(pool)) {
p = alloc_pages_node(pool->nid, gfp_flags, order);
--
2.34.1
* [PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (2 preceding siblings ...)
2026-04-30 19:18 ` [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M) Matthew Brost
` (4 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione
Introduce ttm_bo_shrink_kswap_maybe_fragmented() to allow TTM users to
distinguish background reclaim from kswapd that is likely driven by
high-order allocation failures under fragmentation.
The helper returns true when:
- the order of the shrinker invocation is non-zero, and
- reclaim is running in kswapd, and
- the target node is valid, and
- one of the relevant zones reports free pages significantly above
its high watermark (via zone_maybe_fragmented_in_shrinker()).
This provides a coarse signal that overall free memory is available,
and that reclaim activity may be driven by fragmentation rather than
true memory pressure.
The intent is to allow drivers to adjust shrinker behavior in this
case, for example by preferring purgeable or low-value objects instead
of aggressively evicting active working sets in the background reclaim
path.
The heuristic is intentionally simple and conservative, and is not
intended to replace core MM fragmentation or compaction decisions.
No functional change; this is a preparatory helper for TTM users.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
v3:
- s/ttm_bo_shrink_kswap_fragmented/ttm_bo_shrink_kswap_maybe_fragmented
(Andi)
- Wire in order (Thomas)
---
drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
include/drm/ttm/ttm_bo.h | 2 ++
2 files changed, 40 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index f83b7d5ec6c6..a6a4255c10cc 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -1169,3 +1169,41 @@ bool ttm_bo_shrink_avoid_wait(void)
return !current_is_kswapd();
}
EXPORT_SYMBOL(ttm_bo_shrink_avoid_wait);
+
+/**
+ * ttm_bo_shrink_kswap_maybe_fragmented() - Whether in kswapd and memory might be
+ * fragmented
+ * @nid: current node being shrunk
+ * @order: order of shrinker invocation
+ *
+ * Return: true if in kswapd and memory appears fragmented, false if not.
+ */
+bool ttm_bo_shrink_kswap_maybe_fragmented(int nid, s8 order)
+{
+ enum zone_type zone_type;
+
+ if (!order)
+ return false;
+
+ if (!current_is_kswapd())
+ return false;
+
+ if (!numa_valid_node(nid))
+ return false;
+
+#if IS_ENABLED(CONFIG_ZONE_DMA32)
+ zone_type = ZONE_DMA32;
+#else
+ zone_type = ZONE_NORMAL;
+#endif
+
+ for (; zone_type <= ZONE_NORMAL; ++zone_type) {
+ struct zone *zone = &NODE_DATA(nid)->node_zones[zone_type];
+
+ if (zone_maybe_fragmented_in_shrinker(zone))
+ return true;
+ }
+
+ return false;
+}
+EXPORT_SYMBOL(ttm_bo_shrink_kswap_maybe_fragmented);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 8310bc3d55f9..4d00f9aa90a1 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -262,6 +262,8 @@ bool ttm_bo_shrink_suitable(struct ttm_buffer_object *bo, struct ttm_operation_c
bool ttm_bo_shrink_avoid_wait(void);
+bool ttm_bo_shrink_kswap_maybe_fragmented(int nid, s8 order);
+
/**
* ttm_bo_reserve:
*
--
2.34.1
* [PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M)
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (3 preceding siblings ...)
2026-04-30 19:18 ` [PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented() Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation Matthew Brost
` (3 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Thomas Hellström, Carlos Santa, Matthew Auld, Andi Shyti
Set the TTM device beneficial_order to 9 (2M), which is the sweet
spot for Xe when attempting reclaim on system memory BOs, as it matches
the large GPU page size. This ensures reclaim is attempted at the most
effective order for the driver.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
---
drivers/gpu/drm/xe/xe_device.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 4b45b617a039..3f719ab08d1c 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -500,7 +500,8 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
xe->drm.anon_inode->i_mapping,
- xe->drm.vma_offset_manager, 0);
+ xe->drm.vma_offset_manager,
+ TTM_ALLOCATION_POOL_BENEFICIAL_ORDER(get_order(SZ_2M)));
if (WARN_ON(err))
return ERR_PTR(err);
--
2.34.1
* [PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (4 preceding siblings ...)
2026-04-30 19:18 ` [PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M) Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 23:01 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops " Andrew Morton
` (2 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel; +Cc: Thomas Hellström, Carlos Santa, Matthew Auld
When the Xe shrinker is invoked from kswapd, a large amount of free
memory in usable zones relative to the high watermark is a strong
signal that reclaim is being driven by fragmentation rather than true
memory pressure.
In this case, shrinking Xe memory is unlikely to help kswapd make
forward progress. Instead it can evict active GPU memory despite the
system still having substantial free memory, increasing residency churn
and reducing GPU forward progress.
Detect this case and bail out of the Xe shrinker early when it is
running under kswapd, the shrinker invocation is for a non-zero order,
and any usable zone has more than 2x its high watermark free.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
v3:
- Wire in order to heuristic (Thomas)
---
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_shrinker.c b/drivers/gpu/drm/xe/xe_shrinker.c
index 83374cd57660..792e0e216442 100644
--- a/drivers/gpu/drm/xe/xe_shrinker.c
+++ b/drivers/gpu/drm/xe/xe_shrinker.c
@@ -236,6 +236,9 @@ static unsigned long xe_shrinker_scan(struct shrinker *shrink, struct shrink_con
if (nr_scanned >= nr_to_scan || !can_backup)
goto out;
+ if (ttm_bo_shrink_kswap_maybe_fragmented(sc->nid, sc->order))
+ goto out;
+
/* If we didn't wake before, try to do it now if needed. */
if (!runtime_pm)
runtime_pm = xe_shrinker_runtime_pm_get(shrinker, true, 0, can_backup);
--
2.34.1
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (5 preceding siblings ...)
2026-04-30 19:18 ` [PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation Matthew Brost
@ 2026-04-30 23:01 ` Andrew Morton
2026-05-01 6:28 ` Matthew Brost
2026-05-01 1:42 ` Dave Chinner
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
8 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2026-04-30 23:01 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin,
Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> TTM allocations at higher orders can drive Xe into a pathological
> reclaim loop when memory is fragmented:
>
> kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
>
> In this state, reclaim is triggered despite substantial free memory,
> but fails to produce contiguous higher-order pages. The Xe shrinker then
> evicts active buffer objects, increasing faulting and rebind activity
> and further feeding the loop. The result is high CPU overhead and poor
> GPU forward progress.
>
> ...
>
> This series addresses the issue in two ways:
>
> TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> use __GFP_NORETRY to fail quickly rather than triggering reclaim.
>
> Xe: Introduce a heuristic in the shrinker to avoid eviction when
> running under kswapd and the system appears memory-rich but
> fragmented.
Please cc everyone on all the patches? It's kind of annoying to have
to hunt around to find out how these proposed changes will be used.
Personal preference, anyway.
AI review flagged a few possible issues:
https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com
* Re: [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
@ 2026-05-01 0:50 ` Santa, Carlos
[not found] ` <f25f27a1-bf09-44bd-9b37-49f159d82d6a@panix.com>
2026-05-05 0:00 ` Claude review: mm: Introduce zone_maybe_fragmented_in_shrinker() Claude Code Review Bot
2 siblings, 0 replies; 24+ messages in thread
From: Santa, Carlos @ 2026-05-01 0:50 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, Brost, Matthew,
dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
david@kernel.org, surenb@google.com, akpm@linux-foundation.org,
thomas.hellstrom@linux.intel.com, ljs@kernel.org,
vbabka@kernel.org, linux-mm@kvack.org, rppt@kernel.org,
mhocko@suse.com
On Thu, 2026-04-30 at 12:18 -0700, Matthew Brost wrote:
> Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper
> to
> allow subsystems to make coarse decisions about reclaim behavior in
> the
> presence of likely fragmentation.
>
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
>
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
>
> No functional changes; this is a preparatory helper for future users.
>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>
> ---
>
> v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
> Hildenbrand)
> ---
> include/linux/vmstat.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..1ad48f70c9d9 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum
> zone_stat_item item)
> return vmstat_text[item];
> }
>
On the below heuristic, I was thinking of the following case: a large
memory system (say 16G, 32G), heavily fragmented (for whatever reason)
but constrained by an IOMMU requiring large pages due to hw alignment.
If I am not mistaken, the below check will cause the shrinker to bail
out too 'early', since there's plenty of available memory but none of
it is contiguous. The end result would be handing back small pages,
which should reduce performance, right?
below are some made up numbers:
Metric          | 8GB               | 16GB
----------------|-------------------|-------------------
High Wmark      | ~45MB (11k pgs)   | ~90MB (23k pgs)
Bail Gate (2x)  | ~90MB (22k pgs)   | ~180MB (46k pgs)
Free RAM        | 80MB              | 7100MB (7.1GB)
Shrinker        | RUNS (Free<Gate)  | BAILS (Free>Gate)
Outcome         | Merges 2MB blocks | 4KB pages
In other words, replacing the check with numbers:
System  | Free RAM (Pages) | Gate (Pages) | Free < Gate?  | Result
--------|------------------|--------------|---------------|-------
8GB     | 20,480 (80MB)    | 22,946       | 20480 < 22946 | RUNS
16GB    | 1,832,740 (7.1G) | 45,894       | 1.8M < 45k?   | BAILS
Carlos
> +static inline bool zone_maybe_fragmented_in_shrinker(struct zone
> *zone)
> +{
> + /*
> + * Simple heuristic: if the number of free pages is more
> than twice the
> + * high watermark, this may suggest that the zone is heavily
> fragmented.
> + * When called from a shrinker, aggressively evicting memory
> in this case
> + * may do more harm to overall system performance than good.
> + */
> + return zone_page_state(zone, NR_FREE_PAGES) >
> + high_wmark_pages(zone) * 2;
> +}
> +
> #ifdef CONFIG_NUMA
> static inline const char *numa_stat_name(enum numa_stat_item item)
> {
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (6 preceding siblings ...)
2026-04-30 23:01 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops " Andrew Morton
@ 2026-05-01 1:42 ` Dave Chinner
2026-05-01 7:09 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
8 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2026-05-01 1:42 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin,
Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote:
> TTM allocations at higher orders can drive Xe into a pathological
> reclaim loop when memory is fragmented:
>
> kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
>
> In this state, reclaim is triggered despite substantial free memory,
> but fails to produce contiguous higher-order pages. The Xe shrinker then
> evicts active buffer objects, increasing faulting and rebind activity
> and further feeding the loop. The result is high CPU overhead and poor
> GPU forward progress.
>
> This issue was first reported in [1] and independently observed
> internally and by Google.
>
> A simple reproducer is:
>
> - Boot an iGPU system with mem=8G
> - Launch 10 Chrome tabs running the WebGL aquarium demo
> - Configure each tab with ~5k fish
>
> Under this workload, ftrace shows a continuous loop of:
>
> xe_shrinker_scan (kswapd)
> xe_vma_rebind_exec
>
> Performance degrades significantly, with each tab dropping to ~2 FPS on
> PTL (Ubuntu 24.04).
>
> At the same time, /proc/buddyinfo shows substantial free memory but no
> higher-order availability. For example, the Normal zone:
>
> Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
>
> This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
> indicating severe fragmentation.
>
> This series addresses the issue in two ways:
>
> TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> use __GFP_NORETRY to fail quickly rather than triggering reclaim.
NACK.
As I have said to the people trying to hack around direct reclaim
for high order allocations being costly for the page cache, fix the
problem with direct reclaim. (e.g.
https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/)
We should not be hacking around a problem in the mm infrastructure
by changing allocation context flags every high order allocation
call site that needs high order allocations. Understand and fix the
infrastructure problem once and for all.
> Xe: Introduce a heuristic in the shrinker to avoid eviction when
> running under kswapd and the system appears memory-rich but
> fragmented.
NACK on architectural grounds.
Custom heuristics in individual shrinkers to decide whether they
should do what the mm subsystem has asked them to do have -always-
been a mistake to allow. The mm subsystem makes the decision on how
much cache shrinkage needs to occur, the shrinkers just do what they
are told to do.
If we have a problem where a workload causes excessive shrinker
reclaim, then we need to address the problem in the infrastructure
because excessive reclaim affects the performance of -all-
subsystems with shrinkable caches, not just the TTM subsystem.
As it is, I can't review what you've actually implemented because
you only cc'd me on a single patch in the series. In future, please
cc me on the whole patchset because shrinkers need to work as a
coherent whole, not just in isolation....
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-04-30 23:01 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops " Andrew Morton
@ 2026-05-01 6:28 ` Matthew Brost
2026-05-01 12:51 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-05-01 6:28 UTC (permalink / raw)
To: Andrew Morton
Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin,
Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On Thu, Apr 30, 2026 at 04:01:05PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>
> > TTM allocations at higher orders can drive Xe into a pathological
> > reclaim loop when memory is fragmented:
> >
> > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> >
> > In this state, reclaim is triggered despite substantial free memory,
> > but fails to produce contiguous higher-order pages. The Xe shrinker then
> > evicts active buffer objects, increasing faulting and rebind activity
> > and further feeding the loop. The result is high CPU overhead and poor
> > GPU forward progress.
> >
> > ...
> >
> > This series addresses the issue in two ways:
> >
> > TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> > use __GFP_NORETRY to fail quickly rather than triggering reclaim.
> >
> > Xe: Introduce a heuristic in the shrinker to avoid eviction when
> > running under kswapd and the system appears memory-rich but
> > fragmented.
>
> Please cc everyone on all the patches? It's kind of annoying to have
> to hunt around to find out how these proposed changes will be used.
> Personal preference, anyway.
>
Will do - we discussed this in the past and thought we landed on Cc'ing
everyone on the cover and then only relevant people on individual
patches, but I will blast everyone going forward.
> AI review flagged a few possible issues:
> https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com
Idk who authors sashiko, but what would make it really nice is if you
could reply to it to talk things out.
Looking at replies...
- 'Could this global counter drift significantly'
  this looks right for multi-CPU, which isn't really the target
  here, but I will adjust
- 'Additionally, does NR_FREE_PAGES implicitly include CMA pages?'
  this looks right, will adjust
- 'Can high_wmark_pages(zone) evaluate to zero during early boot'
  theoretically possible (?), but a non-issue IMO; for a GPU
  shrinker, which is the current use case, this is impossible, but
  maybe add a warn_on if high_wmark_pages(zone) returns zero
- 'Is this description accurate?'
  I inverted the TTM kernel doc vs the code, will fix
Matt
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-05-01 1:42 ` Dave Chinner
@ 2026-05-01 7:09 ` Matthew Brost
0 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2026-05-01 7:09 UTC (permalink / raw)
To: Dave Chinner
Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin,
Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Fri, May 01, 2026 at 11:42:19AM +1000, Dave Chinner wrote:
Thanks for the feedback. I’m looking into this more, and it’s becoming
clear that this is a hard problem—one that will likely require
coordinated work between DRM and core MM to really sort out. That said,
I do think what I have in place is a reasonable short-term fix.
More below.
> On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote:
> > TTM allocations at higher orders can drive Xe into a pathological
> > reclaim loop when memory is fragmented:
> >
> > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> >
> > In this state, reclaim is triggered despite substantial free memory,
> > but fails to produce contiguous higher-order pages. The Xe shrinker then
> > evicts active buffer objects, increasing faulting and rebind activity
> > and further feeding the loop. The result is high CPU overhead and poor
> > GPU forward progress.
> >
> > This issue was first reported in [1] and independently observed
> > internally and by Google.
> >
> > A simple reproducer is:
> >
> > - Boot an iGPU system with mem=8G
> > - Launch 10 Chrome tabs running the WebGL aquarium demo
> > - Configure each tab with ~5k fish
> >
> > Under this workload, ftrace shows a continuous loop of:
> >
> > xe_shrinker_scan (kswapd)
> > xe_vma_rebind_exec
> >
> > Performance degrades significantly, with each tab dropping to ~2 FPS on
> > PTL (Ubuntu 24.04).
> >
> > At the same time, /proc/buddyinfo shows substantial free memory but no
> > higher-order availability. For example, the Normal zone:
> >
> > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
> >
> > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
> > indicating severe fragmentation.
> >
> > This series addresses the issue in two ways:
> >
> > TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> > use __GFP_NORETRY to fail quickly rather than triggering reclaim.
>
> NACK.
>
> As I have said to the people trying to hack around direct reclaim
> for high order allocations being costly for the page cache, fix the
> problem with direct reclaim. (e.g.
> https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/)
>
I read your response. Maybe it isn't clear what is going on here.
At beneficial_order: gfp == __GFP_RECLAIM | __GFP_NORETRY
At order zero: gfp == __GFP_RECLAIM
This is roughly the existing behavior; the exact changes are here [1].
[1] https://patchwork.freedesktop.org/patch/722247/?series=165329&rev=3
If this is truly a NACK, then we can rethink it—likely by disabling
reclaim at higher orders—but that has its own downsides for DRM and
GPUs. Ideally, you want purgeable BOs to be evicted when a higher-order
allocation fails; you really don’t want to end up in an insane kswap
loop.
> We should not be hacking around a problem in the mm infrastructure
> by changing allocation context flags every high order allocation
> call site that needs high order allocations. Understand and fix the
> infrastructure problem once and for all.
>
Well, I agree that we should aim to fix this in core MM, but as the
saying goes, Rome wasn’t built in a day. The fact is that these GFP
flags do exist, and suddenly drawing a line and declaring them no longer
valid feels a bit unfair. I’ll also note that Intel—and I
personally—have an interest in fixing shrinking, so you can expect
follow-up work here.
> > Xe: Introduce a heuristic in the shrinker to avoid eviction when
> > running under kswapd and the system appears memory-rich but
> > fragmented.
>
> NACK on architectural grounds.
>
> Custom heuristics in individual shrinkers to decide whether they
> should do what the mm subsystem has asked them to do have -always-
> been a mistake to allow. The mm subsystem makes the decision on how
I’m not going to disagree with you on custom heuristics in individual
shrinkers, but I’d wager that most shrinkers sadly already implement
custom heuristics.
> much cache shrinkage needs to occur, the shrinkers just do what they
> are told to do.
>
> If we have a problem where a workload causes excessive shrinker
> reclaim, then we need to address the problem in the infrastructure
> because excessive reclaim affects the performance of -all-
> subsystems with shrinkable caches, not just the TTM subsystem.
>
Yes, I agree, and I’ve thought about the implications of simply having
TTM back off when a higher-order allocation fails, even when we actually
have enough memory, and how that would affect everyone. This series at
least fixes the “well, there goes my GUI” problem.
I do have another patch locally that prevents TTM from accidentally
fragmenting memory and triggering the kswap loop, but under enough
pressure I can still get the GUI to lock up for periods of time. With
this series, however, I can’t reproduce that issue.
> As it is, I can't review what you've actually implemented because
> you only cc'd me on a single patch in the series. In future, please
> cc me on the whole patchset because shrinkers need to work as a
> coherent whole, not just in isolation....
>
Sorry about this - Andrew just said the same thing. Here is the PW link [2].
Or:
b4 mbox 20260430191809.2142544-1-matthew.brost@intel.com
[2] https://patchwork.freedesktop.org/series/165329/
If you have any ideas on how to fix this in the core, let’s discuss. I
have a bunch of ideas in my head, but core MM isn’t my native domain.
Matt
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-05-01 6:28 ` Matthew Brost
@ 2026-05-01 12:51 ` Andrew Morton
0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2026-05-01 12:51 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin,
Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On Thu, 30 Apr 2026 23:28:08 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> > AI review flagged a few possible issues:
> > https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com
>
> Idk who authors sashiko, but what would make it really nice is if you
> could reply to it to talk things out.
It's a gemini 3 thing, based on prompts developed by Roman
Gushchin and Chris Mason and others. Google is making this available
to kernel developers at a non-trivial expense.
And yes, it would be great if Sashiko were able to learn from our
replies and to fine-tune its checking based on the human corrections.
I've asked for this a few times but didn't really understand the reply
;)
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
[not found] ` <f25f27a1-bf09-44bd-9b37-49f159d82d6a@panix.com>
@ 2026-05-01 20:00 ` Matthew Brost
[not found] ` <1bc0b1a7-a01f-4dc2-ad7a-3a05f975331e@panix.com>
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-05-01 20:00 UTC (permalink / raw)
To: Kenneth Crudup, airlied
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Fri, May 01, 2026 at 12:08:48PM -0700, Kenneth Crudup wrote:
>
> On 4/30/26 12:18, Matthew Brost wrote:
>
> > Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
> > allow subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation
>
> I'm running Linus' master on my LunarLake (258v) laptop, and sometimes after
+Dave
So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
and something looks off here. Thanks for pointing this out.
I'm grabbing a machine now to see if I can recreate this...
Matt
[1] git format-patch -1 2232ba9c7931d
> compiling a kernel (of all things) I'd see kswapd0 thrash despite having
> quite a bit of free memory.
>
> I finally traced it to the xe driver after seeing the "GPUActive" field in
> /proc/meminfo suddenly start rising, eventually growing larger than real
> memory by several times (see below).
>
> This patchset fixes the issue, and I'm sure there'll be a fix going into
> Linus' master soon, but what I'M wondering is how could building a kernel
> (which is just in a KDE Konsole running on Wayland) make the GPActive grow
> from ~1.6G to > 30G (and continue to rise, RN I'm seeing 91839848 kBs and
> still growing).
>
> -Kenny
>
> ----
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 652640 kB
> GPUReclaim: 403988 kB
>
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 651180 kB
> GPUReclaim: 406812 kB
>
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 659004 kB
> GPUReclaim: 399396 kB
>
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 666996 kB
> GPUReclaim: 392764 kB
>
> <some hours later>
> GPUActive: 91832468 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 488000 kB
>
> GPUActive: 91832332 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 487988 kB
>
> GPUActive: 91869376 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 486504 kB
> ----
>
> --
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA
>
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
[not found] ` <1bc0b1a7-a01f-4dc2-ad7a-3a05f975331e@panix.com>
@ 2026-05-01 21:10 ` Matthew Brost
2026-05-01 22:33 ` Matthew Brost
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-05-01 21:10 UTC (permalink / raw)
To: Kenneth Crudup
Cc: airlied, intel-xe, dri-devel, Thomas Hellström,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
>
> On 5/1/26 13:00, Matthew Brost wrote:
>
> > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave
> > [1], and something looks off here.
>
> Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
>
> Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the
I don't think I'm firing any shots.
> memory usage really that high?
I've been able to recreate this. It looks like the accounting is correct
until the Xe shrinker runs - every time it kicks in, GPUActive grows and
never drops back below some new floor value. It looks like an accounting
bug in TTM or Xe (?).
Here is my output on an 8G PTL where I have intentionally triggered the
shrinker to evict at least 23875 BOs (most likely quite a few more, but
this is what I can easily see in dmesg) after closing everything on the
desktop.
cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
GPUActive: 13100036 kB
GPUReclaim: 152 kB
Node 0, zone DMA 0 1 0 0 0 0 0 0 1 1 3
Node 0, zone DMA32 2320 1882 1523 1238 980 740 482 275 114 88 205
Node 0, zone Normal 9751 9343 6466 4237 2703 1162 805 420 191 145 289
Let me spend a bit of time here to see if I figure out where the
accounting goes wrong.
Matt
>
> (BTW, those are in 30-second intervals)
>
> > > ----
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 652640 kB
> > > GPUReclaim: 403988 kB
> > >
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 651180 kB
> > > GPUReclaim: 406812 kB
> > >
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 659004 kB
> > > GPUReclaim: 399396 kB
> > >
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 666996 kB
> > > GPUReclaim: 392764 kB
> > >
> > > <some hours later>
> > > GPUActive: 91832468 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 488000 kB
> > >
> > > GPUActive: 91832332 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 487988 kB
> > >
> > > GPUActive: 91869376 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 486504 kB
> > > ----
>
> -K
>
> --
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA
>
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-05-01 21:10 ` Matthew Brost
@ 2026-05-01 22:33 ` Matthew Brost
0 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2026-05-01 22:33 UTC (permalink / raw)
To: Kenneth Crudup
Cc: airlied, intel-xe, dri-devel, Thomas Hellström,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On Fri, May 01, 2026 at 02:10:07PM -0700, Matthew Brost wrote:
> On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
> >
> > On 5/1/26 13:00, Matthew Brost wrote:
> >
> > > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave
> > > [1], and something looks off here.
> >
> > Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
> >
> > Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the
>
> I don't think I'm firing any shots.
>
> > memory usage really that high?
>
> I've been able to recreate this. It looks like the accounting is correct
> until the Xe shrinker runs - every time it kicks in, GPUActive grows and
> never drops back below some new floor value. It looks like an accounting
> bug in TTM or Xe (?).
>
> Here is my output on an 8G PTL where I have intentionally triggered the
> shrinker to evict at least 23875 BOs (most likely quite a few more, but
> this is what I can easily see in dmesg) after closing everything on the
> desktop.
>
> cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
> GPUActive: 13100036 kB
> GPUReclaim: 152 kB
> Node 0, zone DMA 0 1 0 0 0 0 0 0 1 1 3
> Node 0, zone DMA32 2320 1882 1523 1238 980 740 482 275 114 88 205
> Node 0, zone Normal 9751 9343 6466 4237 2703 1162 805 420 191 145 289
>
> Let me spend a bit of time here to see if I figure out where the
> accounting goes wrong.
>
Looks like a simple accounting error in the shrinking path. Here is a fix
[1] that seems to work for me.
If you want to give it a try, that would be helpful.
Matt
[1] https://patchwork.freedesktop.org/series/165862/
> Matt
>
> >
> > (BTW, those are in 30-second intervals)
> >
> > > > ----
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 652640 kB
> > > > GPUReclaim: 403988 kB
> > > >
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 651180 kB
> > > > GPUReclaim: 406812 kB
> > > >
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 659004 kB
> > > > GPUReclaim: 399396 kB
> > > >
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 666996 kB
> > > > GPUReclaim: 392764 kB
> > > >
> > > > <some hours later>
> > > > GPUActive: 91832468 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 488000 kB
> > > >
> > > > GPUActive: 91832332 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 487988 kB
> > > >
> > > > GPUActive: 91869376 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 486504 kB
> > > > ----
> >
> > -K
> >
> > --
> > Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> > CA
> >
* Re: [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order
2026-04-30 19:18 ` [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order Matthew Brost
@ 2026-05-04 7:16 ` Christian König
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
1 sibling, 0 replies; 24+ messages in thread
From: Christian König @ 2026-05-04 7:16 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Huang Rui,
Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Daniel Colascione, Andi Shyti
On 4/30/26 21:18, Matthew Brost wrote:
> Triggering kswap at an order higher than beneficial_order makes little
> sense, as the driver has already indicated the optimal order at which
> reclaim is effective. Similarly, issuing direct reclaim or triggering
> kswap at a lower order than beneficial_order is ineffective, since the
> driver does not benefit from reclaiming lower-order pages.
>
> As a result, direct reclaim should only be issued with __GFP_NORETRY at
> exactly beneficial_order, or as a fallback, direct reclaim without
> __GFP_NORETRY at order 0 when failure is not an option.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Carlos Santa <carlos.santa@intel.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> CC: dri-devel@lists.freedesktop.org
> Cc: Daniel Colascione <dancol@dancol.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Christian Koenig <christian.koenig@amd.com>
> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 26a3689e5fd9..8425dbcc6c68 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -165,8 +165,8 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> * Do not add latency to the allocation path for allocations orders
> * device tolds us do not bring them additional performance gains.
> */
> - if (beneficial_order && order > beneficial_order)
> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (order && beneficial_order && order != beneficial_order)
> + gfp_flags &= ~__GFP_RECLAIM;
Just one additional nit: The subject line says "direct reclaim" but we disable both direct as well as kswapd based reclaim.
Potentially just update the subject line.
Regards,
Christian.
>
> if (!ttm_pool_uses_dma_alloc(pool)) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
* Claude review: mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
` (7 preceding siblings ...)
2026-05-01 1:42 ` Dave Chinner
@ 2026-05-05 0:00 ` Claude Code Review Bot
8 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
Author: Matthew Brost <matthew.brost@intel.com>
Patches: 6
Reviewed: 2026-05-05T10:00:00.215675
---
This is a 6-patch series (v4) addressing a real and well-documented pathological reclaim loop in Xe GPU systems under memory fragmentation. The problem is clearly described: kswapd drives shrinker eviction of GPU BOs, which causes rebinds, which trigger more allocation pressure, in a tight loop — despite substantial free memory being available (just fragmented).
The approach is layered sensibly: two MM core patches (wiring order into shrink_control, adding a fragmentation heuristic), two TTM patches (beneficial_order reclaim changes, a TTM helper for fragmentation detection), and two Xe patches (setting the order, using the helper to bail out).
**Overall assessment: The series is well-motivated and the approach is reasonable. However, there are substantive concerns about patch 3's GFP flag changes and some issues with the fragmentation heuristic's placement and logic.**
---
Generated by Claude Code Patch Reviewer
* Claude review: mm: Wire up order in shrink_control
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
@ 2026-05-05 0:00 ` Claude Code Review Bot
0 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 1/6] mm: Wire up order in shrink_control`
This patch adds an `s8 order` field to `struct shrink_control` and threads the allocation order through `shrink_slab()` from `scan_control.order` to shrinkers.
**Looks correct.** The field is added cleanly:
```c
+ /* Allocation order we are currently trying to fulfil. */
+ s8 order;
```
The `shrink_slab` signature change is straightforward, threading `order` from `sc->order` at call sites in vmscan.c:
```c
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+ shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
+ sc->priority);
```
The `drop_slab_node()` call correctly passes `0` as order since it's not driven by a specific allocation.
One minor nit: the `s8` type is a good match since `scan_control.order` is already `s8`, but it would be worth a brief justification in the commit message for why `s8` over `int` (consistency with scan_control). This is cosmetic.
**No functional issues. This patch is fine.**
---
Generated by Claude Code Patch Reviewer
* Claude review: mm: Introduce zone_maybe_fragmented_in_shrinker()
2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
2026-05-01 0:50 ` Santa, Carlos
[not found] ` <f25f27a1-bf09-44bd-9b37-49f159d82d6a@panix.com>
@ 2026-05-05 0:00 ` Claude Code Review Bot
2 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()`
Adds a static inline heuristic to `include/linux/vmstat.h`:
```c
+static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
+{
+	return zone_page_state(zone, NR_FREE_PAGES) >
+	       high_wmark_pages(zone) * 2;
+}
```
**Concerns:**
1. **The 2x watermark threshold is arbitrary and undocumented.** The commit message says "intentionally imprecise", which is fair, but the factor of 2 has no empirical justification in the commit message. It would be useful to explain why 2x specifically — was this tuned experimentally? Different systems have very different watermark configurations.
2. **Overflow potential with `high_wmark_pages(zone) * 2`.** `high_wmark_pages()` returns `unsigned long`. On 32-bit systems with large memory zones (unlikely but theoretically possible), `* 2` could overflow. This is very unlikely to matter in practice but a `2UL *` would be safer.
3. **Placement in vmstat.h.** This is a policy heuristic, not a stat accessor. `vmstat.h` is about statistics, not reclaim policy decisions. MM maintainers may prefer this in a header more related to reclaim (e.g., `mm.h` or a shrinker-specific header). Given the Cc list includes multiple MM maintainers, they may weigh in on placement.
4. **Name is good** — the `_in_shrinker` suffix and `_maybe_` qualifier set appropriate expectations.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/ttm: Issue direct reclaim at beneficial_order
2026-04-30 19:18 ` [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order Matthew Brost
2026-05-04 7:16 ` Christian König
@ 2026-05-05 0:00 ` Claude Code Review Bot
1 sibling, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order`
This patch changes how GFP flags are modified in `ttm_pool_alloc_page()`:
```c
-	if (beneficial_order && order > beneficial_order)
-		gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+	if (order && beneficial_order && order != beneficial_order)
+		gfp_flags &= ~__GFP_RECLAIM;
```
**This patch has significant issues:**
1. **The change from `__GFP_DIRECT_RECLAIM` to `__GFP_RECLAIM` is aggressive.** `__GFP_RECLAIM` is `(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)`. The old code only removed direct reclaim for orders above beneficial_order, allowing kswapd to still be woken. The new code removes *all* reclaim (including kswapd wakeup) for any order that isn't exactly `beneficial_order` (and isn't 0). This means:
- Orders *below* `beneficial_order` (e.g., order 1-8 when beneficial is 9) now lose **all** reclaim, not just direct reclaim. Before, these weren't affected at all.
- Orders *above* `beneficial_order` also lose kswapd wakeup, which they didn't before.
2. **The `order != beneficial_order` condition is too broad.** The old code only penalized orders *above* beneficial_order. The new code penalizes all non-zero orders that aren't exactly equal to beneficial_order. For orders between 1 and `beneficial_order - 1`, the old code allowed normal reclaim behavior. The commit message says "issuing direct reclaim at a lower order than beneficial_order is ineffective" — but that's the driver's preferred large page size. The allocator will still try smaller orders as fallbacks (that's how the pool works — it iterates downward), and those smaller allocations should still be allowed to trigger reclaim since they may succeed where the large order failed.
3. **The commit message is misleading.** It says "direct reclaim should only be issued with `__GFP_NORETRY` at exactly beneficial_order" but the code doesn't set `__GFP_NORETRY` — it strips `__GFP_RECLAIM`. These are very different semantics. Also, `__GFP_NORETRY` is already set at line 161 for all non-zero orders:
```c
	if (order)
		gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
			     __GFP_THISNODE;
```
So the description of intended behavior doesn't match the implementation.
4. **The commit message's order-0 fallback claim** ("direct reclaim without `__GFP_NORETRY` at order 0 when failure is not an option") is correctly handled by the `if (order && ...)` guard.
**Recommendation: This patch needs rework.** The semantics change is too aggressive. Consider keeping `__GFP_DIRECT_RECLAIM` removal (not `__GFP_RECLAIM`) and being more precise about the order range.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
2026-04-30 19:18 ` [PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented() Matthew Brost
@ 2026-05-05 0:00 ` Claude Code Review Bot
0 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()`
Adds a helper in `ttm_bo_util.c`:
```c
+bool ttm_bo_shrink_kswap_maybe_fragmented(int nid, s8 order)
+{
+	enum zone_type zone_type;
+
+	if (!order)
+		return false;
+
+	if (!current_is_kswapd())
+		return false;
+
+	if (!numa_valid_node(nid))
+		return false;
+
+#if IS_ENABLED(CONFIG_ZONE_DMA32)
+	zone_type = ZONE_DMA32;
+#else
+	zone_type = ZONE_NORMAL;
+#endif
+
+	for (; zone_type <= ZONE_NORMAL; ++zone_type) {
+		struct zone *zone = &NODE_DATA(nid)->node_zones[zone_type];
+
+		if (zone_maybe_fragmented_in_shrinker(zone))
+			return true;
+	}
+
+	return false;
+}
```
**Concerns:**
1. **The zone iteration logic is confusing.** When `CONFIG_ZONE_DMA32` is enabled, it iterates from `ZONE_DMA32` to `ZONE_NORMAL`. When it's disabled, it only checks `ZONE_NORMAL` (since `zone_type = ZONE_NORMAL` and the loop condition is `<= ZONE_NORMAL`). This is correct but the `#if`/`#else` makes it harder to read than necessary. A comment explaining the iteration range would help.
2. **`ZONE_DMA` is excluded.** The helper only checks DMA32 and NORMAL zones. This is probably intentional (GPU memory typically comes from these zones) but isn't documented.
3. **Return semantics are good.** The early returns for `!order` (order-0 allocations shouldn't trigger fragmentation avoidance), `!current_is_kswapd()` (only affects background reclaim), and `!numa_valid_node(nid)` are sensible guards.
4. **Docstring typo:** "false is not" should be "false if not".
```c
* Return: true if in kswap and memory appears fragmented, false is not.
```
5. **The `order` parameter is `s8` but only tested for non-zero.** The actual order value isn't used beyond the zero check. If the intent is only "was this a higher-order allocation?", a bool would be clearer. However, passing `order` through is forward-compatible for future refinements that may want the actual value.
**Minor issues only. Functionally sound.**
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe: Set TTM device beneficial_order to 9 (2M)
2026-04-30 19:18 ` [PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M) Matthew Brost
@ 2026-05-05 0:00 ` Claude Code Review Bot
0 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M)`
```c
-				       xe->drm.vma_offset_manager, 0);
+				       xe->drm.vma_offset_manager,
+				       TTM_ALLOCATION_POOL_BENEFICIAL_ORDER(get_order(SZ_2M)));
```
**This is straightforward and correct.** `get_order(SZ_2M)` returns 9 on 4K-page systems. The `TTM_ALLOCATION_POOL_BENEFICIAL_ORDER` macro packs this into the `alloc_flags` parameter. Has Andi's R-b already.
One observation: on architectures with non-4K base page sizes (e.g., 64K pages on arm64), `get_order(SZ_2M)` would return a different value (e.g., order 5 for 64K pages). The commit message says "9 (2M)" but the code uses `get_order(SZ_2M)`, which is the correct portable approach. The commit message should perhaps say "order matching 2M" rather than hardcoding "9".
**No functional issues.**
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
2026-04-30 19:18 ` [PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation Matthew Brost
@ 2026-05-05 0:00 ` Claude Code Review Bot
0 siblings, 0 replies; 24+ messages in thread
From: Claude Code Review Bot @ 2026-05-05 0:00 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Subject:** `[PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation`
```c
+	if (ttm_bo_shrink_kswap_maybe_fragmented(sc->nid, sc->order))
+		goto out;
```
**Placement concern:** This check is inserted at line 238 in `xe_shrinker_scan`, between the purgeable scan (phase 1) and the backup/writeback scan (phase 2):
```c
 	if (nr_scanned >= nr_to_scan || !can_backup)
 		goto out;
+	if (ttm_bo_shrink_kswap_maybe_fragmented(sc->nid, sc->order))
+		goto out;
 	/* If we didn't wake before, try to do it now if needed. */
 	if (!runtime_pm)
 		runtime_pm = xe_shrinker_runtime_pm_get(shrinker, true, 0, can_backup);
```
**This placement is well-chosen.** It allows purgeable BOs to still be freed (those are cheap to reclaim — no writeback needed, just discarded). It only blocks the more expensive backup/writeback path where active GPU working set would be evicted to swap/shmem. This is the right tradeoff: during fragmentation-driven kswapd, purge expendable objects but don't evict active working sets.
**However**, the `goto out` means `sc->nr_scanned` retains the value from the purgeable scan, and the return is `nr_scanned ? freed : SHRINK_STOP`. If the purgeable scan found nothing to scan (`nr_scanned == 0`) and we hit the fragmentation bail-out, we'll return `SHRINK_STOP`, which tells the MM "this shrinker has nothing to offer." That's actually the desired behavior here — if we have nothing purgeable and we're under fragmentation, we genuinely don't want kswapd to keep hammering this shrinker.
**No functional issues. Good integration.**
---
## Summary
The series addresses a real problem with a reasonable layered approach. **Patch 3 is the main concern** — the change from `__GFP_DIRECT_RECLAIM` to `__GFP_RECLAIM` and the broadened condition from `order > beneficial_order` to `order != beneficial_order` are both more aggressive than justified in the commit message, and the commit message itself is inconsistent with the implementation. The remaining patches are solid, with minor documentation and style nits in patches 2 and 4.
---
Generated by Claude Code Patch Reviewer
Thread overview: 24+ messages
2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
2026-05-01 0:50 ` Santa, Carlos
[not found] ` <f25f27a1-bf09-44bd-9b37-49f159d82d6a@panix.com>
2026-05-01 20:00 ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
[not found] ` <1bc0b1a7-a01f-4dc2-ad7a-3a05f975331e@panix.com>
2026-05-01 21:10 ` Matthew Brost
2026-05-01 22:33 ` Matthew Brost
2026-05-05 0:00 ` Claude review: mm: Introduce zone_maybe_fragmented_in_shrinker() Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 3/6] drm/ttm: Issue direct reclaim at beneficial_order Matthew Brost
2026-05-04 7:16 ` Christian König
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 4/6] drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented() Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 5/6] drm/xe: Set TTM device beneficial_order to 9 (2M) Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 19:18 ` [PATCH v4 6/6] drm/xe: Avoid shrinker reclaim from kswapd under fragmentation Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot
2026-04-30 23:01 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops " Andrew Morton
2026-05-01 6:28 ` Matthew Brost
2026-05-01 12:51 ` Andrew Morton
2026-05-01 1:42 ` Dave Chinner
2026-05-01 7:09 ` Matthew Brost
2026-05-05 0:00 ` Claude review: " Claude Code Review Bot