* Re: [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
2026-05-11 16:24 [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure Thomas Hellström
@ 2026-05-12 13:30 ` Matthew Auld
2026-05-13 7:20 ` kernel test robot
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Matthew Auld @ 2026-05-12 13:30 UTC (permalink / raw)
To: Thomas Hellström, intel-xe
Cc: Christian König, Huang Rui, Matthew Brost, Dave Airlie,
dri-devel, stable
On 11/05/2026 17:24, Thomas Hellström wrote:
> Apply the same fix as b2ed01e7ad ("drm/ttm: Fix ttm_bo_swapout()
> infinite LRU walk on swapout failure") to the ttm_bo_shrink() path.
>
> Move del_bulk_move from before the backup to after success only,
> using ttm_resource_del_bulk_move_unevictable() since the resource
> is now unevictable once fully backed up.
>
> Fixes: 70d645deac98 ("drm/ttm: Add helpers for shrinking")
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: <stable@vger.kernel.org> # v6.15+
> Assisted-by: GitHub_Copilot:claude-opus-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> ---
> drivers/gpu/drm/ttm/ttm_bo_util.c | 11 +++--------
> 1 file changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index f83b7d5ec6c6..3e3c201a0222 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -1112,19 +1112,14 @@ long ttm_bo_shrink(struct ttm_operation_ctx *ctx, struct ttm_buffer_object *bo,
>  	if (lret < 0)
>  		return lret;
>
> -	if (bo->bulk_move) {
> -		spin_lock(&bdev->lru_lock);
> -		ttm_resource_del_bulk_move(bo->resource, bo);
> -		spin_unlock(&bdev->lru_lock);
> -	}
> -
>  	lret = ttm_tt_backup(bdev, bo->ttm, (struct ttm_backup_flags)
>  			     {.purge = flags.purge,
>  			      .writeback = flags.writeback});
>
> -	if (lret <= 0 && bo->bulk_move) {
> +	if (lret > 0) {
>  		spin_lock(&bdev->lru_lock);
> -		ttm_resource_add_bulk_move(bo->resource, bo);
> +		ttm_resource_del_bulk_move_unevictable(bo->resource, bo);
> +		ttm_resource_move_to_lru_tail(bo->resource);
>  		spin_unlock(&bdev->lru_lock);
>  	}
>
* Re: [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
From: kernel test robot @ 2026-05-13 7:20 UTC (permalink / raw)
To: Thomas Hellström, intel-xe
Cc: oe-kbuild-all, Thomas Hellström, Christian König,
Huang Rui, Matthew Auld, Matthew Brost, Dave Airlie, dri-devel,
stable
Hi Thomas,
kernel test robot noticed the following build errors:
[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Thomas-Hellstr-m/drm-ttm-Fix-ttm_bo_shrink-infinite-LRU-walk-on-backup-failure/20260513-095356
base: https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next
patch link: https://lore.kernel.org/r/20260511162443.24352-1-thomas.hellstrom%40linux.intel.com
patch subject: [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20260513/202605131522.yUSpVs9Q-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131522.yUSpVs9Q-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131522.yUSpVs9Q-lkp@intel.com/
All errors (new ones prefixed by >>):
   drivers/gpu/drm/ttm/ttm_bo_util.c: In function 'ttm_bo_shrink':
>> drivers/gpu/drm/ttm/ttm_bo_util.c:1121:17: error: implicit declaration of function 'ttm_resource_del_bulk_move_unevictable'; did you mean 'ttm_resource_del_bulk_move'? [-Wimplicit-function-declaration]
    1121 |                 ttm_resource_del_bulk_move_unevictable(bo->resource, bo);
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |                 ttm_resource_del_bulk_move
vim +1121 drivers/gpu/drm/ttm/ttm_bo_util.c
  1067
  1068	/**
  1069	 * ttm_bo_shrink() - Helper to shrink a ttm buffer object.
  1070	 * @ctx: The struct ttm_operation_ctx used for the shrinking operation.
  1071	 * @bo: The buffer object.
  1072	 * @flags: Flags governing the shrinking behaviour.
  1073	 *
  1074	 * The function uses the ttm_tt_back_up functionality to back up or
  1075	 * purge a struct ttm_tt. If the bo is not in system, it's first
  1076	 * moved there.
  1077	 *
  1078	 * Return: The number of pages shrunken or purged, or
  1079	 * negative error code on failure.
  1080	 */
  1081	long ttm_bo_shrink(struct ttm_operation_ctx *ctx, struct ttm_buffer_object *bo,
  1082			   const struct ttm_bo_shrink_flags flags)
  1083	{
  1084		static const struct ttm_place sys_placement_flags = {
  1085			.fpfn = 0,
  1086			.lpfn = 0,
  1087			.mem_type = TTM_PL_SYSTEM,
  1088			.flags = 0,
  1089		};
  1090		static struct ttm_placement sys_placement = {
  1091			.num_placement = 1,
  1092			.placement = &sys_placement_flags,
  1093		};
  1094		struct ttm_device *bdev = bo->bdev;
  1095		long lret;
  1096
  1097		dma_resv_assert_held(bo->base.resv);
  1098
  1099		if (flags.allow_move && bo->resource->mem_type != TTM_PL_SYSTEM) {
  1100			int ret = ttm_bo_validate(bo, &sys_placement, ctx);
  1101
  1102			/* Consider -ENOMEM and -ENOSPC non-fatal. */
  1103			if (ret) {
  1104				if (ret == -ENOMEM || ret == -ENOSPC)
  1105					ret = -EBUSY;
  1106				return ret;
  1107			}
  1108		}
  1109
  1110		ttm_bo_unmap_virtual(bo);
  1111		lret = ttm_bo_wait_ctx(bo, ctx);
  1112		if (lret < 0)
  1113			return lret;
  1114
  1115		lret = ttm_tt_backup(bdev, bo->ttm, (struct ttm_backup_flags)
  1116				     {.purge = flags.purge,
  1117				      .writeback = flags.writeback});
  1118
  1119		if (lret > 0) {
  1120			spin_lock(&bdev->lru_lock);
> 1121			ttm_resource_del_bulk_move_unevictable(bo->resource, bo);
  1122			ttm_resource_move_to_lru_tail(bo->resource);
  1123			spin_unlock(&bdev->lru_lock);
  1124		}
  1125
  1126		if (lret < 0 && lret != -EINTR)
  1127			return -EBUSY;
  1128
  1129		return lret;
  1130	}
  1131	EXPORT_SYMBOL(ttm_bo_shrink);
  1132
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
From: kernel test robot @ 2026-05-13 10:24 UTC (permalink / raw)
To: Thomas Hellström, intel-xe
Cc: oe-kbuild-all, Thomas Hellström, Christian König,
Huang Rui, Matthew Auld, Matthew Brost, Dave Airlie, dri-devel,
stable
Hi Thomas,
kernel test robot noticed the following build errors:
[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Thomas-Hellstr-m/drm-ttm-Fix-ttm_bo_shrink-infinite-LRU-walk-on-backup-failure/20260513-095356
base: https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next
patch link: https://lore.kernel.org/r/20260511162443.24352-1-thomas.hellstrom%40linux.intel.com
patch subject: [PATCH] drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
config: x86_64-allmodconfig (https://download.01.org/0day-ci/archive/20260513/202605131824.SbQ7agaE-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131824.SbQ7agaE-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131824.SbQ7agaE-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/gpu/drm/ttm/ttm_bo_util.c:1121:3: error: call to undeclared function 'ttm_resource_del_bulk_move_unevictable'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1121 |                 ttm_resource_del_bulk_move_unevictable(bo->resource, bo);
         |                 ^
   drivers/gpu/drm/ttm/ttm_bo_util.c:1121:3: note: did you mean 'ttm_resource_del_bulk_move'?
   include/drm/ttm/ttm_resource.h:449:6: note: 'ttm_resource_del_bulk_move' declared here
     449 | void ttm_resource_del_bulk_move(struct ttm_resource *res,
         |      ^
   1 error generated.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Claude review: drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
From: Claude Code Review Bot @ 2026-05-16 5:07 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
Author: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Patches: 4
Reviewed: 2026-05-16T15:07:24.448965
---
This is a single patch fixing an infinite LRU walk bug in `ttm_bo_shrink()`, mirroring a prior fix (b2ed01e7ad) applied to the `ttm_bo_swapout()` path. The fix is logically correct: it eliminates the problematic pattern of removing a resource from bulk_move before the backup attempt and re-adding on failure, which could disrupt cursor-based LRU iteration and cause infinite walks.
The approach — only touching bulk_move/LRU state after successful backup — is clean and consistent with the analogous swapout fix.
**Key dependency:** This patch uses `ttm_resource_del_bulk_move_unevictable()`, which does not exist in the current drm-next tree. It is presumably introduced by commit b2ed01e7ad. This patch cannot be applied standalone; the prerequisite must be present.
---
Generated by Claude Code Patch Reviewer
* Claude review: drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
From: Claude Code Review Bot @ 2026-05-16 5:07 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**The bug (before the fix):**
The old code in `ttm_bo_util.c:1115-1129` did:
```c
if (bo->bulk_move) {
	spin_lock(&bdev->lru_lock);
	ttm_resource_del_bulk_move(bo->resource, bo);
	spin_unlock(&bdev->lru_lock);
}
lret = ttm_tt_backup(...);
if (lret <= 0 && bo->bulk_move) {
	spin_lock(&bdev->lru_lock);
	ttm_resource_add_bulk_move(bo->resource, bo);
	spin_unlock(&bdev->lru_lock);
}
```
The `del_bulk_move` before backup removes the resource from the bulk_move list. If backup fails, `add_bulk_move` re-adds it, but this can reinsert the resource at a position the LRU cursor hasn't passed yet, causing the cursor to re-encounter the same BO, leading to an infinite walk. Critically, no `ttm_resource_move_to_lru_tail()` was called on failure, so the resource never moved away from the cursor.
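To make the failure mode concrete, here is a minimal, self-contained sketch in plain C. It is a toy model, not TTM code: `toy_lru`, `toy_backup()`, `toy_remove_at()` and `toy_walk()` are hypothetical names, the LRU is a flat array rather than TTM's linked lists and bulk-move ranges, and "odd ids fail to back up" is an arbitrary assumption. It only illustrates how requeueing a failed item ahead of the cursor can keep a walk from terminating, while leaving the item in place lets the cursor step past it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 8

/* Toy LRU: item ids in LRU order, head first. Not TTM's real layout. */
struct toy_lru {
	int ids[MAX_ITEMS];
	size_t len;
};

/* Assumption for illustration only: odd ids fail, even ids succeed. */
static bool toy_backup(int id)
{
	return (id % 2) == 0;
}

/* Remove the entry at @pos, closing the gap. */
static void toy_remove_at(struct toy_lru *lru, size_t pos)
{
	memmove(&lru->ids[pos], &lru->ids[pos + 1],
		(lru->len - pos - 1) * sizeof(lru->ids[0]));
	lru->len--;
}

/*
 * Walk the list from the head. Returns the number of visits, capped at
 * @max_visits so that a non-terminating walk can be observed safely.
 */
static size_t toy_walk(struct toy_lru *lru, bool requeue_on_failure,
		       size_t max_visits)
{
	size_t cursor = 0, visits = 0;

	assert(lru->len <= MAX_ITEMS);

	while (cursor < lru->len && visits < max_visits) {
		int id = lru->ids[cursor];

		visits++;
		if (toy_backup(id)) {
			/* Backed up: the item leaves the walked list. */
			toy_remove_at(lru, cursor);
		} else if (requeue_on_failure) {
			/* Old pattern: re-added ahead of the cursor. */
			toy_remove_at(lru, cursor);
			lru->ids[lru->len++] = id;
		} else {
			/* New pattern: untouched, the cursor steps past it. */
			cursor++;
		}
	}
	return visits;
}
```

With ids `{1, 2, 4}` and a cap of 100 visits, the requeueing variant runs into the cap, while the leave-in-place variant finishes after 3 visits with item 1 still on the list.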
**The fix:**
```c
lret = ttm_tt_backup(...);
if (lret > 0) {
	spin_lock(&bdev->lru_lock);
	ttm_resource_del_bulk_move_unevictable(bo->resource, bo);
	ttm_resource_move_to_lru_tail(bo->resource);
	spin_unlock(&bdev->lru_lock);
}
```
- **Failure path**: No bulk_move manipulation at all. The resource stays in place, the cursor naturally advances past it. This is correct.
- **Success path**: Uses `del_bulk_move_unevictable` then `move_to_lru_tail`. This is necessary because after successful backup, `TTM_TT_FLAG_BACKED_UP` is set (`ttm_tt.c:292`), making `ttm_resource_unevictable()` return true. The regular `ttm_resource_del_bulk_move()` would be a **no-op** here since it checks `!ttm_resource_unevictable(res, bo)` and skips unevictable resources (`ttm_resource.c:291`). The `_unevictable` variant is essential to correctly remove the resource from bulk_move tracking during this evictable-to-unevictable transition.
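The no-op argument in the success-path bullet can be sketched with a small stand-alone C model. Everything here is hypothetical (`toy_res` and the `toy_*` functions are not kernel symbols), and the body of the real `ttm_resource_del_bulk_move_unevictable()` is not shown in this thread, so the unguarded variant below is only an assumption about its intent.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy resource: two flags standing in for the real TTM state. */
struct toy_res {
	bool backed_up;    /* stands in for TTM_TT_FLAG_BACKED_UP */
	bool in_bulk_move; /* still tracked by a bulk-move range? */
};

static bool toy_unevictable(const struct toy_res *res)
{
	return res->backed_up;
}

/* A successful backup flips the resource from evictable to unevictable. */
static void toy_backup_succeeded(struct toy_res *res)
{
	assert(!res->backed_up);
	res->backed_up = true;
}

/* Guarded removal: skips anything that already looks unevictable. */
static void toy_del_bulk_move(struct toy_res *res)
{
	if (!toy_unevictable(res))
		res->in_bulk_move = false;
}

/*
 * Assumed intent of the _unevictable variant: drop the bulk-move
 * tracking even though the resource has just become unevictable.
 */
static void toy_del_bulk_move_unevictable(struct toy_res *res)
{
	res->in_bulk_move = false;
}
```

Calling `toy_del_bulk_move()` after `toy_backup_succeeded()` leaves `in_bulk_move` set, which is the stale-tracking case the bullet describes; the unguarded variant clears it.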
**Correctness: Looks correct.** The approach is sound and matches the swapout fix pattern.
**Minor observations:**
1. **Dependency not stated in the commit message.** The patch references commit b2ed01e7ad as the analogous fix but doesn't explicitly note that `ttm_resource_del_bulk_move_unevictable()` is introduced by that commit. Adding a note (e.g., a "Depends-on:" tag or a mention in the commit body) would help patch management, especially given the `Cc: stable` tag — both patches need to land together in stable.
2. **`Assisted-by: GitHub_Copilot:claude-opus-4.6`**: This tag format is non-standard. The kernel community has not converged on a standardized AI-assistance tag, and existing conventions lean toward naming the model's provider (Anthropic, for Claude) rather than the front-end tool (GitHub Copilot). This is a minor process/policy matter, not a code issue.
3. **No `move_to_lru_tail` on failure, unlike the swapout path.** The swapout path (`ttm_bo.c:1186-1189`) calls `move_to_lru_tail` even on failure. The shrink path omits this, which is fine: in the shrink caller (`xe_shrinker.c:82-83`), any negative return aborts the walk immediately (`if (lret < 0) return lret`), and a zero return simply means no pages freed so the cursor advances normally. The LRU tail movement on failure is unnecessary here. However, if you wanted maximum consistency with the swapout path, adding `move_to_lru_tail` on failure would be a defensive improvement to ensure the resource moves to the tail even if callers change in the future.
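The return-value reasoning in observation 3 can be sketched as a small stand-alone C model. `toy_map_backup_result()` mirrors the tail of `ttm_bo_shrink()` as quoted in the kernel test robot reports in this thread; `toy_shrink_walk()` is a hypothetical caller that follows the behaviour described above (abort on a negative return, advance on zero), not the actual xe shrinker code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Same mapping as the end of ttm_bo_shrink(): every error except
 * -EINTR is reported to the caller as -EBUSY.
 */
static long toy_map_backup_result(long lret)
{
	if (lret < 0 && lret != -EINTR)
		return -EBUSY;
	return lret;
}

/*
 * Hypothetical walk over per-object backup results. A negative result
 * aborts the walk; zero frees nothing and the walk simply moves on.
 * @visited receives the number of objects looked at.
 */
static long toy_shrink_walk(const long *results, size_t n, size_t *visited)
{
	long freed = 0;
	size_t i;

	assert(visited != NULL);

	for (i = 0; i < n; i++) {
		long lret = toy_map_backup_result(results[i]);

		if (lret < 0) {
			*visited = i + 1;
			return lret;
		}
		freed += lret;
	}
	*visited = n;
	return freed;
}
```

In this model a failing object never needs to be moved on the LRU for the walk to make progress: either the walk stops at it, or it contributes zero pages and the next object is visited.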
**Verdict:** The fix is correct and necessary. The main thing to ensure is that the dependency on `ttm_resource_del_bulk_move_unevictable()` is properly tracked for stable backports.
---
Generated by Claude Code Patch Reviewer