* [PATCH v5] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
@ 2026-05-31 8:45 Qiliang Yuan
2026-06-04 5:03 ` Claude review: " Claude Code Review Bot
2026-06-04 5:03 ` Claude Code Review Bot
0 siblings, 2 replies; 7+ messages in thread
From: Qiliang Yuan @ 2026-05-31 8:45 UTC (permalink / raw)
To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
Natalie Vock
Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan
The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota. GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.
Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit). Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.
Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.
Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter. When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, and
verifies that the pool is above its effective minimum to respect
dmem.min protection. For the limit-hitting cgroup's own BOs, the
ancestry check is skipped but the high threshold still applies.
Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 targets only BOs whose cgroup exceeds dmem.high, using a
blocking lock when a ticket is available or trylock otherwise.
Pass 2 falls back to the standard above-elow trylock eviction.
Pass 3+ uses proper locking and repeats while making progress
with the existing low-watermark fallback.
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.
The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.
The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V4 -> V5:
- Restore the original control flow in dmem_cgroup_state_evict_valuable():
test_pool is no longer dereferenced before the ancestry checks, fixing
a NULL pointer dereference on BOs without a cgroup. The limit_pool
NULL-to-root-cgroup resolution is now performed before the try_high
block, fixing a panic during global memory pressure eviction.
- Keep the try_high check for limit_pool == test_pool inside the existing
early-return branch to avoid bypassing the hierarchy constraint check
that prevents cross-cgroup eviction.
- Use a blocking lock in Pass 1 only when a ticket is available
(trylock otherwise), addressing the deadlock risk of blocking without
a valid ww_acquire_ctx.
- Explicitly reset trylock_only to true before Pass 2 so it does not
inherit Pass 1's blocking behavior.
V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
over-limit cgroups are penalized even when their BOs are actively
in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
early-return so that the limit-hitting cgroup's own BOs are also
filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
which is no longer needed now that Pass 1 uses a blocking lock.
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
child cgroups from evading the penalty when a parent cgroup exceeds
its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
so that actively-used over-limit BOs (which failed trylock in Pass 1)
are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
to match the behavior of get_resource_max().
V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
evicted first in a dedicated pass. No throttling or sleeping is
performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
(1) trylock: evict only BOs exceeding dmem.high
(2) trylock: above-elow
(3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().
v4: https://lore.kernel.org/r/20260530-feature-dmem-high-v4-1-ee7c6ec1c8da@gmail.com
v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
drivers/gpu/drm/ttm/ttm_bo.c | 33 ++++++++++++++----
include/linux/cgroup_dmem.h | 4 +--
kernel/cgroup/dmem.c | 79 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 103 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..21fe34fd43eec 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
/** @limit_pool: Which pool limit we should test against */
struct dmem_cgroup_pool_state *limit_pool;
+ /** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+ bool try_high;
/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
bool try_low;
/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
s64 lret;
if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
- evict_walk->try_low, &evict_walk->hit_low))
+ evict_walk->try_high, evict_walk->try_low,
+ &evict_walk->hit_low))
return 0;
if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,47 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
};
s64 lret;
- evict_walk.walk.arg.trylock_only = true;
+ /*
+ * Pass 1 (high-priority): Evict only BOs whose cgroup exceeds its
+ * dmem.high soft limit. A blocking lock is used when a ticket is
+ * available to ensure over-limit cgroups are penalized even when
+ * their BOs are actively in use; trylock otherwise.
+ */
+ evict_walk.walk.arg.trylock_only = !ticket;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;
- /* One more attempt if we hit low limit? */
+ /*
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
+ * Falls back to low-priority eviction if needed.
+ */
+ evict_walk.walk.arg.trylock_only = true;
+ lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
if (!lret && evict_walk.hit_low) {
evict_walk.try_low = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
}
+
if (lret || !ticket)
goto out;
- /* Reset low limit */
+ /*
+ * Pass 3+ (properly locked): Evict while making progress.
+ * Reset flags and retry with try_low if we hit the low watermark.
+ */
evict_walk.try_low = evict_walk.hit_low = false;
- /* If ticket-locking, repeat while making progress. */
evict_walk.walk.arg.trylock_only = false;
retry:
do {
- /* The walk may clear the evict_walk.walk.ticket field */
evict_walk.walk.arg.ticket = ticket;
evict_walk.evicted = 0;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
} while (!lret && evict_walk.evicted);
- /* We hit the low limit? Try once more */
if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
evict_walk.try_low = true;
goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low);
+ bool try_high, bool ignore_low, bool *ret_hit_low);
void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
#else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
static inline
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
return true;
}
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..29f8c68e92f7f 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_low(&pool->cnt, val);
}
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+ page_counter_set_high(&pool->cnt, val);
+}
+
static void
set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
{
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
return pool ? READ_ONCE(pool->cnt.low) : 0;
}
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+ return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
set_resource_low(rpool, 0);
+ set_resource_high(rpool, PAGE_COUNTER_MAX);
set_resource_max(rpool, PAGE_COUNTER_MAX);
}
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
* dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
* @limit_pool: The pool for which we hit limits
* @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
* @ignore_low: Whether we have to respect low watermarks.
* @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
*
* This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
* When returning false and @ignore_low is false, @ret_hit_low may
* be set to true to indicate this function can be retried with @ignore_low
* set to true.
@@ -301,15 +316,26 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
*/
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
struct dmem_cgroup_pool_state *pool = test_pool;
struct page_counter *ctest;
u64 used, min, low;
- /* Can always evict from current pool, despite limits */
- if (limit_pool == test_pool)
+ /*
+ * When the limit-hitting cgroup's own BOs are being considered
+ * in try_high mode, only evict them if their pool exceeds its
+ * own dmem.high limit. For non-try_high mode, maintain the
+ * existing behavior: always evict from the limit-hitting pool.
+ */
+ if (limit_pool == test_pool) {
+ if (try_high) {
+ ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+ return used > READ_ONCE(ctest->high);
+ }
return true;
+ }
if (limit_pool) {
if (!parent_dmemcs(limit_pool->cs))
@@ -330,10 +356,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
}
ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+
+ if (try_high) {
+ struct page_counter *c;
+
+ /*
+ * Walk the page_counter parent chain to check whether any
+ * ancestor cgroup exceeds its dmem.high limit. This prevents
+ * child cgroups from evading the penalty when a parent cgroup
+ * is over its high limit.
+ */
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) > READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
+
+ /*
+ * Respect dmem.min protection: do not evict BOs below the
+ * effective minimum even during the high-priority pass.
+ */
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+
+ return used > min;
+ }
dmem_cgroup_calculate_protection(limit_pool, test_pool);
- used = page_counter_read(ctest);
min = READ_ONCE(ctest->emin);
if (used <= min)
@@ -835,6 +889,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
}
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +933,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "max",
.write = dmem_cgroup_region_max_write,
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38
Best regards,
--
Qiliang Yuan <realwujing@gmail.com>
^ permalink raw reply related [flat|nested] 7+ messages in thread* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-31 8:45 [PATCH v5] cgroup/dmem: implement dmem.high soft limit via prioritized eviction Qiliang Yuan
@ 2026-06-04 5:03 ` Claude Code Review Bot
2026-06-04 5:03 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 5:03 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
Author: Qiliang Yuan <realwujing@gmail.com>
Patches: 1
Reviewed: 2026-06-04T15:03:19.411766
---
This is a single-patch v5 series adding a `dmem.high` soft limit to the dmem cgroup v2 controller. The concept is sound: when a hard `max` limit is hit and eviction is triggered, BOs from cgroups exceeding their `dmem.high` are targeted first via a new "Pass 1" in `ttm_bo_evict_alloc()`. This provides a useful middle ground between no limit and hard failure.
The patch has gone through significant iteration (v1→v5) addressing NULL-pointer issues, deadlock risks, and control flow problems. The v5 changes are clearly motivated by real bugs found in v4. However, several issues remain:
**Key concerns:**
1. **Missing `elow` check in the `try_high` path** — The try_high block respects `emin` but silently ignores `elow`, meaning BOs protected by `dmem.low` can be evicted during the high-priority pass.
2. **No documentation** — The patch adds a new cgroupfs control file but provides no documentation update (e.g., `Documentation/admin-guide/cgroup-v2.rst` or similar).
3. **The `try_high` path for `limit_pool == test_pool` skips `emin` protection** — When a cgroup hits its own limit and the same pool is being tested, the code only checks `used > high` but does not check against `emin`, unlike the cross-cgroup path.
4. **The 3-pass strategy has a subtle interaction** — Pass 1 uses blocking locks when a ticket is available, but Pass 2 always uses trylock. If Pass 1 fails to evict anything (all over-high BOs were locked and no ticket), and Pass 2 also fails to evict, the code falls through to Pass 3 which is the existing properly-locked path. This seems correct but could benefit from a comment about the fallback chain.
**Minor issues:**
- The cover letter text in the commit message body is duplicative with the commit message itself.
- Typo in the v2 changelog: "above-elow" (line 199).
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-31 8:45 [PATCH v5] cgroup/dmem: implement dmem.high soft limit via prioritized eviction Qiliang Yuan
2026-06-04 5:03 ` Claude review: " Claude Code Review Bot
@ 2026-06-04 5:03 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 5:03 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
#### kernel/cgroup/dmem.c — `dmem_cgroup_state_evict_valuable()`
**Issue 1 (medium): `limit_pool == test_pool` path in try_high mode skips emin protection**
```c
if (limit_pool == test_pool) {
if (try_high) {
ctest = &test_pool->cnt;
used = page_counter_read(ctest);
return used > READ_ONCE(ctest->high);
}
return true;
}
```
When a cgroup hits its own limit and `try_high` is set, this returns true solely based on `used > high` without checking `emin`. In contrast, the cross-cgroup `try_high` path (added later in the function) explicitly calls `dmem_cgroup_calculate_protection()` and checks `used > min`. This inconsistency means that in the self-eviction case during Pass 1, BOs below `emin` protection could be evicted. While `limit_pool == test_pool` implies this is the cgroup evicting its own BOs (so protection may be less relevant since the cgroup is over its own limits), it is still an inconsistency with the later code path and should at minimum be documented as intentional.
**Issue 2 (medium): try_high path does not respect elow**
The try_high cross-cgroup path checks `emin` but not `elow`:
```c
if (try_high) {
...
dmem_cgroup_calculate_protection(limit_pool, test_pool);
min = READ_ONCE(ctest->emin);
return used > min;
}
```
The normal (non-try_high) path checks both `emin` and `elow` (with `ret_hit_low` signaling). The try_high path ignores `elow` entirely. This means during Pass 1, a BO from a cgroup that is over its parent's `dmem.high` but protected by `dmem.low` could be evicted. If this is intentional (i.e., "high overrides low for the penalization pass"), it should be documented. If not, it's a correctness issue.
**Issue 3 (minor): `used` variable read twice unnecessarily**
```c
ctest = &test_pool->cnt;
used = page_counter_read(ctest);
if (try_high) {
...
if (used <= READ_ONCE(ctest->high)) {
```
The `used = page_counter_read(ctest)` is hoisted before the `try_high` block. This read was previously placed after `dmem_cgroup_calculate_protection()`. Moving it earlier means the value is read before protection calculation in the non-try_high path. Looking at the diff more carefully, the original code had:
```c
ctest = &test_pool->cnt;
dmem_cgroup_calculate_protection(limit_pool, test_pool);
used = page_counter_read(ctest);
```
The patch changes this to:
```c
ctest = &test_pool->cnt;
used = page_counter_read(ctest);
if (try_high) { ... }
dmem_cgroup_calculate_protection(limit_pool, test_pool);
min = READ_ONCE(ctest->emin);
```
This reordering means `used` is now read before `dmem_cgroup_calculate_protection()` in the normal path too. In practice, `page_counter_read()` just reads an atomic counter and `dmem_cgroup_calculate_protection()` only writes `emin`/`elow`, so this reordering is harmless. But it's a subtle behavioral change that's not called out.
#### kernel/cgroup/dmem.c — New accessors and cgroupfs file
The `set_resource_high()`, `get_resource_high()`, `reset_all_resource_limits()` changes, and the new cgroupfs entries all look correct and follow the existing patterns exactly. `get_resource_high()` correctly returns `PAGE_COUNTER_MAX` for `NULL` pool, matching `get_resource_max()`.
The new cftype entry is placed between `low` and `max` in the files array, which is the natural ordering (min < low < high < max). Good.
#### include/linux/cgroup_dmem.h
The stub for `!CONFIG_CGROUP_DMEM` correctly adds the `try_high` parameter and ignores it (always returns `true`). This is correct — when dmem cgroups are disabled, all BOs are evictable.
#### drivers/gpu/drm/ttm/ttm_bo.c — 3-pass eviction strategy
**Issue 4 (low): Pass 1 blocking semantics could still cause latency issues**
```c
evict_walk.walk.arg.trylock_only = !ticket;
evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
```
When a ticket is available, Pass 1 uses blocking locks. This is intentional (v4 changelog says "as requested by Maarten Lankhorst") to ensure over-limit cgroups are actually penalized. However, this means Pass 1 could block waiting for a BO that ultimately turns out not to be in an over-high cgroup. The `dmem_cgroup_state_evict_valuable()` check happens inside `ttm_bo_evict_cb()`, which is called after the BO lock is acquired. So the blocking lock is acquired before knowing whether this BO is even a candidate for try_high eviction. This is a pre-existing pattern in the TTM LRU walk (the low/min checks also happen after locking), so it's not a new issue introduced by this patch, but it's worth noting for the commit message.
**Issue 5 (low): hit_low not reset between Pass 1 and Pass 2**
```c
evict_walk.try_high = false;
if (lret)
goto out;
/* Pass 2 */
evict_walk.walk.arg.trylock_only = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
if (!lret && evict_walk.hit_low) {
```
`hit_low` is not reset between Pass 1 and Pass 2. If Pass 1 happened to set `hit_low` (which can't actually happen since the try_high path never sets `*ret_hit_low`), this would cause Pass 2 to immediately try the low fallback. Since the try_high path in `dmem_cgroup_state_evict_valuable()` returns before reaching the `*ret_hit_low = true` line, this is not a bug in practice. But it relies on the invariant that try_high never touches hit_low — fragile if the code evolves.
**Overall comment on the ttm_bo.c changes:** The 3-pass structure is clean and readable. The comments added are appropriate and helpful. The removal of several now-redundant comments from the original code is fine.
#### Commit message
The commit message is well-written and clearly explains the motivation and approach. The below-the-`---` section duplicates much of the above-the-`---` content, which is unusual — typically the section below `---` is for changelog and notes, not a re-summary. The v4→v5 changelog is valuable and clearly describes the fixes.
#### Missing: Documentation
There is no update to any documentation file. The `dmem` cgroup controller should have its `high` knob documented alongside `min`, `low`, and `max`. This is typically expected for new cgroupfs control files.
---
**Summary:** The patch is well-motivated and has clearly improved through the v1→v5 iterations. The main substantive concerns are (1) inconsistent `emin`/`elow` protection in the try_high paths and (2) missing documentation for the new cgroupfs control file. The TTM eviction refactor is clean and correct. I'd suggest addressing the protection consistency issues and adding documentation before merging.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v6] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
@ 2026-05-31 9:52 Qiliang Yuan
2026-06-04 4:56 ` Claude review: " Claude Code Review Bot
2026-06-04 4:56 ` Claude Code Review Bot
0 siblings, 2 replies; 7+ messages in thread
From: Qiliang Yuan @ 2026-05-31 9:52 UTC (permalink / raw)
To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
Natalie Vock
Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan
The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota. GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.
Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit). Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.
Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.
Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter. When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, and
verifies that the pool is above its effective minimum to respect
dmem.min protection. For the limit-hitting cgroup's own BOs, the
ancestry check is skipped but the high threshold still applies.
When CONFIG_CGROUP_DMEM is disabled, the stub returns false in
try_high mode so the first pass has no effect.
Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 targets only BOs whose cgroup exceeds dmem.high, using a
blocking lock when a ticket is available or trylock otherwise.
Pass 2 falls back to the standard above-elow trylock eviction.
Pass 3+ uses proper locking and repeats while making progress
with the existing low-watermark fallback.
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.
The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.
The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V5 -> V6:
- Guard the try_high dereference of test_pool->cnt with a NULL check
to prevent a kernel panic during global memory pressure eviction
when a BO has no associated cgroup.
- Make the disabled-cgroup stub for dmem_cgroup_state_evict_valuable()
return false in try_high mode so the stub does not incorrectly
enable Pass 1 when CONFIG_CGROUP_DMEM=n.
V4 -> V5:
- Restore the original control flow in dmem_cgroup_state_evict_valuable():
test_pool is no longer dereferenced before the ancestry checks, fixing
a NULL pointer dereference on BOs without a cgroup. The limit_pool
NULL-to-root-cgroup resolution is now performed before the try_high
block, fixing a panic during global memory pressure eviction.
- Keep the try_high check for limit_pool == test_pool inside the existing
early-return branch to avoid bypassing the hierarchy constraint check
that prevents cross-cgroup eviction.
- Use a blocking lock in Pass 1 only when a ticket is available
(trylock otherwise), addressing the deadlock risk of blocking without
a valid ww_acquire_ctx.
- Explicitly reset trylock_only to true before Pass 2 so it does not
inherit Pass 1's blocking behavior.
V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
over-limit cgroups are penalized even when their BOs are actively
in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
early-return so that the limit-hitting cgroup's own BOs are also
filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
which is no longer needed now that Pass 1 uses a blocking lock.
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
child cgroups from evading the penalty when a parent cgroup exceeds
its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
so that actively-used over-limit BOs (which failed trylock in Pass 1)
are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
to match the behavior of get_resource_max().
V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
evicted first in a dedicated pass. No throttling or sleeping is
performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
(1) trylock: evict only BOs exceeding dmem.high
(2) trylock: above-elow
(3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().
v5: https://lore.kernel.org/r/20260531-feature-dmem-high-v5-1-1c6c532b26a9@gmail.com
v4: https://lore.kernel.org/r/20260530-feature-dmem-high-v4-1-ee7c6ec1c8da@gmail.com
v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
drivers/gpu/drm/ttm/ttm_bo.c | 33 ++++++++++++++----
include/linux/cgroup_dmem.h | 6 ++--
kernel/cgroup/dmem.c | 79 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 105 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..21fe34fd43eec 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
/** @limit_pool: Which pool limit we should test against */
struct dmem_cgroup_pool_state *limit_pool;
+ /** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+ bool try_high;
/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
bool try_low;
/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
s64 lret;
if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
- evict_walk->try_low, &evict_walk->hit_low))
+ evict_walk->try_high, evict_walk->try_low,
+ &evict_walk->hit_low))
return 0;
if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,47 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
};
s64 lret;
- evict_walk.walk.arg.trylock_only = true;
+ /*
+ * Pass 1 (high-priority): Evict only BOs whose cgroup exceeds its
+ * dmem.high soft limit. A blocking lock is used when a ticket is
+ * available to ensure over-limit cgroups are penalized even when
+ * their BOs are actively in use; trylock otherwise.
+ */
+ evict_walk.walk.arg.trylock_only = !ticket;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;
- /* One more attempt if we hit low limit? */
+ /*
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
+ * Falls back to low-priority eviction if needed.
+ */
+ evict_walk.walk.arg.trylock_only = true;
+ lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
if (!lret && evict_walk.hit_low) {
evict_walk.try_low = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
}
+
if (lret || !ticket)
goto out;
- /* Reset low limit */
+ /*
+ * Pass 3+ (properly locked): Evict while making progress.
+ * Reset flags and retry with try_low if we hit the low watermark.
+ */
evict_walk.try_low = evict_walk.hit_low = false;
- /* If ticket-locking, repeat while making progress. */
evict_walk.walk.arg.trylock_only = false;
retry:
do {
- /* The walk may clear the evict_walk.walk.ticket field */
evict_walk.walk.arg.ticket = ticket;
evict_walk.evicted = 0;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
} while (!lret && evict_walk.evicted);
- /* We hit the low limit? Try once more */
if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
evict_walk.try_low = true;
goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..3f7278cb290b3 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low);
+ bool try_high, bool ignore_low, bool *ret_hit_low);
void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
#else
@@ -54,8 +54,10 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
static inline
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
+ if (try_high)
+ return false;
return true;
}
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..4267309e6b01d 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_low(&pool->cnt, val);
}
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+ page_counter_set_high(&pool->cnt, val);
+}
+
static void
set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
{
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
return pool ? READ_ONCE(pool->cnt.low) : 0;
}
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+ return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
set_resource_low(rpool, 0);
+ set_resource_high(rpool, PAGE_COUNTER_MAX);
set_resource_max(rpool, PAGE_COUNTER_MAX);
}
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
* dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
* @limit_pool: The pool for which we hit limits
* @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
* @ignore_low: Whether we have to respect low watermarks.
* @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
*
* This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
* When returning false and @ignore_low is false, @ret_hit_low may
* be set to true to indicate this function can be retried with @ignore_low
* set to true.
@@ -301,15 +316,26 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
*/
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
struct dmem_cgroup_pool_state *pool = test_pool;
struct page_counter *ctest;
u64 used, min, low;
- /* Can always evict from current pool, despite limits */
- if (limit_pool == test_pool)
+ /*
+ * When the limit-hitting cgroup's own BOs are being considered
+ * in try_high mode, only evict them if their pool exceeds its
+ * own dmem.high limit. For non-try_high mode, maintain the
+ * existing behavior: always evict from the limit-hitting pool.
+ */
+ if (limit_pool == test_pool) {
+ if (try_high && test_pool) {
+ ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+ return used > READ_ONCE(ctest->high);
+ }
return true;
+ }
if (limit_pool) {
if (!parent_dmemcs(limit_pool->cs))
@@ -330,10 +356,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
}
ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+
+ if (try_high) {
+ struct page_counter *c;
+
+ /*
+ * Walk the page_counter parent chain to check whether any
+ * ancestor cgroup exceeds its dmem.high limit. This prevents
+ * child cgroups from evading the penalty when a parent cgroup
+ * is over its high limit.
+ */
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) > READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
+
+ /*
+ * Respect dmem.min protection: do not evict BOs below the
+ * effective minimum even during the high-priority pass.
+ */
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+
+ return used > min;
+ }
dmem_cgroup_calculate_protection(limit_pool, test_pool);
- used = page_counter_read(ctest);
min = READ_ONCE(ctest->emin);
if (used <= min)
@@ -835,6 +889,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
}
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +933,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "max",
.write = dmem_cgroup_region_max_write,
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38
Best regards,
--
Qiliang Yuan <realwujing@gmail.com>
^ permalink raw reply related [flat|nested] 7+ messages in thread* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-31 9:52 [PATCH v6] " Qiliang Yuan
@ 2026-06-04 4:56 ` Claude Code Review Bot
2026-06-04 4:56 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 4:56 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
Author: Qiliang Yuan <realwujing@gmail.com>
Patches: 2
Reviewed: 2026-06-04T14:56:52.172403
---
This is a single-patch series (v6) adding a `dmem.high` soft limit to the dmem cgroup v2 controller. The concept is sound: rather than only having a hard `max` limit that causes allocation failures, a `high` soft limit causes prioritized eviction of buffer objects belonging to over-limit cgroups. The implementation adds a 3-pass eviction strategy in TTM and a `try_high` parameter to `dmem_cgroup_state_evict_valuable()`.
The patch has gone through 6 revisions, fixing real bugs (NULL dereferences, deadlock risks, stub behavior). The core design -- prioritized eviction via a first pass targeting over-high cgroups -- is a reasonable approach. However, there are several issues ranging from correctness concerns to minor nits.
**Key concerns:**
1. The `try_high` pass respects `dmem.min` but ignores `dmem.low` -- this creates an inconsistency where the high-priority eviction pass is more aggressive than the standard pass with respect to low-watermark protection.
2. The parent-chain walk for the `try_high` check traverses `page_counter` parents, which correspond to cgroup hierarchy parents. This is correct but the walk has no bound and could be expensive in deeply nested hierarchies (minor concern).
3. The `limit_pool == test_pool` + `try_high` path does NOT check `dmem.min` protection, unlike the general `try_high` path -- this is asymmetric.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-31 9:52 [PATCH v6] " Qiliang Yuan
2026-06-04 4:56 ` Claude review: " Claude Code Review Bot
@ 2026-06-04 4:56 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 4:56 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**TTM changes (drivers/gpu/drm/ttm/ttm_bo.c):**
**Issue 1 (medium): Pass 1 blocking lock without ticket is a no-op, not a trylock concern**
```c
+ evict_walk.walk.arg.trylock_only = !ticket;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;
```
When `ticket` is available, Pass 1 uses a blocking lock (`trylock_only = false`). When `ticket` is NULL, it falls through to trylock. This is fine for deadlock safety. However, the `if (lret) goto out` means if Pass 1 succeeds (returns positive), we skip Passes 2 and 3 entirely. This is correct behavior -- if we evicted a high-limit BO and got a resource, we're done.
**Issue 2 (minor): hit_low is not reset between Pass 1 and Pass 2**
Pass 1 could set `evict_walk.hit_low` to true (via `dmem_cgroup_state_evict_valuable` returning false with `ret_hit_low` set). This `hit_low` state then carries into Pass 2. In the original code, `hit_low` was only set during what is now Pass 2, so it was clean. Now there's potential for Pass 1 to set `hit_low`, causing Pass 2 to immediately retry with `try_low = true` even if Pass 2 itself didn't hit the low limit. This is likely harmless (an extra attempt isn't wrong, just unnecessary) but is a subtle behavioral change.
Consider resetting `hit_low` before Pass 2:
```c
+ evict_walk.hit_low = false; /* Reset from Pass 1 */
evict_walk.walk.arg.trylock_only = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
```
**Issue 3 (nit): Comment says "above-elow" (typo in changelog too)**
The commit message and V1->V2 changelog both contain "above-elow" which should be "above-elow" (effective low). This appears to be a pre-existing typo that's been carried forward through versions. The in-code comment at Pass 2 says:
```c
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
```
This is clear enough.
**dmem cgroup header changes (include/linux/cgroup_dmem.h):**
**The CONFIG_CGROUP_DMEM=n stub looks correct:**
```c
+ if (try_high)
+ return false;
return true;
```
When cgroup support is disabled, there's no high limit to check, so the `try_high` pass should be a no-op (returning false means "don't evict this one in the high pass"). This was the v6 fix and is correct.
**dmem cgroup core changes (kernel/cgroup/dmem.c):**
**Issue 4 (correctness concern): Asymmetric min protection in limit_pool == test_pool path**
```c
+ if (limit_pool == test_pool) {
+ if (try_high && test_pool) {
+ ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+ return used > READ_ONCE(ctest->high);
+ }
return true;
+ }
```
When `limit_pool == test_pool` and `try_high` is set, the code checks `used > high` but does NOT check `dmem.min` protection. In contrast, the general `try_high` path later in the function does:
```c
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+ return used > min;
```
This asymmetry means: for the cgroup that hit the limit, even if its usage is below `emin`, it will still be evicted in the high pass as long as `used > high`. This might be intentional (the cgroup that caused the limit hit should be penalized regardless of min protection), but it's undocumented and could surprise users who set both `min` and `high`.
**Issue 5 (minor concern): try_high path ignores low watermark**
The general `try_high` code block checks `dmem.min` but not `dmem.low`:
```c
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+ return used > min;
```
The non-try_high path checks both `min` and `low`. This means the high-priority eviction pass is more aggressive than the normal pass -- it will evict BOs that are between `emin` and `elow`. This is probably intentional (the point is to aggressively reclaim from over-high cgroups), but worth noting.
**Issue 6 (correctness): Parent chain walk uses page_counter parents, not cgroup parents**
```c
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) > READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
```
This walks `page_counter` parents, which correspond to dmem pool parents, which follow the cgroup hierarchy. This is correct -- it checks if any ancestor cgroup exceeds its `high` limit. If the test_pool itself is under its high limit AND no ancestor is over their high limit, return false (don't evict). If any ancestor is over, continue to the min check. This logic is sound.
However, there's a subtlety: if `test_pool->cnt.parent` is NULL (root cgroup), the loop body never executes, `c` is NULL, and the function returns false. For a BO owned by the root cgroup during `try_high`, this means it won't be evicted in Pass 1 (unless its own usage exceeds high). This seems correct -- the root cgroup shouldn't be penalized by the high pass.
**Issue 7 (nit): `set_resource_high` doesn't validate against current usage**
`page_counter_set_high()` is a simple `WRITE_ONCE` with no validation. Unlike `page_counter_set_max()` which can fail, setting `high` below current usage silently succeeds and immediately makes the cgroup eligible for eviction. This is consistent with how memory cgroup `memory.high` works, but the commit message doesn't mention this behavior.
**Issue 8 (style): The cftype entry placement is good**
```c
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
```
Placed between `low` and `max` which matches the natural ordering (min < low < high < max). Correct flags (`CFTYPE_NOT_ON_ROOT` matches the other limit files).
**Issue 9 (documentation): No documentation update**
There's no update to Documentation/ for the new `dmem.high` control file. The cgroup v2 documentation should describe the semantics of this new knob, especially since it has a non-trivial interaction with `min`, `low`, and `max`.
**Overall assessment:** The patch is functionally sound and addresses a real need. The v6 NULL-check fix is correct. The main concerns are:
- `hit_low` not being reset between Pass 1 and Pass 2 (minor but could cause unnecessary retries)
- Asymmetric `min` protection in the `limit_pool == test_pool` path (design question, should be documented/justified)
- Missing documentation for the new cgroup control file
- The patch should consider whether it needs an `events` file or similar for observability (how does an admin know evictions happened due to `high`?)
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v4] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
@ 2026-05-30 7:35 Qiliang Yuan
2026-06-04 5:52 ` Claude review: " Claude Code Review Bot
2026-06-04 5:52 ` Claude Code Review Bot
0 siblings, 2 replies; 7+ messages in thread
From: Qiliang Yuan @ 2026-05-30 7:35 UTC (permalink / raw)
To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
Natalie Vock
Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan
The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota. GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.
Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit). Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.
Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.
Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter. When set, the function evaluates the try_high condition
first (before the limit_pool == test_pool shortcut) so that even the
limit-hitting cgroup's own BOs are filtered by the high threshold.
It then walks the page_counter parent chain to check whether any
ancestor exceeds its high limit, and verifies that the pool is above
its effective minimum to respect dmem.min protection.
Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 uses a blocking lock and targets only BOs whose cgroup exceeds
dmem.high, ensuring over-limit cgroups are penalized even when their
BOs are actively in use. Pass 2 falls back to the standard above-elow
trylock eviction. Pass 3+ uses proper locking and repeats while
making progress with the existing low-watermark fallback.
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.
The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.
The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
over-limit cgroups are penalized even when their BOs are actively
in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
early-return so that the limit-hitting cgroup's own BOs are also
filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
which is no longer needed now that Pass 1 uses a blocking lock.
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
child cgroups from evading the penalty when a parent cgroup exceeds
its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
so that actively-used over-limit BOs (which failed trylock in Pass 1)
are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
to match the behavior of get_resource_max().
V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
evicted first in a dedicated pass. No throttling or sleeping is
performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
(1) trylock: evict only BOs exceeding dmem.high
(2) trylock: above-elow
(3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().
v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
drivers/gpu/drm/ttm/ttm_bo.c | 32 +++++++++++++----
include/linux/cgroup_dmem.h | 4 +--
kernel/cgroup/dmem.c | 81 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 104 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..bf06e9e4b18a3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
/** @limit_pool: Which pool limit we should test against */
struct dmem_cgroup_pool_state *limit_pool;
+ /** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+ bool try_high;
/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
bool try_low;
/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
s64 lret;
if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
- evict_walk->try_low, &evict_walk->hit_low))
+ evict_walk->try_high, evict_walk->try_low,
+ &evict_walk->hit_low))
return 0;
if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,46 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
};
s64 lret;
- evict_walk.walk.arg.trylock_only = true;
+ /*
+ * Pass 1 (blocking, high-priority): Evict only BOs whose cgroup
+ * exceeds its dmem.high soft limit. A blocking lock is used to
+ * ensure over-limit cgroups are penalized even when their BOs are
+ * actively in use.
+ */
+ evict_walk.walk.arg.trylock_only = false;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;
- /* One more attempt if we hit low limit? */
+ /*
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
+ * Falls back to low-priority eviction if needed.
+ */
+ lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
if (!lret && evict_walk.hit_low) {
evict_walk.try_low = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
}
+
if (lret || !ticket)
goto out;
- /* Reset low limit */
+ /*
+ * Pass 3+ (properly locked): Evict while making progress.
+ * Reset flags and retry with try_low if we hit the low watermark.
+ */
evict_walk.try_low = evict_walk.hit_low = false;
- /* If ticket-locking, repeat while making progress. */
evict_walk.walk.arg.trylock_only = false;
retry:
do {
- /* The walk may clear the evict_walk.walk.ticket field */
evict_walk.walk.arg.ticket = ticket;
evict_walk.evicted = 0;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
} while (!lret && evict_walk.evicted);
- /* We hit the low limit? Try once more */
if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
evict_walk.try_low = true;
goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low);
+ bool try_high, bool ignore_low, bool *ret_hit_low);
void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
#else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
static inline
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
return true;
}
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..f81fbb538cf2f 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_low(&pool->cnt, val);
}
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+ page_counter_set_high(&pool->cnt, val);
+}
+
static void
set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
{
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
return pool ? READ_ONCE(pool->cnt.low) : 0;
}
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+ return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
set_resource_low(rpool, 0);
+ set_resource_high(rpool, PAGE_COUNTER_MAX);
set_resource_max(rpool, PAGE_COUNTER_MAX);
}
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
* dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
* @limit_pool: The pool for which we hit limits
* @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
* @ignore_low: Whether we have to respect low watermarks.
* @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
*
* This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
* When returning false and @ignore_low is false, @ret_hit_low may
* be set to true to indicate this function can be retried with @ignore_low
* set to true.
@@ -301,12 +316,56 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
*/
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
struct dmem_cgroup_pool_state *pool = test_pool;
struct page_counter *ctest;
u64 used, min, low;
+ ctest = &test_pool->cnt;
+ used = page_counter_read(ctest);
+
+ if (try_high) {
+ /*
+ * When the limit-hitting cgroup's own BOs are being
+ * considered, only evict them if their pool exceeds its
+ * own dmem.high limit. No ancestry check is needed
+ * because the limit was triggered by this pool itself.
+ */
+ if (limit_pool == test_pool)
+ return used > READ_ONCE(ctest->high);
+
+ {
+ struct page_counter *c;
+
+ /*
+ * Walk the page_counter parent chain to check
+ * whether any ancestor cgroup exceeds its
+ * dmem.high limit. This prevents child cgroups
+ * from evading the penalty when a parent cgroup
+ * is over its high limit.
+ */
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) >
+ READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
+ }
+
+ /*
+ * Respect dmem.min protection: do not evict BOs below the
+ * effective minimum even during the high-priority pass.
+ */
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+
+ return used > min;
+ }
+
/* Can always evict from current pool, despite limits */
if (limit_pool == test_pool)
return true;
@@ -329,11 +388,8 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
{}
}
- ctest = &test_pool->cnt;
-
dmem_cgroup_calculate_protection(limit_pool, test_pool);
- used = page_counter_read(ctest);
min = READ_ONCE(ctest->emin);
if (used <= min)
@@ -835,6 +891,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
}
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +935,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "max",
.write = dmem_cgroup_region_max_write,
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38
Best regards,
--
Qiliang Yuan <realwujing@gmail.com>
^ permalink raw reply related [flat|nested] 7+ messages in thread* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-30 7:35 [PATCH v4] " Qiliang Yuan
@ 2026-06-04 5:52 ` Claude Code Review Bot
2026-06-04 5:52 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 5:52 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
Author: Qiliang Yuan <realwujing@gmail.com>
Patches: 1
Reviewed: 2026-06-04T15:52:23.141901
---
This is a single-patch series (v4) adding a `dmem.high` soft limit to the dmem cgroup v2 controller. The concept is sound: when eviction is triggered (e.g., by a `max` limit hit), BOs from cgroups exceeding their `dmem.high` soft limit are evicted first, providing a prioritized reclaim mechanism rather than the rejected sleep-on-allocation approach from v1.
The cgroupfs plumbing (control file, get/set, initialization) follows existing patterns well. However, the TTM eviction changes have two significant bugs: Pass 2 runs with blocking lock instead of trylock (the `trylock_only` flag is never restored after Pass 1), and the `try_high` path in `dmem_cgroup_state_evict_valuable()` skips the hierarchy check between `limit_pool` and `test_pool`, meaning BOs from completely unrelated cgroups can be evicted if any of their ancestors happens to exceed its high limit.
**Verdict: Needs revision** — the trylock regression and the missing hierarchy check need to be fixed before this can be merged.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
* Claude review: cgroup/dmem: implement dmem.high soft limit via prioritized eviction
2026-05-30 7:35 [PATCH v4] " Qiliang Yuan
2026-06-04 5:52 ` Claude review: " Claude Code Review Bot
@ 2026-06-04 5:52 ` Claude Code Review Bot
1 sibling, 0 replies; 7+ messages in thread
From: Claude Code Review Bot @ 2026-06-04 5:52 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
#### Bug: Pass 2 uses blocking lock instead of trylock
In `ttm_bo_evict_alloc()`, Pass 1 sets `trylock_only = false` for the blocking high-priority pass, but this is never restored before Pass 2:
```c
+ evict_walk.walk.arg.trylock_only = false;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;
+
+ /*
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
+ */
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
```
The comment says "Pass 2 (trylock)" but `trylock_only` is still `false` from Pass 1. The original code had `trylock_only = true` here. This changes the locking behavior for the standard eviction pass — it will now take blocking locks on BOs that it shouldn't, potentially causing excessive blocking or deadlocks. A `evict_walk.walk.arg.trylock_only = true;` is needed before Pass 2.
#### Bug: try_high path skips limit_pool/test_pool hierarchy check
The existing `dmem_cgroup_state_evict_valuable()` verifies that `test_pool` is a descendant of `limit_pool` before allowing eviction:
```c
for (pool = test_pool; pool && limit_pool != pool; pool = pool_parent(pool))
{}
if (!pool)
return false;
```
The new `try_high` path bypasses this entirely. When `limit_pool != test_pool`, the code goes straight into the ancestor high-limit walk without verifying the hierarchy relationship:
```c
+ if (try_high) {
+ if (limit_pool == test_pool)
+ return used > READ_ONCE(ctest->high);
+
+ {
+ struct page_counter *c;
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) >
+ READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
+ }
```
This means if cgroup A hits its `max` limit and triggers eviction, a BO from completely unrelated cgroup B can be evicted if B (or any of B's ancestors) exceeds its high limit. This is unfair — A's over-allocation shouldn't cause eviction from an unrelated cgroup tree.
Furthermore, when the hierarchy isn't verified, the subsequent call to `dmem_cgroup_calculate_protection(limit_pool, test_pool)` iterates `limit_pool`'s descendants looking for `test_pool`, won't find it, and returns without updating `test_pool->cnt.emin`. The `used > min` check then reads a stale `emin` value.
If the intent is to only evict over-high descendants of the limit-hitting cgroup, the hierarchy check from the non-try_high path should be included. If the intent is truly global (evict any over-high BO regardless of hierarchy), that's a design decision that should be explicitly documented, and the `dmem_cgroup_calculate_protection` call needs to use the correct root for test_pool.
#### Style: Bare block for variable declaration
```c
+ {
+ struct page_counter *c;
+ ...
+ }
```
This nested scope block is unusual for kernel style. The variable `c` could be declared at the top of the `if (try_high)` block or at function scope alongside the existing local variables.
#### Commit message typo
The commit message says "above-elow" in two places — should be "above-low":
> Pass 2 falls back to the standard above-elow trylock eviction.
#### Minor: cgroupfs plumbing looks correct
The `set_resource_high()`, `get_resource_high()`, `reset_all_resource_limits()` addition, and the `cftype` file registration all follow the established pattern for `min`/`low`/`max` correctly. The `get_resource_high(NULL)` returning `PAGE_COUNTER_MAX` (matching `get_resource_max()` behavior) is the right default.
#### Missing: No documentation update
The kernel typically documents cgroup control files in `Documentation/admin-guide/cgroup-v2.rst`. A new user-facing control file like `dmem.high` should have corresponding documentation describing its semantics, default value, and interaction with `dmem.max`, `dmem.low`, and `dmem.min`.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-06-04 5:52 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-31 8:45 [PATCH v5] cgroup/dmem: implement dmem.high soft limit via prioritized eviction Qiliang Yuan
2026-06-04 5:03 ` Claude review: " Claude Code Review Bot
2026-06-04 5:03 ` Claude Code Review Bot
-- strict thread matches above, loose matches on Subject: below --
2026-05-31 9:52 [PATCH v6] " Qiliang Yuan
2026-06-04 4:56 ` Claude review: " Claude Code Review Bot
2026-06-04 4:56 ` Claude Code Review Bot
2026-05-30 7:35 [PATCH v4] " Qiliang Yuan
2026-06-04 5:52 ` Claude review: " Claude Code Review Bot
2026-06-04 5:52 ` Claude Code Review Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox