* [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
@ 2026-05-20 6:07 Qiliang Yuan
2026-05-20 9:52 ` Tejun Heo
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Qiliang Yuan @ 2026-05-20 6:07 UTC (permalink / raw)
To: Maarten Lankhorst, Maxime Ripard, Natalie Vock, Tejun Heo,
Johannes Weiner, Michal Koutný
Cc: cgroups, dri-devel, linux-kernel, Qiliang Yuan
Introduce the "high" soft limit for the dmem cgroup v2 controller.
When a cgroup's device memory usage exceeds its high limit, tasks
belonging to that cgroup are throttled by being forced into a sleep
before returning to user space, instead of being failed outright
as with the "max" limit.
Key changes:
- Add high counter configuration to dmem_cgroup_pool.
- Add over-high check in the try_charge path and set TIF_NOTIFY_RESUME.
- Inject the dmem throttling handler into resume_user_mode_work.
- Implement the handler to perform a 100ms interruptible sleep for
over-limit tasks.
This mechanism provides smoother over-subscription support for device
memory resources.
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
This series introduces the "high" soft limit and associated task
throttling mechanism to the dmem cgroup v2 controller.
The device memory (VRAM) management currently only supports hard limits
(max), which leads to immediate allocation failures when reached. This
can be disruptive for GPU-bound AI workloads. By introducing a soft
limit, we allow cgroups to exceed their quota temporarily while
applying backpressure via task throttling before the process returns
to user space.
The mechanism is inspired by the memory cgroup's high limit:
- When usage > high, the task is marked with TIF_NOTIFY_RESUME.
- Upon returning to user space, it triggers a 100ms sleep.
- This provides a smoother over-subscription model for GPU resources.
Qiliang Yuan (1):
cgroup/dmem: implement dmem.high soft limit and throttling
---
To: Maarten Lankhorst <dev@lankhorst.se>
To: Maxime Ripard <mripard@kernel.org>
To: Natalie Vock <natalie.vock@gmx.de>
To: Tejun Heo <tj@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Koutný <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
---
include/linux/cgroup_dmem.h | 10 +++++++
include/linux/resume_user_mode.h | 2 ++
kernel/cgroup/dmem.c | 60 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..d58972de7c910 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -21,6 +21,13 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
struct dmem_cgroup_pool_state **ret_pool,
struct dmem_cgroup_pool_state **ret_limit_pool);
void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
+void __dmem_cgroup_handle_over_high(void);
+
+static inline void dmem_cgroup_handle_over_high(void)
+{
+ __dmem_cgroup_handle_over_high();
+}
+
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
bool ignore_low, bool *ret_hit_low);
@@ -51,6 +58,9 @@ static inline int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64
static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size)
{ }
+static inline void dmem_cgroup_handle_over_high(void)
+{ }
+
static inline
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index bf92227c78d0d..afcab20998c41 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -8,6 +8,7 @@
#include <linux/memcontrol.h>
#include <linux/rseq.h>
#include <linux/blk-cgroup.h>
+#include <linux/cgroup_dmem.h>
/**
* set_notify_resume - cause resume_user_mode_work() to be called
@@ -58,6 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
+ dmem_cgroup_handle_over_high();
rseq_handle_slowpath(regs);
}
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..f77c692b887b1 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -15,6 +15,7 @@
#include <linux/page_counter.h>
#include <linux/parser.h>
#include <linux/refcount.h>
+#include <linux/resume_user_mode.h>
#include <linux/rculist.h>
#include <linux/slab.h>
@@ -156,6 +157,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_low(&pool->cnt, val);
}
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+ page_counter_set_high(&pool->cnt, val);
+}
+
static void
set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
{
@@ -167,6 +174,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
return pool ? READ_ONCE(pool->cnt.low) : 0;
}
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+ return pool ? READ_ONCE(pool->cnt.high) : 0;
+}
+
static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +198,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
set_resource_low(rpool, 0);
+ set_resource_high(rpool, PAGE_COUNTER_MAX);
set_resource_max(rpool, PAGE_COUNTER_MAX);
}
@@ -685,6 +698,9 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
goto err;
}
+ if (page_counter_read(&pool->cnt) > READ_ONCE(pool->cnt.high))
+ set_notify_resume(current);
+
/* On success, reference from get_current_dmemcs is transferred to *ret_pool */
*ret_pool = pool;
return 0;
@@ -835,13 +851,24 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
}
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_max);
}
static ssize_t dmem_cgroup_region_max_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
+ char *buf, size_t nbytes, loff_t off)
{
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_max);
}
@@ -868,6 +895,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "max",
.write = dmem_cgroup_region_max_write,
@@ -877,6 +910,31 @@ static struct cftype files[] = {
{ } /* Zero entry terminates. */
};
+void __dmem_cgroup_handle_over_high(void)
+{
+ struct dmemcg_state *dmemcs;
+ struct dmem_cgroup_pool_state *pool;
+
+ dmemcs = css_to_dmemcs(task_get_css(current, dmem_cgrp_id));
+ if (!dmemcs)
+ return;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &dmemcs->pools, css_node) {
+ unsigned long usage, high;
+
+ usage = page_counter_read(&pool->cnt);
+ high = READ_ONCE(pool->cnt.high);
+
+ if (usage > high)
+ schedule_timeout_killable(HZ / 10);
+ }
+ rcu_read_unlock();
+
+ css_put(&dmemcs->css);
+}
+EXPORT_SYMBOL_GPL(__dmem_cgroup_handle_over_high);
+
struct cgroup_subsys dmem_cgrp_subsys = {
.css_alloc = dmemcs_alloc,
.css_free = dmemcs_free,
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38
Best regards,
--
Qiliang Yuan <realwujing@gmail.com>
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
@ 2026-05-20 9:52 ` Tejun Heo
2026-05-21 11:28 ` Qiliang Yuan
2026-05-21 9:45 ` Maarten Lankhorst
` (3 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2026-05-20 9:52 UTC (permalink / raw)
To: Qiliang Yuan
Cc: Maarten Lankhorst, Maxime Ripard, Natalie Vock, Johannes Weiner,
Michal Koutný, cgroups, dri-devel, linux-kernel
Hello,
On Wed, May 20, 2026 at 02:07:20PM +0800, Qiliang Yuan wrote:
...
> This series introduces the "high" soft limit and associated task
> throttling mechanism to the dmem cgroup v2 controller.
>
> The device memory (VRAM) management currently only supports hard limits
> (max), which leads to immediate allocation failures when reached. This
> can be disruptive for GPU-bound AI workloads. By introducing a soft
> limit, we allow cgroups to exceed their quota temporarily while
> applying backpressure via task throttling before the process returns
> to user space.
>
> The mechanism is inspired by the memory cgroup's high limit:
> - When usage > high, the task is marked with TIF_NOTIFY_RESUME.
> - Upon returning to user space, it triggers a 100ms sleep.
> - This provides a smoother over-subscription model for GPU resources.
I'm not sure about complicating dmem control model without implementing
reclaim. What are we slowing them down for if the only recovery action is
killing them?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
2026-05-20 9:52 ` Tejun Heo
@ 2026-05-21 9:45 ` Maarten Lankhorst
2026-05-21 11:28 ` Qiliang Yuan
2026-05-21 10:52 ` Natalie Vock
` (2 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Maarten Lankhorst @ 2026-05-21 9:45 UTC (permalink / raw)
To: Qiliang Yuan, Maxime Ripard, Natalie Vock, Tejun Heo,
Johannes Weiner, Michal Koutný
Cc: cgroups, dri-devel, linux-kernel
Hello Qiliang,
Den 2026-05-20 kl. 08:07, skrev Qiliang Yuan:
> Introduce the "high" soft limit for the dmem cgroup v2 controller.
> When a cgroup's device memory usage exceeds its high limit, tasks
> belonging to that cgroup are throttled by being forced into a sleep
> before returning to user space, instead of being failed outright
> as with the "max" limit.
>
> Key changes:
> - Add high counter configuration to dmem_cgroup_pool.
> - Add over-high check in the try_charge path and set TIF_NOTIFY_RESUME.
> - Inject the dmem throttling handler into resume_user_mode_work.
> - Implement the handler to perform a 100ms interruptible sleep for
> over-limit tasks.
>
> This mechanism provides smoother over-subscription support for device
> memory resources.
>
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
> This series introduces the "high" soft limit and associated task
> throttling mechanism to the dmem cgroup v2 controller.
>
> The device memory (VRAM) management currently only supports hard limits
> (max), which leads to immediate allocation failures when reached. This
> can be disruptive for GPU-bound AI workloads. By introducing a soft
> limit, we allow cgroups to exceed their quota temporarily while
> applying backpressure via task throttling before the process returns
> to user space.
>
> The mechanism is inspired by the memory cgroup's high limit:
> - When usage > high, the task is marked with TIF_NOTIFY_RESUME.
> - Upon returning to user space, it triggers a 100ms sleep.
> - This provides a smoother over-subscription model for GPU resources.
>
> Qiliang Yuan (1):
>
> cgroup/dmem: implement dmem.high soft limit and throttling
> ---
> To: Maarten Lankhorst <dev@lankhorst.se>
> To: Maxime Ripard <mripard@kernel.org>
> To: Natalie Vock <natalie.vock@gmx.de>
> To: Tejun Heo <tj@kernel.org>
> To: Johannes Weiner <hannes@cmpxchg.org>
> To: Michal Koutný <mkoutny@suse.com>
> Cc: cgroups@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> ---
I think the concept of allowing userspace to throttle on high
is interesting.
It's the approach I'm more worried about. I believe that it's
better if we punish exceeding their high limit by preferentially
evicting those.
It would make eviction run in 3 passes on the affected cgroup tree:
- Round 1: Clients above their 'high' limit
- Round 2: Clients above their 'low/min' limits
- Round 3: Clients at or below their 'low' limit
And the same client's cgroup, below 'min' limit as well.
I'm open for other ideas as well. Perhaps a flag that would allow
allocation or binding to an address space to fail if it would need
to evict, or a notification sent to the affected client that they
went over high.
Have you tried any other approaches before this one?
Kind regards,
~Maarten Lankhorst
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
2026-05-20 9:52 ` Tejun Heo
2026-05-21 9:45 ` Maarten Lankhorst
@ 2026-05-21 10:52 ` Natalie Vock
2026-05-21 11:28 ` Qiliang Yuan
2026-05-25 12:14 ` Claude review: " Claude Code Review Bot
2026-05-25 12:14 ` Claude Code Review Bot
4 siblings, 1 reply; 9+ messages in thread
From: Natalie Vock @ 2026-05-21 10:52 UTC (permalink / raw)
To: Qiliang Yuan, Maarten Lankhorst, Maxime Ripard, Tejun Heo,
Johannes Weiner, Michal Koutný
Cc: cgroups, dri-devel, linux-kernel
On 5/20/26 08:07, Qiliang Yuan wrote:
> Introduce the "high" soft limit for the dmem cgroup v2 controller.
> When a cgroup's device memory usage exceeds its high limit, tasks
> belonging to that cgroup are throttled by being forced into a sleep
> before returning to user space, instead of being failed outright
> as with the "max" limit.
>
> Key changes:
> - Add high counter configuration to dmem_cgroup_pool.
> - Add over-high check in the try_charge path and set TIF_NOTIFY_RESUME.
> - Inject the dmem throttling handler into resume_user_mode_work.
> - Implement the handler to perform a 100ms interruptible sleep for
> over-limit tasks.
Interesting proposal, but inserting sleeps on allocation is never a good
idea and doesn't work like you might think it does. In graphics driver
land, lots of random things may result in buffer allocation functions
being called. Whenever TTM determines some buffer needs to be physically
moved (most often during VRAM contention, but also as a result of
pinning buffers for scanout, etc etc), dmem cgroup pools are
charged/uncharged in accordance with the change in buffer residency.
Sleeping in a charge/uncharge path means that in the worst case, a task
will be put to sleep over and over again for exceeding its high limit
just once.
Most critically, submit ioctls typically go over the task's entire
working set and call ttm_bo_validate() to make sure the buffer is
accessible by the GPU, since paging things in on fault is not available
in many consumer GPUs. Your approach could lead to every single
submission sleeping for at least 100ms, thus permanently destroying
performance.
Maarten's suggestion of preferentially evicting memory that is over the
high limit sounds like a better approach.
(Also, did you use AI for this? Please disclose your AI usage as per
kernel guidelines if so.)
Best,
Natalie
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-21 10:52 ` Natalie Vock
@ 2026-05-21 11:28 ` Qiliang Yuan
0 siblings, 0 replies; 9+ messages in thread
From: Qiliang Yuan @ 2026-05-21 11:28 UTC (permalink / raw)
To: natalie.vock
Cc: dev, mripard, tj, hannes, mkoutny, cgroups, dri-devel,
linux-kernel
Hi Natalie,
On Thu, May 21, 2026 at 10:52 AM Natalie Vock wrote:
> Interesting proposal, but inserting sleeps on allocation is never a good
> idea and doesn't work like you might think it does. In graphics driver
> land, lots of random things may result in buffer allocation functions
> being called.
[...]
> Your approach could lead to every single
> submission sleeping for at least 100ms, thus permanently destroying
> performance.
Thank you very much for the detailed explanation of the impact on TTM and
Submit IOCTLs. You are absolutely right—injecting sleeps into the charge
path, which is hit frequently during buffer validation and residency changes,
would indeed be catastrophic for GPU performance.
> Maarten's suggestion of preferentially evicting memory that is over the
> high limit sounds like a better approach.
I agree. Blocking the submission pipeline is not the right way to apply
backpressure. I will abandon the current sleep-on-allocation approach and
focus on implemented prioritized eviction as you and Maarten suggested.
This ensures that reaching the "high" limit triggers a meaningful reclaim
action rather than just stalling the GPU pipeline.
Best regards,
Qiliang
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-21 9:45 ` Maarten Lankhorst
@ 2026-05-21 11:28 ` Qiliang Yuan
0 siblings, 0 replies; 9+ messages in thread
From: Qiliang Yuan @ 2026-05-21 11:28 UTC (permalink / raw)
To: dev
Cc: cgroups, dri-devel, hannes, linux-kernel, mkoutny, mripard,
natalie.vock, tj
Hello Maarten,
On Thu, May 21, 2026 at 09:45 AM Maarten Lankhorst wrote:
> It's the approach I'm more worried about. I believe that it's
> better if we punish exceeding their high limit by preferentially
> evicting those.
>
> It would make eviction run in 3 passes on the affected cgroup tree:
> - Round 1: Clients above their 'high' limit
> - Round 2: Clients above their 'low/min' limits
> - Round 3: Clients at or below their 'low' limit
Thank you for this concrete suggestion. This 3-pass eviction model is
exactly what's needed to make the dmem soft limit effective.
It addresses the core problem of providing a viable "recovery action" when
the limit is reached. By integrating these thresholds directly into the
TTM/dmem eviction weight calculation, we can achieve a more natural
over-subscription model.
I will rework the series for v2 to incorporate this hierarchy-aware
eviction logic.
Kind regards,
~Qiliang
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 9:52 ` Tejun Heo
@ 2026-05-21 11:28 ` Qiliang Yuan
0 siblings, 0 replies; 9+ messages in thread
From: Qiliang Yuan @ 2026-05-21 11:28 UTC (permalink / raw)
To: tj
Cc: dev, mripard, natalie.vock, hannes, mkoutny, cgroups, dri-devel,
linux-kernel
Hello Tejun,
On Wed, May 20, 2026 at 09:52 AM Tejun Heo wrote:
> I'm not sure about complicating dmem control model without implementing
> reclaim. What are we slowing them down for if the only recovery action is
> killing them?
Thank you for the feedback. Your point about the lack of a reclaim path
is well-taken. Simple throttling without a way to recover resources is
indeed incomplete and inconsistent with the cgroup v2 philosophy.
To address this from several perspectives in v2:
1. Recovery Path: As suggested by Maarten Lankhorst, we will pivot to a
reclaim-centric model. Exceeding `dmem.high` will trigger a prioritized
eviction process, where memory objects from over-limit cgroups are
targeted first for reclaim. This provides the meaningful "recovery action"
you mentioned.
2. Backpressure: Throttling will then serve as a secondary tool to
synchronize user-space demand with the kernel's reclaim speed, preventing
bursty workloads from overwhelming the system before reclaim can finish.
3. Graceful Degradation: For GPU compute jobs, this model provides a
managed "pressure point" that allows transient peaks to be handled via
rebalancing rather than immediate, fatal allocation failures (max/OOM).
The goal for v2 is to achieve convergence with the `memory.high` model,
pairing prioritized reclaim with backpressure.
Thanks,
Qiliang
^ permalink raw reply [flat|nested] 9+ messages in thread
* Claude review: cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
` (3 preceding siblings ...)
2026-05-25 12:14 ` Claude review: " Claude Code Review Bot
@ 2026-05-25 12:14 ` Claude Code Review Bot
4 siblings, 0 replies; 9+ messages in thread
From: Claude Code Review Bot @ 2026-05-25 12:14 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: cgroup/dmem: implement dmem.high soft limit and throttling
Author: Qiliang Yuan <realwujing@gmail.com>
Patches: 7
Reviewed: 2026-05-25T22:14:45.117248
---
This single patch adds a `dmem.high` soft limit to the dmem cgroup controller, modeled after the memory cgroup's `memory.high`. The idea is sound — throttling via return-to-userspace sleep instead of outright allocation failure provides better backpressure for GPU memory. However, the implementation has a **critical sleeping-inside-RCU bug** that will crash the kernel, a **wrong default for the high limit getter**, and several design-level gaps compared to the memcg model it claims to follow (no fast-path guard, no hierarchy walk for high checking). As written, this patch cannot be applied.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 9+ messages in thread
* Claude review: cgroup/dmem: implement dmem.high soft limit and throttling
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
` (2 preceding siblings ...)
2026-05-21 10:52 ` Natalie Vock
@ 2026-05-25 12:14 ` Claude Code Review Bot
2026-05-25 12:14 ` Claude Code Review Bot
4 siblings, 0 replies; 9+ messages in thread
From: Claude Code Review Bot @ 2026-05-25 12:14 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Critical: Sleeping inside RCU read-side critical section**
The `__dmem_cgroup_handle_over_high()` function calls `schedule_timeout_killable()` while holding `rcu_read_lock()`. This is illegal — RCU read-side critical sections must not sleep, and this will trigger a `BUG: sleeping function called from invalid context` or similar splat on any kernel with debug options enabled, and cause RCU stalls on production kernels.
```c
rcu_read_lock();
list_for_each_entry_rcu(pool, &dmemcs->pools, css_node) {
unsigned long usage, high;
usage = page_counter_read(&pool->cnt);
high = READ_ONCE(pool->cnt.high);
if (usage > high)
schedule_timeout_killable(HZ / 10);
}
rcu_read_unlock();
```
You need to either collect the over-high state under RCU and sleep after `rcu_read_unlock()`, or use a different locking strategy. Look at how `mem_cgroup_handle_over_high` structures its work — it does not hold RCU across the sleep.
**Bug: `get_resource_high` returns 0 for NULL pool**
```c
static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.high) : 0;
}
```
When a cgroup has no pool for a particular region, the default should be `PAGE_COUNTER_MAX` (no limit), not `0`. Returning 0 means `dmem.high` will show `0` for regions where no pool exists, which is semantically wrong. Compare with `get_resource_max()` which correctly returns `PAGE_COUNTER_MAX` for NULL pools.
**Design issue: No fast-path guard in the inline wrapper**
The memcg version guards the expensive path with a per-task flag:
```c
static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
if (unlikely(current->memcg_nr_pages_over_high))
__mem_cgroup_handle_over_high(gfp_mask);
}
```
This patch's inline wrapper is unconditional:
```c
static inline void dmem_cgroup_handle_over_high(void)
{
__dmem_cgroup_handle_over_high();
}
```
This means **every** return-to-userspace path will call `task_get_css()` + iterate pools + `css_put()`, even for tasks that have never touched device memory. This is a hot path — it should have a cheap early-exit. Consider adding a per-task flag (similar to `memcg_nr_pages_over_high`) or at minimum a static key that is only enabled when any dmem region is registered.
**Design issue: High limit check does not walk the hierarchy**
In `dmem_cgroup_try_charge`, the high check only looks at the leaf pool:
```c
if (page_counter_read(&pool->cnt) > READ_ONCE(pool->cnt.high))
set_notify_resume(current);
```
If a parent cgroup sets a high limit but the child does not, the child's `pool->cnt.high` remains at `PAGE_COUNTER_MAX` and the parent's limit is never checked. The memcg implementation walks the hierarchy for high limit enforcement. You should walk from the leaf to the root, checking each ancestor's `high` against its own `usage`.
**Nit: Gratuitous whitespace change**
The patch re-aligns `dmem_cgroup_region_max_write` parameter indentation without any functional change:
```c
static ssize_t dmem_cgroup_region_max_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
+ char *buf, size_t nbytes, loff_t off)
```
This should be dropped — unrelated formatting changes in functional patches add review noise.
**Minor: EXPORT_SYMBOL_GPL may be unnecessary**
`__dmem_cgroup_handle_over_high` is called only from the inline in the header, which is called from `resume_user_mode_work()` in arch entry code — always built-in, never a module. The export appears unnecessary.
**Missing: No event counter or statistics**
The memcg high limit has associated event counters (`MEMCG_HIGH`) that can be observed via `memory.events`. Without analogous accounting, users have no way to observe how often throttling fires, making the feature difficult to tune in practice. This isn't a blocker for v1, but worth considering.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-05-25 12:14 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20 6:07 [PATCH] cgroup/dmem: implement dmem.high soft limit and throttling Qiliang Yuan
2026-05-20 9:52 ` Tejun Heo
2026-05-21 11:28 ` Qiliang Yuan
2026-05-21 9:45 ` Maarten Lankhorst
2026-05-21 11:28 ` Qiliang Yuan
2026-05-21 10:52 ` Natalie Vock
2026-05-21 11:28 ` Qiliang Yuan
2026-05-25 12:14 ` Claude review: " Claude Code Review Bot
2026-05-25 12:14 ` Claude Code Review Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox