* [PATCH 0/2] cgroup/dmem: introduce a peak file
@ 2026-05-06 11:58 Thadeu Lima de Souza Cascardo
2026-05-06 11:58 ` [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write Thadeu Lima de Souza Cascardo
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2026-05-06 11:58 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Jonathan Corbet, Shuah Khan, Maarten Lankhorst, Maxime Ripard,
Natalie Vock, Tvrtko Ursulin
Cc: cgroups, linux-kernel, linux-mm, linux-doc, dri-devel,
Thadeu Lima de Souza Cascardo, kernel-dev
Just like we have memory.peak, introduce a dmem.peak, which uses the
page_counter support for that.
It can be written to in order to reset the peak, but different from
memory.peak, which expects any write, dmem.peak expects the region name to
be written to it. That region peak is the one that is reset.
That requires ofp_peak to carry a pointer to the pool that was reset.
Writing a different region name will reset the different region and make
the original region peak get back to its non-reset value.
While at it, we reuse a helper from memcontrol, which we moved to
kernel/cgroup/cgroup.c.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
Thadeu Lima de Souza Cascardo (2):
mm/page_counter: decouple peak_reset from peak_write
cgroup/dmem: introduce a peak file
Documentation/admin-guide/cgroup-v2.rst | 10 +++
include/linux/cgroup-defs.h | 7 ++
kernel/cgroup/cgroup.c | 32 ++++++++
kernel/cgroup/dmem.c | 132 ++++++++++++++++++++++++++++++--
mm/memcontrol.c | 42 ++--------
5 files changed, 183 insertions(+), 40 deletions(-)
---
base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
change-id: 20260409-dmem_peak-3abc1be95072
Best regards,
--
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write
2026-05-06 11:58 [PATCH 0/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
@ 2026-05-06 11:58 ` Thadeu Lima de Souza Cascardo
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
2026-05-06 11:58 ` [PATCH 2/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2026-05-06 11:58 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Jonathan Corbet, Shuah Khan, Maarten Lankhorst, Maxime Ripard,
Natalie Vock, Tvrtko Ursulin
Cc: cgroups, linux-kernel, linux-mm, linux-doc, dri-devel,
Thadeu Lima de Souza Cascardo, kernel-dev
Create a new function of_peak_reset that resets the page_counter peak for a
given writer. This should allow it to be reused by other cgroups.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
include/linux/cgroup-defs.h | 6 ++++++
kernel/cgroup/cgroup.c | 32 ++++++++++++++++++++++++++++++++
mm/memcontrol.c | 42 ++++++++----------------------------------
3 files changed, 46 insertions(+), 34 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f42563739d2e..a85044cb0553 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -22,6 +22,7 @@
#include <linux/workqueue.h>
#include <linux/bpf-cgroup-defs.h>
#include <linux/psi_types.h>
+#include <linux/page_counter.h>
#ifdef CONFIG_CGROUPS
@@ -868,11 +869,16 @@ struct cgroup_subsys {
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
extern bool cgroup_enable_per_threadgroup_rwsem;
+#define OFP_PEAK_UNSET (((-1UL)))
+
struct cgroup_of_peak {
unsigned long value;
struct list_head list;
};
+void of_peak_reset(struct cgroup_of_peak *ofp, struct page_counter *pc,
+ struct list_head *watchers);
+
/**
* cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
* @tsk: target task
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 45c0b1ed687a..9b98a5cccf0e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1981,6 +1981,38 @@ struct cgroup_of_peak *of_peak(struct kernfs_open_file *of)
return &ctx->peak;
}
+/**
+ * of_peak_reset - reset peak
+ * @ofp: open file context
+ * @pc: counter
+ * @watchers: list of other open file contexts
+ *
+ * This function updates all contexts in @watchers to the new usage of @pc.
+ * If @ofp is not in the list yet, that is, if its value is
+ * %OFP_PEAK_UNSET, it is added to @watchers list.
+ *
+ * A lock must be used to protect @watchers.
+ */
+void of_peak_reset(struct cgroup_of_peak *ofp, struct page_counter *pc,
+ struct list_head *watchers)
+{
+ unsigned long usage;
+ struct cgroup_of_peak *peer_ctx;
+
+ usage = page_counter_read(pc);
+ WRITE_ONCE(pc->local_watermark, usage);
+
+ list_for_each_entry(peer_ctx, watchers, list)
+ if (usage > peer_ctx->value)
+ WRITE_ONCE(peer_ctx->value, usage);
+
+ /* initial write, register watcher */
+ if (ofp->value == OFP_PEAK_UNSET)
+ list_add(&ofp->list, watchers);
+
+ WRITE_ONCE(ofp->value, usage);
+}
+
static void apply_cgroup_root_flags(unsigned int root_flags)
{
if (current->nsproxy->cgroup_ns == &init_cgroup_ns) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..8754927070d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4517,8 +4517,6 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
}
-#define OFP_PEAK_UNSET (((-1UL)))
-
static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc)
{
struct cgroup_of_peak *ofp = of_peak(sf->private);
@@ -4563,45 +4561,18 @@ static void peak_release(struct kernfs_open_file *of)
spin_unlock(&memcg->peaks_lock);
}
-static ssize_t peak_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
- loff_t off, struct page_counter *pc,
- struct list_head *watchers)
+static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
{
- unsigned long usage;
- struct cgroup_of_peak *peer_ctx;
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
struct cgroup_of_peak *ofp = of_peak(of);
spin_lock(&memcg->peaks_lock);
-
- usage = page_counter_read(pc);
- WRITE_ONCE(pc->local_watermark, usage);
-
- list_for_each_entry(peer_ctx, watchers, list)
- if (usage > peer_ctx->value)
- WRITE_ONCE(peer_ctx->value, usage);
-
- /* initial write, register watcher */
- if (ofp->value == OFP_PEAK_UNSET)
- list_add(&ofp->list, watchers);
-
- WRITE_ONCE(ofp->value, usage);
+ of_peak_reset(ofp, &memcg->memory, &memcg->memory_peaks);
spin_unlock(&memcg->peaks_lock);
-
return nbytes;
}
-static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf,
- size_t nbytes, loff_t off)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-
- return peak_write(of, buf, nbytes, off, &memcg->memory,
- &memcg->memory_peaks);
-}
-
-#undef OFP_PEAK_UNSET
-
static int memory_min_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -5611,9 +5582,12 @@ static ssize_t swap_peak_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct cgroup_of_peak *ofp = of_peak(of);
- return peak_write(of, buf, nbytes, off, &memcg->swap,
- &memcg->swap_peaks);
+ spin_lock(&memcg->peaks_lock);
+ of_peak_reset(ofp, &memcg->swap, &memcg->swap_peaks);
+ spin_unlock(&memcg->peaks_lock);
+ return nbytes;
}
static int swap_high_show(struct seq_file *m, void *v)
--
2.47.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] cgroup/dmem: introduce a peak file
2026-05-06 11:58 [PATCH 0/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
2026-05-06 11:58 ` [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write Thadeu Lima de Souza Cascardo
@ 2026-05-06 11:58 ` Thadeu Lima de Souza Cascardo
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
2026-05-06 13:53 ` [PATCH 0/2] " Michal Koutný
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
3 siblings, 1 reply; 9+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2026-05-06 11:58 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Jonathan Corbet, Shuah Khan, Maarten Lankhorst, Maxime Ripard,
Natalie Vock, Tvrtko Ursulin
Cc: cgroups, linux-kernel, linux-mm, linux-doc, dri-devel,
Thadeu Lima de Souza Cascardo, kernel-dev
Just like we have memory.peak, introduce a dmem.peak, which uses the
page_counter support for that.
It can be written to in order to reset the peak, but different from
memory.peak, which expects any write, dmem.peak expects the region name to
be written to it. That region peak is the one that is reset.
That requires ofp_peak to carry a pointer to the pool that was reset.
Writing a different region name will reset the different region and make
the original region peak get back to its non-reset value.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
Documentation/admin-guide/cgroup-v2.rst | 10 +++
include/linux/cgroup-defs.h | 1 +
kernel/cgroup/dmem.c | 132 ++++++++++++++++++++++++++++++--
3 files changed, 137 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..3ba7ab3a36b3 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2808,6 +2808,16 @@ DMEM Interface Files
The semantics are the same as for the memory cgroup controller, and are
calculated in the same way.
+ dmem.peak
+ A readwrite nested-keyed file that exists on non-root cgroups.
+
+ The max memory usage recorded for the cgroup and its descendants since
+ either the creation of the cgroup or the most recent reset for that FD.
+
+ A write of a region name to this file resets it to the current memory
+ usage for subsequent reads through the same file descriptor for that
+ region.
+
dmem.capacity
A read-only file that describes maximum region capacity.
It only exists on the root cgroup. Not all memory can be
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index a85044cb0553..b536054bd916 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -874,6 +874,7 @@ extern bool cgroup_enable_per_threadgroup_rwsem;
struct cgroup_of_peak {
unsigned long value;
struct list_head list;
+ struct dmem_cgroup_pool_state *pool;
};
void of_peak_reset(struct cgroup_of_peak *ofp, struct page_counter *pc,
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ab1fb47f271..afa380c9839b 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -57,6 +57,9 @@ struct dmemcg_state {
struct cgroup_subsys_state css;
struct list_head pools;
+
+ /** @peaks_lock: Protects access to the pools' peaks lists */
+ spinlock_t peaks_lock;
};
struct dmem_cgroup_pool_state {
@@ -72,6 +75,10 @@ struct dmem_cgroup_pool_state {
struct rcu_head rcu;
struct page_counter cnt;
+
+ /* Protected by the dmemcg_state peaks_lock */
+ struct list_head peaks;
+
struct dmem_cgroup_pool_state *parent;
refcount_t ref;
@@ -162,26 +169,45 @@ set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_max(&pool->cnt, val);
}
-static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
+static u64 get_resource_low(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.low) : 0;
}
-static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
+static u64 get_resource_min(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
}
-static u64 get_resource_max(struct dmem_cgroup_pool_state *pool)
+static u64 get_resource_max(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.max) : PAGE_COUNTER_MAX;
}
-static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
+static u64 get_resource_current(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
{
return pool ? page_counter_read(&pool->cnt) : 0;
}
+static u64 get_resource_peak(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
+{
+ struct cgroup_of_peak *ofp = of_peak(sf->private);
+ u64 fd_peak, peak;
+ struct dmem_cgroup_pool_state *of_pool;
+
+ if (!pool)
+ return 0;
+
+ of_pool = READ_ONCE(ofp->pool);
+
+ fd_peak = READ_ONCE(ofp->value);
+ if (of_pool != pool || fd_peak == OFP_PEAK_UNSET)
+ peak = pool->cnt.watermark;
+ else
+ peak = max(fd_peak, READ_ONCE(pool->cnt.local_watermark));
+ return peak;
+}
+
static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
@@ -227,6 +253,7 @@ dmemcs_alloc(struct cgroup_subsys_state *parent_css)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&dmemcs->pools);
+ spin_lock_init(&dmemcs->peaks_lock);
return &dmemcs->css;
}
@@ -377,6 +404,7 @@ alloc_pool_single(struct dmemcg_state *dmemcs, struct dmem_cgroup_region *region
ppool ? &ppool->cnt : NULL, true);
reset_all_resource_limits(pool);
refcount_set(&pool->ref, 1);
+ INIT_LIST_HEAD(&pool->peaks);
kref_get(®ion->ref);
if (ppool && !pool->parent) {
pool->parent = ppool;
@@ -784,7 +812,7 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
}
static int dmemcg_limit_show(struct seq_file *sf, void *v,
- u64 (*fn)(struct dmem_cgroup_pool_state *))
+ u64 (*fn)(struct seq_file *, struct dmem_cgroup_pool_state *))
{
struct dmemcg_state *dmemcs = css_to_dmemcs(seq_css(sf));
struct dmem_cgroup_region *region;
@@ -796,7 +824,7 @@ static int dmemcg_limit_show(struct seq_file *sf, void *v,
seq_puts(sf, region->name);
- val = fn(pool);
+ val = fn(sf, pool);
if (val < PAGE_COUNTER_MAX)
seq_printf(sf, " %lld\n", val);
else
@@ -807,6 +835,90 @@ static int dmemcg_limit_show(struct seq_file *sf, void *v,
return 0;
}
+static int dmem_cgroup_region_peak_open(struct kernfs_open_file *of)
+{
+ struct cgroup_of_peak *ofp = of_peak(of);
+
+ ofp->value = OFP_PEAK_UNSET;
+
+ return 0;
+}
+
+static void dmem_cgroup_region_peak_remove(struct cgroup_of_peak *ofp)
+{
+ struct dmem_cgroup_pool_state *pool;
+ struct dmemcg_state *dmemcs;
+
+ pool = xchg(&ofp->pool, NULL);
+ if (!pool)
+ return;
+
+ dmemcs = pool->cs;
+
+ spin_lock(&dmemcs->peaks_lock);
+ list_del(&ofp->list);
+ spin_unlock(&dmemcs->peaks_lock);
+
+ WRITE_ONCE(ofp->value, OFP_PEAK_UNSET);
+
+ dmemcg_pool_put(pool);
+}
+
+static void dmem_cgroup_region_peak_release(struct kernfs_open_file *of)
+{
+ struct cgroup_of_peak *ofp = of_peak(of);
+
+ if (ofp->value == OFP_PEAK_UNSET) {
+ /* fast path (no writes on this fd) */
+ return;
+ }
+
+ dmem_cgroup_region_peak_remove(ofp);
+}
+
+static ssize_t dmem_cgroup_region_peak_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
+ struct cgroup_of_peak *ofp = of_peak(of);
+ struct dmem_cgroup_pool_state *pool = NULL;
+ struct dmem_cgroup_region *region;
+ int err = 0;
+
+ buf = strstrip(buf);
+ if (!buf[0])
+ return -EINVAL;
+
+ rcu_read_lock();
+ region = dmemcg_get_region_by_name(buf);
+ rcu_read_unlock();
+
+ if (!region)
+ return -EINVAL;
+
+ pool = get_cg_pool_unlocked(dmemcs, region);
+ if (IS_ERR(pool)) {
+ err = PTR_ERR(pool);
+ goto out_put;
+ }
+
+ dmem_cgroup_region_peak_remove(ofp);
+
+ xchg(&ofp->pool, pool);
+ spin_lock(&dmemcs->peaks_lock);
+ of_peak_reset(ofp, &pool->cnt, &pool->peaks);
+ spin_unlock(&dmemcs->peaks_lock);
+
+out_put:
+ kref_put(®ion->ref, dmemcg_free_region);
+ return err ?: nbytes;
+}
+
+static int dmem_cgroup_region_peak_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_peak);
+}
+
static int dmem_cgroup_region_current_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_current);
@@ -855,6 +967,14 @@ static struct cftype files[] = {
.name = "current",
.seq_show = dmem_cgroup_region_current_show,
},
+ {
+ .name = "peak",
+ .open = dmem_cgroup_region_peak_open,
+ .release = dmem_cgroup_region_peak_release,
+ .write = dmem_cgroup_region_peak_write,
+ .seq_show = dmem_cgroup_region_peak_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "min",
.write = dmem_cgroup_region_min_write,
--
2.47.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] cgroup/dmem: introduce a peak file
2026-05-06 11:58 [PATCH 0/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
2026-05-06 11:58 ` [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write Thadeu Lima de Souza Cascardo
2026-05-06 11:58 ` [PATCH 2/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
@ 2026-05-06 13:53 ` Michal Koutný
2026-05-06 14:18 ` Thadeu Lima de Souza Cascardo
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
3 siblings, 1 reply; 9+ messages in thread
From: Michal Koutný @ 2026-05-06 13:53 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Jonathan Corbet,
Shuah Khan, Maarten Lankhorst, Maxime Ripard, Natalie Vock,
Tvrtko Ursulin, cgroups, linux-kernel, linux-mm, linux-doc,
dri-devel, kernel-dev
[-- Attachment #1: Type: text/plain, Size: 1205 bytes --]
Hello Thadeu.
On Wed, May 06, 2026 at 08:58:23AM -0300, Thadeu Lima de Souza Cascardo <cascardo@igalia.com> wrote:
> Just like we have memory.peak, introduce a dmem.peak, which uses the
> page_counter support for that.
>
> It can be written to in order to reset the peak, but different from
> memory.peak, which expects any write, dmem.peak expects the region name to
> be written to it. That region peak is the one that is reset.
>
> That requires ofp_peak to carry a pointer to the pool that was reset.
(It'd be nicer to have generic data in that generic structure, at least
some void *priv. But see below.)
> Writing a different region name will reset the different region and make
> the original region peak get back to its non-reset value.
I'm slightly confused by this fds x pool matricity when there's only
a single slot in cgroup_file_ctx::cgroup_of_peak.
The intended use case is that users should maintain one fd per pool and
not mix it up?
This stanza would better fit to cgroup-v2.rst proper than the commit
message. Or make it simpler and start with non-resettable peak file
(like memory.peak had started too) and see how it fares. WDYT?
Thanks,
Michal
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] cgroup/dmem: introduce a peak file
2026-05-06 13:53 ` [PATCH 0/2] " Michal Koutný
@ 2026-05-06 14:18 ` Thadeu Lima de Souza Cascardo
2026-05-06 15:09 ` Michal Koutný
0 siblings, 1 reply; 9+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2026-05-06 14:18 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Jonathan Corbet,
Shuah Khan, Maarten Lankhorst, Maxime Ripard, Natalie Vock,
Tvrtko Ursulin, cgroups, linux-kernel, linux-mm, linux-doc,
dri-devel, kernel-dev
On Wed, May 06, 2026 at 03:53:59PM +0200, Michal Koutný wrote:
> Hello Thadeu.
>
> On Wed, May 06, 2026 at 08:58:23AM -0300, Thadeu Lima de Souza Cascardo <cascardo@igalia.com> wrote:
> > Just like we have memory.peak, introduce a dmem.peak, which uses the
> > page_counter support for that.
> >
> > It can be written to in order to reset the peak, but different from
> > memory.peak, which expects any write, dmem.peak expects the region name to
> > be written to it. That region peak is the one that is reset.
> >
> > That requires ofp_peak to carry a pointer to the pool that was reset.
>
> (It'd be nicer to have generic data in that generic structure, at least
> some void *priv. But see below.)
>
I used void *, at first, but as the only current use is for the pool and as
mixing different uses may lead to misuse, I thought it would be safer to
use the type directly. This has been pointed out before for other members
of cgroup_file_ctx. See [1].
> > Writing a different region name will reset the different region and make
> > the original region peak get back to its non-reset value.
>
> I'm slightly confused by this fds x pool matricity when there's only
> a single slot in cgroup_file_ctx::cgroup_of_peak.
>
> The intended use case is that users should maintain one fd per pool and
> not mix it up?
> This stanza would better fit to cgroup-v2.rst proper than the commit
> message. Or make it simpler and start with non-resettable peak file
> (like memory.peak had started too) and see how it fares. WDYT?
>
That is also due to the limitation that each file descriptor has a single
linked list under cgroup_file_ctx::cgroup_of_peak. To allow for all the
pools to be reset at once, we would need one list per file descriptor.
But, on the other hand, as you pointed out, this leads to the flexibility
of being able to reset only one pool, while leaving the others as is.
Whereas, if one needs to reset all pools, they can use one file descriptor
per region.
I started with a non-resettable peak file, but as memory.peak can be reset,
I added that feature too. If we want to merge a non-resettable support
ealier and need to take longer to discuss how to work on the resettable
support given the above, I can resubmit. But I guess we can see if we can
reach an agreement sonner rather than later.
Thanks.
Cascardo.
>
> Thanks,
> Michal
[1] https://lore.kernel.org/all/CAHk-=wgiYkECT=hZRKj8ZwfBPw2Uz=gpOGBGd4ny0KYhSsjC0w@mail.gmail.com/
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] cgroup/dmem: introduce a peak file
2026-05-06 14:18 ` Thadeu Lima de Souza Cascardo
@ 2026-05-06 15:09 ` Michal Koutný
0 siblings, 0 replies; 9+ messages in thread
From: Michal Koutný @ 2026-05-06 15:09 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Jonathan Corbet,
Shuah Khan, Maarten Lankhorst, Maxime Ripard, Natalie Vock,
Tvrtko Ursulin, cgroups, linux-kernel, linux-mm, linux-doc,
dri-devel, kernel-dev
[-- Attachment #1: Type: text/plain, Size: 1514 bytes --]
On Wed, May 06, 2026 at 11:18:26AM -0300, Thadeu Lima de Souza Cascardo <cascardo@igalia.com> wrote:
> I used void *, at first, but as the only current use is for the pool and as
> mixing different uses may lead to misuse, I thought it would be safer to
> use the type directly. This has been pointed out before for other members
> of cgroup_file_ctx. See [1].
That mail reacts to union overlaps and pointer vs embedded struct
allocations. Correct me if I missed your part.
I agree that having properly typed pointer is safer.
cgroup_file_ctx sub-structs are for generic cgroup files. But here
somehow a specific controller needs propagated to the generic member.
What about storing also the `list_head *watchers` inside `struct
cgroup_of_peak` and each subsys would manage it as needed?
(ofp->watchers == NULL could also substitute ofp->value ==
OFP_PEAK_UNSET)
> I started with a non-resettable peak file, but as memory.peak can be reset,
> I added that feature too.
At the same time pids.peak has survived without reset option till today.
> If we want to merge a non-resettable support ealier and need to take
> longer to discuss how to work on the resettable support given the
> above, I can resubmit. But I guess we can see if we can reach an
> agreement sonner rather than later.
What kind of users do you envision (i.e. would they need resets at all)?
Anyway, the behavior should be explained in cgroup-v2.rst since that's
where they'll look for it.
HTH,
Michal
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Claude review: cgroup/dmem: introduce a peak file
2026-05-06 11:58 [PATCH 0/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
` (2 preceding siblings ...)
2026-05-06 13:53 ` [PATCH 0/2] " Michal Koutný
@ 2026-05-07 3:37 ` Claude Code Review Bot
3 siblings, 0 replies; 9+ messages in thread
From: Claude Code Review Bot @ 2026-05-07 3:37 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: cgroup/dmem: introduce a peak file
Author: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Patches: 6
Reviewed: 2026-05-07T13:37:55.340313
---
This 2-patch series adds a `dmem.peak` cgroup file, mirroring the existing `memory.peak` functionality for the DRM device-memory cgroup controller. Patch 1 extracts the peak-reset logic from memcontrol into a shared helper; Patch 2 uses it to implement per-region peak tracking in dmem.
**The overall approach is sound and the core logic is correct.** The reference counting is carefully handled, concurrent read/write races are safe due to appropriate use of `READ_ONCE`/`WRITE_ONCE` and fallback to global watermark, and the per-fd peak semantics correctly mirror the existing memory.peak behavior.
There are a few design concerns worth discussing, mainly around the addition of a dmem-specific `pool` pointer into the generic `cgroup_of_peak` struct, and the unused `seq_file *` parameter added to four getter functions. Neither is a correctness issue, but both are layering concerns that reviewers from the cgroup/mm side may push back on.
No bugs found. The series is in good shape for a v1.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 9+ messages in thread
* Claude review: mm/page_counter: decouple peak_reset from peak_write
2026-05-06 11:58 ` [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write Thadeu Lima de Souza Cascardo
@ 2026-05-07 3:37 ` Claude Code Review Bot
0 siblings, 0 replies; 9+ messages in thread
From: Claude Code Review Bot @ 2026-05-07 3:37 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This patch extracts the core peak-reset logic from `peak_write()` in `mm/memcontrol.c` into a new shared `of_peak_reset()` helper in `kernel/cgroup/cgroup.c`, and refactors the memcontrol callers to use it.
**Correctness: Good.** The extracted function is a faithful copy of the original logic. Both `memory_peak_write()` and `swap_peak_write()` now inline the lock acquire/release around the shared helper, maintaining identical locking semantics.
**Minor issues:**
1. **Subject prefix mismatch:** The commit title says `mm/page_counter:` but the new function is added to `kernel/cgroup/cgroup.c`, not to `mm/page_counter.c`. A more accurate prefix would be `cgroup:` or `mm/memcontrol:` since the refactoring touches `mm/memcontrol.c` and `kernel/cgroup/cgroup.c`.
2. **Header dependency added to `cgroup-defs.h`:**
```c
+#include <linux/page_counter.h>
```
`cgroup-defs.h` is included very widely. Adding `page_counter.h` there pulls the `page_counter` struct definition into every compilation unit that includes cgroup headers. Consider whether a forward declaration or moving the `of_peak_reset()` declaration to a separate header (e.g., `page_counter.h` itself) would be cleaner. The cgroup-defs.h header is included from sched.h paths, so this could have non-trivial build-time impact.
3. **The `of_peak_reset()` function docstring** is well-written and accurately describes the contract. The requirement that "a lock must be used to protect @watchers" is correctly documented.
4. **No `EXPORT_SYMBOL` for `of_peak_reset()`:** This is fine if `CONFIG_CGROUP_DMEM` is `bool` (built-in only), but worth verifying. If dmem can ever be built as a module, this would need an export.
**The refactoring is clean and preserves behavior.** No concerns about the functional changes.
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 9+ messages in thread
* Claude review: cgroup/dmem: introduce a peak file
2026-05-06 11:58 ` [PATCH 2/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
@ 2026-05-07 3:37 ` Claude Code Review Bot
0 siblings, 0 replies; 9+ messages in thread
From: Claude Code Review Bot @ 2026-05-07 3:37 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
This is the main patch, adding `dmem.peak` file support with per-region, per-fd peak tracking.
**Correctness: Good.** The peak tracking semantics are correct:
- Global watermark (`pool->cnt.watermark`) is reported for regions the fd has not reset.
- Per-fd peak (`max(ofp->value, pool->cnt.local_watermark)`) is reported for the region the fd has reset.
- Writing a different region name correctly "detaches" from the old region (reverting to global watermark) and "attaches" to the new one.
**Reference counting is correct:** `get_cg_pool_unlocked()` acquires a pool ref that is transferred to `ofp->pool`. `dmem_cgroup_region_peak_remove()` drops it. The region ref from `dmemcg_get_region_by_name()` is dropped at `out_put`. The release path handles both the fast path (no writes) and the watcher-cleanup path.
**Design concerns:**
1. **`pool` pointer in generic `cgroup_of_peak` struct:**
```c
struct cgroup_of_peak {
unsigned long value;
struct list_head list;
+ struct dmem_cgroup_pool_state *pool;
};
```
This adds a DRM-subsystem-specific pointer to a struct used by the generic cgroup infrastructure and the memory controller. Every `memory.peak` open file now carries an unused `struct dmem_cgroup_pool_state *pool`. While harmless (it's zero-initialized via `kzalloc` of the containing `cgroup_file_ctx`), this is a layering violation that cgroup/mm maintainers may object to. Consider using a wrapper struct in dmem.c or adding a `void *private` to `cgroup_of_peak` instead.
2. **Unused `struct seq_file *sf` parameter added to four functions:**
```c
-static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
+static u64 get_resource_low(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
```
`get_resource_low()`, `get_resource_min()`, `get_resource_max()`, and `get_resource_current()` all gain an `sf` parameter they never use, solely to match the callback signature needed by `get_resource_peak()`. This is a code smell. An alternative would be to have the show function pass a context struct or to handle the peak case with a separate callback type. At minimum, marking the parameter `__always_unused` would signal the intent.
3. **`get_resource_peak()` reads `ofp` without any lock:**
```c
static u64 get_resource_peak(struct seq_file *sf, struct dmem_cgroup_pool_state *pool)
{
struct cgroup_of_peak *ofp = of_peak(sf->private);
...
of_pool = READ_ONCE(ofp->pool);
fd_peak = READ_ONCE(ofp->value);
if (of_pool != pool || fd_peak == OFP_PEAK_UNSET)
peak = pool->cnt.watermark;
else
peak = max(fd_peak, READ_ONCE(pool->cnt.local_watermark));
```
This is intentionally lockless for the read path, relying on `READ_ONCE` for individual fields. There is a window where `of_pool` could point to a freed pool (between `dmem_cgroup_region_peak_remove()` setting it to NULL and a concurrent read having already loaded the old value). However, since the pointer is only used for an equality comparison against the valid `pool` from the iterator — never dereferenced — this is safe. The fallback to `pool->cnt.watermark` on mismatch makes this robust. This is fine but worth a brief comment in the code explaining why the lockless read is safe.
4. **The `dmem_cgroup_region_peak_remove()` function does not check `ofp->value == OFP_PEAK_UNSET`:**
```c
static void dmem_cgroup_region_peak_remove(struct cgroup_of_peak *ofp)
{
struct dmem_cgroup_pool_state *pool;
...
pool = xchg(&ofp->pool, NULL);
if (!pool)
return;
...
```
It relies on `ofp->pool == NULL` for the "no prior write" case, which is correct since `pool` is only set in the write path. But `dmem_cgroup_region_peak_release()` additionally checks `ofp->value == OFP_PEAK_UNSET` as a fast path before calling `remove`. This is slightly redundant but harmless — the two checks are equivalent for the no-write case.
5. **Documentation is good:**
```
+ dmem.peak
+ A readwrite nested-keyed file that exists on non-root cgroups.
+ ...
```
Clear and follows the existing documentation style. The semantics are well-described.
6. **Initialization in `alloc_pool_single()`:**
```c
+ INIT_LIST_HEAD(&pool->peaks);
```
Correctly initializes the peaks watcher list for new pools. The `spin_lock_init(&dmemcs->peaks_lock)` in `dmemcs_alloc()` is also correct.
**One question for the author:** When multiple fds reset the same region, the `local_watermark` is lowered to current usage on each reset. This means fd1's peak could temporarily appear lower than its actual peak-since-reset if fd2 resets afterward at a lower usage point — until new allocations push `local_watermark` back up. This is the same behavior as `memory.peak` so it's consistent, but worth confirming that this is the intended semantic for dmem as well, since device memory allocation patterns may differ from system memory.
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-05-07 3:37 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-06 11:58 [PATCH 0/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
2026-05-06 11:58 ` [PATCH 1/2] mm/page_counter: decouple peak_reset from peak_write Thadeu Lima de Souza Cascardo
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
2026-05-06 11:58 ` [PATCH 2/2] cgroup/dmem: introduce a peak file Thadeu Lima de Souza Cascardo
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
2026-05-06 13:53 ` [PATCH 0/2] " Michal Koutný
2026-05-06 14:18 ` Thadeu Lima de Souza Cascardo
2026-05-06 15:09 ` Michal Koutný
2026-05-07 3:37 ` Claude review: " Claude Code Review Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox