* [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
@ 2026-04-06 21:49 Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-06 21:49 UTC (permalink / raw)
To: linux-media, dri-devel, linaro-mm-sig
Cc: linux-kernel, Xueyuan Chen, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T . J . Mercier, Christian König,
Barry Song
From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
with a more efficient nested loop approach.
Instead of iterating page by page, we now iterate through the scatterlist
entries via for_each_sgtable_sg(). Because pages within a single sg entry
are physically contiguous, we can populate the page array in an inner
loop using simple pointer math. This saves a significant amount of time.
The WARN_ON check is also pulled out of the loop to save branch
instructions.
Performance results mapping a 2GB buffer on Radxa O6:
- Before: ~1440000 ns
- After: ~232000 ns
(~84% reduction in iteration time, or ~6.2x faster)
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Xueyuan Chen <Xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
drivers/dma-buf/heaps/system_heap.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index b3650d8fd651..769f01f0cc96 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -224,16 +224,21 @@ static void *system_heap_do_vmap(struct system_heap_buffer *buffer)
int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
struct page **pages = vmalloc(sizeof(struct page *) * npages);
struct page **tmp = pages;
- struct sg_page_iter piter;
void *vaddr;
+ u32 i, j, count;
+ struct page *base_page;
+ struct scatterlist *sg;
if (!pages)
return ERR_PTR(-ENOMEM);
- for_each_sgtable_page(table, &piter, 0) {
- WARN_ON(tmp - pages >= npages);
- *tmp++ = sg_page_iter_page(&piter);
+ for_each_sgtable_sg(table, sg, i) {
+ base_page = sg_page(sg);
+ count = sg->length >> PAGE_SHIFT;
+ for (j = 0; j < count; j++)
+ *tmp++ = base_page + j;
}
+ WARN_ON(tmp - pages != npages);
vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
vfree(pages);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
@ 2026-04-07 7:57 ` Christian König
2026-04-07 11:29 ` Barry Song
2026-04-12 4:23 ` Claude review: " Claude Code Review Bot
2026-04-12 4:23 ` Claude Code Review Bot
2 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2026-04-07 7:57 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-media, dri-devel, linaro-mm-sig
Cc: linux-kernel, Xueyuan Chen, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T . J . Mercier
On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
>
> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> with a more efficient nested loop approach.
>
> Instead of iterating page by page, we now iterate through the scatterlist
> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> are physically contiguous, we can populate the page array with a in an
> inner loop using simple pointer math. This save a lot of time.
>
> The WARN_ON check is also pulled out of the loop to save branch
> instructions.
>
> Performance results mapping a 2GB buffer on Radxa O6:
> - Before: ~1440000 ns
> - After: ~232000 ns
> (~84% reduction in iteration time, or ~6.2x faster)
Well, the real question is: why do you care about vmap performance?
That should basically only be used for fbdev emulation (except for vmwgfx), and we absolutely don't care about performance there.
Regards,
Christian.
>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
> Cc: Brian Starkey <Brian.Starkey@arm.com>
> Cc: John Stultz <jstultz@google.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> drivers/dma-buf/heaps/system_heap.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index b3650d8fd651..769f01f0cc96 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -224,16 +224,21 @@ static void *system_heap_do_vmap(struct system_heap_buffer *buffer)
> int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
> struct page **pages = vmalloc(sizeof(struct page *) * npages);
> struct page **tmp = pages;
> - struct sg_page_iter piter;
> void *vaddr;
> + u32 i, j, count;
> + struct page *base_page;
> + struct scatterlist *sg;
>
> if (!pages)
> return ERR_PTR(-ENOMEM);
>
> - for_each_sgtable_page(table, &piter, 0) {
> - WARN_ON(tmp - pages >= npages);
> - *tmp++ = sg_page_iter_page(&piter);
> + for_each_sgtable_sg(table, sg, i) {
> + base_page = sg_page(sg);
> + count = sg->length >> PAGE_SHIFT;
> + for (j = 0; j < count; j++)
> + *tmp++ = base_page + j;
> }
> + WARN_ON(tmp - pages != npages);
>
> vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> vfree(pages);
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-07 7:57 ` Christian König
@ 2026-04-07 11:29 ` Barry Song
0 siblings, 0 replies; 5+ messages in thread
From: Barry Song @ 2026-04-07 11:29 UTC (permalink / raw)
To: Christian König
Cc: linux-media, dri-devel, linaro-mm-sig, linux-kernel, Xueyuan Chen,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T . J . Mercier
On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> > From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> >
> > Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> > with a more efficient nested loop approach.
> >
> > Instead of iterating page by page, we now iterate through the scatterlist
> > entries via for_each_sgtable_sg(). Because pages within a single sg entry
> > are physically contiguous, we can populate the page array with a in an
> > inner loop using simple pointer math. This save a lot of time.
> >
> > The WARN_ON check is also pulled out of the loop to save branch
> > instructions.
> >
> > Performance results mapping a 2GB buffer on Radxa O6:
> > - Before: ~1440000 ns
> > - After: ~232000 ns
> > (~84% reduction in iteration time, or ~6.2x faster)
>
> Well real question is why do you care about the vmap performance?
>
> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
I agree that in mainline, dma_buf_vmap is not used very often.
Here’s what I was able to find:
1  drivers/dma-buf/dma-buf.c:1638  <<dma_buf_vmap_unlocked>>
       ret = dma_buf_vmap(dmabuf, map);
2  drivers/gpu/drm/drm_gem_shmem_helper.c:376  <<drm_gem_shmem_vmap_locked>>
       ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
3  drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c:85  <<etnaviv_gem_prime_vmap_impl>>
       ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
4  drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:433  <<map_external>>
       ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
5  drivers/gpu/drm/vmwgfx/vmwgfx_gem.c:88  <<vmw_gem_vmap>>
       ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
However, in the Android ecosystem, system_heap and similar heaps
are widely used across camera, NPU, and media drivers. Many of these
drivers are not in mainline but do use vmap() in real code paths.
Here are a few examples from MTK platforms:
1:
[ 6.689849] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 6.689859] dma_buf_vmap_unlocked+0xb8/0x130
[ 6.689861] aov_core_init+0x310/0x718 [mtk_aov
96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
[ 6.689873] mtk_aov_probe+0x434/0x5b4 [mtk_aov
96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
2:
[ 116.181643] __vmap_pages_range_noflush+0x7c4/0x814
[ 116.181645] vmap+0xb4/0x148
[ 116.181647] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 116.181651] dma_buf_vmap_unlocked+0xb8/0x130
[ 116.181653] mtk_cam_vb2_vaddr+0xa0/0xfc [mtk_cam_isp8s
0cf9be6c773a8f14aab9db9ebf53feacb499846a]
[ 116.181682] vb2_plane_vaddr+0x5c/0x78
[ 116.181684] mtk_cam_job_fill_ipi_frame+0xa8c/0x128c [mtk_cam_isp8s
0cf9be6c773a8f14aab9db9ebf53feacb499846a]
3:
[ 116.306178] __vmap_pages_range_noflush+0x7c4/0x814
[ 116.306183] vmap+0xb4/0x148
[ 116.306187] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 116.306209] dma_buf_vmap_unlocked+0xb8/0x130
[ 116.306212] apu_sysmem_alloc+0x168/0x360 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306468] mdw_mem_alloc+0xd8/0x314 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306591] mdw_mem_pool_chunk_add+0x11c/0x400 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306712] mdw_mem_pool_create+0x190/0x2c8 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306833] mdw_drv_open+0x21c/0x47c [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
While we may want to encourage more of these drivers to go upstream,
some of that is beyond our control (it depends on the individual SoC
vendors), but we can at least contribute upstream ourselves.
Best Regards
Barry
* Claude review: dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
@ 2026-04-12 4:23 ` Claude Code Review Bot
2026-04-12 4:23 ` Claude Code Review Bot
2 siblings, 0 replies; 5+ messages in thread
From: Claude Code Review Bot @ 2026-04-12 4:23 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
Author: "Barry Song (Xiaomi)" <baohua@kernel.org>
Patches: 3
Reviewed: 2026-04-12T14:23:03.509679
---
This is a single-patch optimization for `system_heap_do_vmap()` in the DMA-buf system heap. The patch replaces the `for_each_sgtable_page()` iterator (which calls into `__sg_page_iter_next()` for every single page) with a nested loop that iterates over scatterlist entries and then fills pages using simple pointer arithmetic within each entry. The approach is sound and the performance improvement is plausible -- `for_each_sgtable_page()` involves non-trivial per-page overhead (function call to `__sg_page_iter_next()`, sg boundary checking, etc.) that is completely unnecessary when the caller knows each sg entry's pages are contiguous.
The optimization is **correct** for the system heap because every sg entry is created with offset=0 and length=page_size(page) (see `system_heap_allocate()` at line 394 of the current source), guaranteeing page-aligned lengths and contiguous pages within each entry.
The change to move the `WARN_ON` out of the loop and convert it from `>=` (overflow guard) to `!=` (exact match) is also a good improvement -- it catches both over-counting *and* under-counting.
**Minor issues noted below, but overall this is a clean, well-motivated optimization.**
---
Generated by Claude Code Patch Reviewer
* Claude review: dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
2026-04-12 4:23 ` Claude review: " Claude Code Review Bot
@ 2026-04-12 4:23 ` Claude Code Review Bot
2 siblings, 0 replies; 5+ messages in thread
From: Claude Code Review Bot @ 2026-04-12 4:23 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Correctness: Good**
The core transformation is equivalent. The original code does:
```c
for_each_sgtable_page(table, &piter, 0) {
*tmp++ = sg_page_iter_page(&piter);
}
```
Where `sg_page_iter_page()` returns `sg_page(piter->sg) + piter->sg_pgoffset`.
The new code does:
```c
for_each_sgtable_sg(table, sg, i) {
base_page = sg_page(sg);
count = sg->length >> PAGE_SHIFT;
for (j = 0; j < count; j++)
*tmp++ = base_page + j;
}
```
These are functionally identical for the system heap's sg entries (offset=0, page-aligned lengths).
**Minor Issues:**
1. **Type of loop counter `i`**: The variable `i` is declared as `u32`, but `for_each_sgtable_sg` expands to `for_each_sg` which expects `int` for the counter. Looking at the macro:
```c
#define for_each_sg(sglist, sg, nr, __i) \
for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
```
And `orig_nents` is `unsigned int`. Using `u32` will work but is inconsistent with typical kernel usage where `int` is used. Not a bug, but worth noting for style consistency.
2. **Type of `count`**: Using `u32` is fine since `sg->length` is `unsigned int` (32-bit), so `sg->length >> PAGE_SHIFT` always fits in a `u32`. No issue here.
3. **Commit message nit**: "we can populate the page array with a in an inner loop" has a stray "a" -- should be "with an inner loop" or "in an inner loop".
4. **WARN_ON semantic change is an improvement**: The original `WARN_ON(tmp - pages >= npages)` only guards against writing past the array. The new `WARN_ON(tmp - pages != npages)` after the loop detects both overflow *and* underflow (mismatch between `buffer->len` and the total sg lengths). This is strictly better.
5. **No bounds checking during the inner loop**: The original code had the `WARN_ON` inside the loop to catch overflow *before* writing. The new code removes this guard entirely and only checks *after* all writes. If there's ever a bug where the sg entries describe more pages than `npages`, the code will silently write past the `pages` array before the WARN_ON fires. In practice, this is fine for the system heap (the allocator creates the sg table from the same pages, so they must agree), and the performance benefit justifies removing the per-iteration check. But if reviewers feel strongly, an alternative would be `WARN_ON(tmp - pages > npages)` after the outer loop, before the vmap call -- which this patch effectively does with the `!=` check.
6. **Consider whether this pattern should use `sg_page_count()`**: `sg_page_count()` exists only as a static helper inside lib/scatterlist.c, so it is not available to heap drivers; `sg->length >> PAGE_SHIFT` is the standard open-coded idiom. Fine as-is.
**Overall verdict**: The patch is correct and the optimization is well-motivated with good benchmarks. The only actionable feedback is the typo in the commit message. The rest is clean.
---
Generated by Claude Code Patch Reviewer
Thread overview:
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
2026-04-07 11:29 ` Barry Song
2026-04-12 4:23 ` Claude review: " Claude Code Review Bot
2026-04-12 4:23 ` Claude Code Review Bot