[PATCH] drm/amdgpu: clamp user gartsize against device capacity

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: clamp user gartsize against device capacity
@ 2026-05-14 19:49 Ahmed Elmetwally
  2026-05-16  0:27 ` Claude review: " Claude Code Review Bot
  2026-05-16  0:27 ` Claude Code Review Bot
  0 siblings, 2 replies; 3+ messages in thread
From: Ahmed Elmetwally @ 2026-05-14 19:49 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig
  Cc: airlied, simona, amd-gfx, dri-devel, linux-kernel,
	Ahmed Elmetwally

When the user-supplied amdgpu.gartsize= module parameter requests a GART
aperture so large that the resulting page-table BO cannot be allocated
from VRAM, gmc_v11_0_sw_init() aborts with -ENOMEM during the kernel BO
pin in amdgpu_gart_table_vram_alloc(), which fails amdgpu probe entirely
and leaves the device without a /dev/dri node.

The GART page table is 8 bytes per AMDGPU_GPU_PAGE_SIZE-sized page,
i.e. table_size = gart_size / 512. The table BO is allocated from real
VRAM. On small-VRAM SoCs (APUs with stolen VRAM -- Strix Halo at ~1 GiB
is the case at hand) a user-supplied gartsize that produces a table BO
larger than what can be pinned in VRAM is silently accepted today and
only fails far downstream.

Reproducer on Strix Halo (gfx1151, 1 GiB stolen VRAM):

  /etc/modprobe.d/amdgpu-tuning.conf:
    options amdgpu gartsize=262144     # 256 GiB -> 512 MiB page table

  dmesg:
    amdgpu: [gmc_v11_0]*ERROR* GART aperture is needed by the driver but
            no memory has been pinned for it ...
    amdgpu 0000:c6:00.0: amdgpu: SW IP initialize failed
    amdgpu 0000:c6:00.0: amdgpu: sw_init of IP block <gmc_v11_0> failed -12

Recovery requires booting with amdgpu.gartsize=N on the kernel command
line -- i.e. the user must already know the cause, from rescue media,
with the GPU offline.

The user-supplied gartsize is a tuning hint, not a correctness
requirement. Compute the maximum sensible value such that the page-table
BO fits within real_vram_size / 8 (leaves 7/8 of VRAM for everything
else), and if the user value exceeds it, log a warning and fall back to
the per-IP auto default. With this patch the same modprobe.d entry
above produces:

  amdgpu 0000:c6:00.0: amdgpu: amdgpu.gartsize=262144 MiB exceeds device
    capacity (real_vram=1024 MiB, max sensible=65536 MiB); clamping to
    default 512 MiB

and the device probes normally.

The helper lives in amdgpu_gmc.c so the other gmc_v*_0 backends can
adopt it in follow-up patches; this patch only wires gmc_v11_0 because
that is where the failure was observed.

Signed-off-by: Ahmed Elmetwally <en22ue@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 47 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  6 +---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -330,6 +330,53 @@ void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
 			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
 }
+
+/**
+ * amdgpu_gmc_validate_gart_size - clamp amdgpu.gartsize against VRAM capacity
+ *
+ * @adev:        amdgpu device (real_vram_size must already be populated)
+ * @user_mb:     value of the amdgpu_gart_size module parameter (MiB),
+ *               or -1 for auto
+ * @default_mb:  per-IP auto default in MiB
+ *
+ * The GART page table is allocated from VRAM and sized as
+ * gart_size / AMDGPU_GPU_PAGE_SIZE * sizeof(u64). A typoed or mis-pasted
+ * amdgpu.gartsize value (e.g. 262144 MiB on a 1 GiB-VRAM APU) produces a
+ * page-table BO that cannot be pinned, aborting GPU probe. Cap the page
+ * table at 1/8 of real VRAM, warn, and fall back to the auto default.
+ * Returns gart_size in bytes.
+ */
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb)
+{
+	u64 want_bytes, max_bytes;
+
+	if (user_mb == -1)
+		return (u64)default_mb << 20;
+
+	want_bytes = (u64)user_mb << 20;
+
+	/*
+	 * page-table BO bytes = gart_bytes / AMDGPU_GPU_PAGE_SIZE * 8
+	 * Constraint: page-table BO <= real_vram_size / 8
+	 *   gart_bytes <= (real_vram_size / 8) * (AMDGPU_GPU_PAGE_SIZE / 8)
+	 */
+	max_bytes = (adev->gmc.real_vram_size / 8) *
+		    (AMDGPU_GPU_PAGE_SIZE / 8);
+
+	if (want_bytes > max_bytes) {
+		dev_warn(adev->dev,
+			 "amdgpu.gartsize=%u MiB exceeds device capacity "
+			 "(real_vram=%llu MiB, max sensible=%llu MiB); "
+			 "clamping to default %u MiB\n",
+			 user_mb,
+			 adev->gmc.real_vram_size >> 20,
+			 max_bytes >> 20,
+			 default_mb);
+		return (u64)default_mb << 20;
+	}
+
+	return want_bytes;
+}

 /**
  * amdgpu_gmc_agp_location - try to find AGP location
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -416,6 +416,8 @@ void amdgpu_gmc_vram_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
 void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 			      struct amdgpu_gmc *mc,
 			      enum amdgpu_gart_placement gart_placement);
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb);
 void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
 			     struct amdgpu_gmc *mc);
 void amdgpu_gmc_set_agp_default(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -708,11 +708,7 @@ static int gmc_v11_0_mc_init(struct amdgpu_device *adev)
 	if (adev->gmc.visible_vram_size > adev->gmc.real_vram_size)
 		adev->gmc.visible_vram_size = adev->gmc.real_vram_size;

-	/* set the gart size */
-	if (amdgpu_gart_size == -1)
-		adev->gmc.gart_size = 512ULL << 20;
-	else
-		adev->gmc.gart_size = (u64)amdgpu_gart_size << 20;
+	adev->gmc.gart_size = amdgpu_gmc_validate_gart_size(adev,
+				amdgpu_gart_size, 512);

 	gmc_v11_0_vram_gtt_location(adev, &adev->gmc);

--
2.45.0

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Claude review: drm/amdgpu: clamp user gartsize against device capacity
  2026-05-14 19:49 [PATCH] drm/amdgpu: clamp user gartsize against device capacity Ahmed Elmetwally
@ 2026-05-16  0:27 ` Claude Code Review Bot
  2026-05-16  0:27 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  0:27 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: drm/amdgpu: clamp user gartsize against device capacity
Author: Ahmed Elmetwally <en22ue@gmail.com>
Patches: 1
Reviewed: 2026-05-16T10:27:15.093189

---

**This patch is based on a stale tree and will not apply.** The upstream code has already been refactored: the inline gart size logic in each `gmc_v*_0.c` backend was consolidated into a shared `amdgpu_gmc_set_gart_size()` helper (see `amdgpu_gmc.c:283`). The patch's diff context in `gmc_v11_0.c` expects the old inline code (`if (amdgpu_gart_size == -1) ... else ...`) which no longer exists — the current tree has a single call to `amdgpu_gmc_set_gart_size(adev, SZ_512M)` at line 712.

The underlying problem (an absurdly large `amdgpu.gartsize` parameter causing probe failure) is real and the commit message is well-written with a clear reproducer. However, the correct fix should modify the existing `amdgpu_gmc_set_gart_size()` function rather than introducing a parallel `amdgpu_gmc_validate_gart_size()` helper. This would fix all gmc backends (v6 through v12) in one shot rather than requiring per-backend follow-up patches.

Additionally, the patch has several technical issues detailed below.

---

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Claude review: drm/amdgpu: clamp user gartsize against device capacity
  2026-05-14 19:49 [PATCH] drm/amdgpu: clamp user gartsize against device capacity Ahmed Elmetwally
  2026-05-16  0:27 ` Claude review: " Claude Code Review Bot
@ 2026-05-16  0:27 ` Claude Code Review Bot
  1 sibling, 0 replies; 3+ messages in thread
From: Claude Code Review Bot @ 2026-05-16  0:27 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**1. Patch does not apply — based on stale tree (blocking)**

The patch modifies `gmc_v11_0.c` expecting the old inline pattern:
```c
-	/* set the gart size */
-	if (amdgpu_gart_size == -1)
-		adev->gmc.gart_size = 512ULL << 20;
-	else
-		adev->gmc.gart_size = (u64)amdgpu_gart_size << 20;
```
This code was already refactored into `amdgpu_gmc_set_gart_size()` and replaced with:
```c
	amdgpu_gmc_set_gart_size(adev, SZ_512M);
```
The patch must be rebased onto the current tree.

**2. Should modify the existing helper, not add a new one (design)**

The validation logic belongs inside the existing `amdgpu_gmc_set_gart_size()` at `amdgpu_gmc.c:283`. This would:
- Fix all gmc backends (v6–v12) automatically instead of requiring per-backend wiring
- Avoid a parallel API that does almost the same thing
- Eliminate the need for the "follow-up patches" the commit message mentions

**3. Missing `smu_prv_buffer_size` accounting (bug)**

The existing `amdgpu_gmc_set_gart_size()` adds `adev->pm.smu_prv_buffer_size` to the default GART size:
```c
	adev->gmc.gart_size = default_size + adev->pm.smu_prv_buffer_size;
```
The new `amdgpu_gmc_validate_gart_size()` returns `(u64)default_mb << 20` without this addition. This is a functional regression in the auto-default path.

**4. Falls back to default instead of clamping (questionable policy)**

When the user requests too much, the function falls back to the hard-coded default rather than clamping to the computed maximum:
```c
	if (want_bytes > max_bytes) {
		...
		return (u64)default_mb << 20;
	}
```
The function name says "clamp" and the commit message says "clamping," but it doesn't actually clamp — it resets to default. If the user asked for 64 GiB and the maximum is 65536 MiB (64 GiB), they might reasonably expect the maximum to be used, not a fall-back to 512 MiB. The user explicitly set a large value for a reason (e.g., compute workloads); clamping to `max_bytes` would better honor their intent.

**5. `user_mb` type is `int` but format specifier uses `%u` (minor)**

```c
	int user_mb, u32 default_mb)
	...
		dev_warn(adev->dev,
			 "amdgpu.gartsize=%u MiB exceeds device capacity "
```
`user_mb` is `int` but printed with `%u`. Although negative values are handled by the `== -1` check, a negative value other than -1 would print as a large unsigned number. Use `%d` for consistency with the type.

**6. Kernel style: avoid splitting quoted strings across lines (minor)**

```c
		dev_warn(adev->dev,
			 "amdgpu.gartsize=%u MiB exceeds device capacity "
			 "(real_vram=%llu MiB, max sensible=%llu MiB); "
			 "clamping to default %u MiB\n",
```
Kernel convention (per `checkpatch.pl`) prefers keeping format strings as a single searchable string even if it exceeds 80 columns, so that `grep` can find the message in the source. Consider keeping it on one line.

**7. The 1/8 VRAM limit for page table is reasonable but arbitrary (nit)**

The constraint caps the page-table BO at `real_vram_size / 8`. The math is:
- page_table_bytes = gart_bytes / 4096 * 8 = gart_bytes / 512
- constraint: gart_bytes / 512 <= real_vram / 8
- therefore: gart_bytes <= real_vram * 64

For 1 GiB VRAM: max gart = 64 GiB = 65536 MiB, which seems generous. The 1/8 reservation is not justified in the code or commit message — a brief rationale for why 1/8 was chosen would be helpful, especially since the page table only needs to be sized to fit, and the real constraint is that it must be ≤ real_vram_size (not 1/8 of it). A tighter bound (e.g., 1/4 or even 1/2) could also be justified if actual VRAM pressure from other pinned BOs is a concern.

**Summary:** The problem is valid and worth fixing, but this patch needs to be rebased and reworked to modify `amdgpu_gmc_set_gart_size()` directly. That would be a smaller, simpler change that fixes all backends at once.

---
Generated by Claude Code Patch Reviewer

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-05-16  0:27 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 19:49 [PATCH] drm/amdgpu: clamp user gartsize against device capacity Ahmed Elmetwally
2026-05-16  0:27 ` Claude review: " Claude Code Review Bot
2026-05-16  0:27 ` Claude Code Review Bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox