public inbox for drm-ai-reviews@public-inbox.freedesktop.org
* [PATCH v4 0/6] drm/amdgpu: SVM VRAM migration via drm_pagemap (XNACK-on)
@ 2026-05-13  9:57 Junhua Shen
  2026-05-13  9:57 ` [PATCH v4 1/6] drm/amdgpu: add VRAM migration infrastructure for drm_pagemap Junhua Shen
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Junhua Shen @ 2026-05-13  9:57 UTC
  To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
	Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
	honglei1.huang, Lingshan.Zhu, simona
  Cc: amd-gfx, dri-devel, Junhua.Shen

This series adds VRAM migration support to amdgpu's SVM (Shared Virtual
Memory) implementation, using the drm_pagemap framework for ZONE_DEVICE
page management and SDMA for data migration.

This is the XNACK-on (GPU fault-driven) version of the migration
series, built on top of the drm_gpusvm-based amdgpu SVM core [1].
The previous v1/v2/v3 postings were XNACK-off (ioctl-driven) and based
on an earlier SVM core; v4 is a rewrite targeting the XNACK-on path.

The implementation follows the Xe driver's approach to TTM eviction:
a synchronous bo_move migrates device-private pages back to system
RAM when TTM needs to evict SVM BOs.

Key design points:
- GPU VRAM registered as ZONE_DEVICE via devm_memremap_pages(),
  wrapped in struct amdgpu_pagemap with drm_pagemap state
- SDMA-based data transfer through GART aperture window for both
  copy_to_devmem and copy_to_ram callbacks
- amdgpu_bo_svm: lightweight BO subtype with drm_pagemap_devmem for
  ZONE_DEVICE page ownership tracking
- Synchronous TTM eviction via drm_pagemap_evict_to_ram() in bo_move,
  following the Xe pattern (no eviction fences needed)
- Migration policy driven by SVM range attributes (preferred location,
  prefetch hints) and GPU fault path

Limitations:
- Single GPU only; multi-GPU migration is not addressed
- No VRAM-to-VRAM (peer GPU) migration

Open issue:
- Unnecessary TTM system memory allocation during eviction: when TTM
  evicts an SVM BO, it allocates a destination system memory resource
  (TTM_PL_SYSTEM) before calling bo_move, then frees it afterwards.
  This allocation is unnecessary because the actual data migration is
  done via drm_pagemap_evict_to_ram() → migrate_device_* which
  migrates device-private pages directly to regular system pages,
  bypassing the TTM-allocated resource entirely. The current TTM
  framework does not support num_placement=0 to skip this redundant
  allocation; this needs further discussion.
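
The call order described in this open issue can be modelled in plain
userspace C as below. This is only a sketch of the sequence, not driver
code: all sketch_* names are hypothetical, malloc()/free() stand in for
the TTM_PL_SYSTEM resource allocation, and sketch_evict_to_ram() stands
in for drm_pagemap_evict_to_ram().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_bo {
	bool pages_in_vram;          /* device-private pages still present */
	unsigned int sys_res_allocs; /* placeholder resources allocated */
};

/* Stand-in for drm_pagemap_evict_to_ram(): pages go straight to RAM. */
static int sketch_evict_to_ram(struct sketch_bo *bo)
{
	bo->pages_in_vram = false;
	return 0;
}

/* Stand-in for the bo_move callback: the destination is never used. */
static int sketch_bo_move(struct sketch_bo *bo, void *dst_res)
{
	(void)dst_res; /* ignored: this is the redundancy in question */
	return sketch_evict_to_ram(bo);
}

/* Stand-in for the TTM eviction path around bo_move. */
static int sketch_ttm_evict(struct sketch_bo *bo)
{
	void *dst_res;
	int ret;

	assert(bo != NULL);

	dst_res = malloc(64); /* models the system memory resource */
	if (!dst_res)
		return -1;
	bo->sys_res_allocs++;

	ret = sketch_bo_move(bo, dst_res);

	free(dst_res); /* freed without ever having been touched */
	return ret;
}
```

In this model the eviction succeeds and the pages leave VRAM, yet the
allocated destination is never read or written, which is the allocation
the series would like to skip.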

Dependencies:

This series applies on top of the amdgpu drm_gpusvm SVM core [1].

[1] https://lore.kernel.org/amd-gfx/20260508075129.1161157-1-honglei1.huang@amd.com/

Changes since v3:
- Rebased on drm_gpusvm-based amdgpu SVM core [1], switching from
  XNACK-off ioctl-driven to XNACK-on GPU fault-driven migration
- Introduced amdgpu_bo_svm subtype with drm_pagemap_devmem embedding
  and two-layer reference counting (GEM refcount + TTM kref)
- Added synchronous TTM eviction via drm_pagemap_evict_to_ram() in
  amdgpu_bo_move(), following the Xe driver pattern
- Added amdgpu_bo_is_amdgpu_bo() check for SVM BOs in TTM path
- Cleaned up container_of macros to follow amdgpu conventions
  (to_amdgpu_bo_svm as #define, devmem_to_amdgpu_bo_svm as inline)

Changes since v2:
- Moved amdgpu_pagemap entirely to amdgpu side, eliminating all KFD
  modifications
- Split commits for better reviewability: separated infrastructure
  from SDMA callbacks, decision layer from integration
- Merged ZONE_DEVICE registration hook into the integration patch

Changes since v1:
- Dropped the eviction fence patch (was 4/6) after Christian König
  pointed out it violates the dma_fence contract
- Refactored migration integration: extracted migration logic into
  new files amdgpu_svm_range_migrate.{c,h}
- Introduced enum amdgpu_svm_migrate_mode (PREFERRED, TO_VRAM,
  TO_SYSMEM, NONE) to make migration intent explicit, replacing
  the _ex functions used in v1
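
To illustrate what an explicit migration mode buys over _ex variants,
here is a minimal userspace sketch. The four modes mirror the names
listed above, but the identifiers, the location enum, and the semantics
assigned to each mode are assumptions made for this example and may
differ from the actual patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum amdgpu_svm_migrate_mode. */
enum sketch_migrate_mode {
	SKETCH_MIGRATE_PREFERRED, /* follow the range's preferred location */
	SKETCH_MIGRATE_TO_VRAM,   /* force migration to VRAM */
	SKETCH_MIGRATE_TO_SYSMEM, /* force migration to system memory */
	SKETCH_MIGRATE_NONE,      /* leave the range where it is */
};

enum sketch_location {
	SKETCH_LOC_SYSMEM,
	SKETCH_LOC_VRAM,
};

/*
 * Resolve a mode to a concrete target. Returns true and sets *target
 * when a migration is needed; returns false when the range should stay
 * at its current location.
 */
static bool sketch_resolve_target(enum sketch_migrate_mode mode,
				  enum sketch_location preferred,
				  enum sketch_location current,
				  enum sketch_location *target)
{
	enum sketch_location want;

	assert(target != NULL);

	switch (mode) {
	case SKETCH_MIGRATE_TO_VRAM:
		want = SKETCH_LOC_VRAM;
		break;
	case SKETCH_MIGRATE_TO_SYSMEM:
		want = SKETCH_LOC_SYSMEM;
		break;
	case SKETCH_MIGRATE_PREFERRED:
		want = preferred;
		break;
	case SKETCH_MIGRATE_NONE:
	default:
		return false;
	}

	if (want == current)
		return false; /* already in place, nothing to migrate */

	*target = want;
	return true;
}
```

With a single mode argument, callers such as a fault handler or a
prefetch path can state their intent at the call site instead of
selecting between separate function variants.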

Previous versions:
v1 (XNACK-off): https://lore.kernel.org/amd-gfx/20260410113146.146212-1-Junhua.Shen@amd.com/
v2 (XNACK-off): https://lore.kernel.org/amd-gfx/20260413103031.181953-1-Junhua.Shen@amd.com/
v3 (XNACK-off): https://lore.kernel.org/amd-gfx/20260427100522.7014-1-Junhua.Shen@amd.com/

Test results:
  Tested on gfx943 (MI300X) and gfx906 (MI60) with XNACK on:
  - KFD test: 95%+ passed.
  - ROCR test: all passed.

Patch overview:
  1/6  Core VRAM migration infrastructure (ZONE_DEVICE registration,
       amdgpu_pagemap, amdgpu_bo_svm subtype, drm_pagemap_ops)
  2/6  SDMA migration callbacks (copy_to_devmem, copy_to_ram,
       populate_devmem_pfn via GART aperture window)
  3/6  Synchronous TTM eviction for SVM BOs (amdgpu_svm_bo_evict
       in bo_move path, amdgpu_bo_is_amdgpu_bo check)
  4/6  SVM range migration helpers (range-level migrate_to_vram /
       migrate_to_sysmem decision layer)
  5/6  Hook up ZONE_DEVICE registration in device init and GPU reset
  6/6  Wire up VRAM migration into SVM range map and GPU fault paths

Junhua Shen (6):
  drm/amdgpu: add VRAM migration infrastructure for drm_pagemap
  drm/amdgpu: implement drm_pagemap SDMA migration callbacks
  drm/amdgpu: implement synchronous TTM eviction for SVM BOs
  drm/amdgpu: add SVM range migration helpers for drm_pagemap
  drm/amdgpu: hook up ZONE_DEVICE registration in device init and reset
  drm/amdgpu: integrate VRAM migration into SVM range map and fault
    paths

 drivers/gpu/drm/amd/amdgpu/Makefile           |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   8 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c   | 831 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h   | 110 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_fault.c |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c |  21 +-
 .../drm/amd/amdgpu/amdgpu_svm_range_migrate.c | 122 +++
 .../drm/amd/amdgpu/amdgpu_svm_range_migrate.h |  47 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  20 +
 13 files changed, 1181 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.h

-- 
2.34.1



end of thread, other threads:[~2026-05-16  2:15 UTC | newest]

Thread overview: 16+ messages
2026-05-13  9:57 [PATCH v4 0/6] drm/amdgpu: SVM VRAM migration via drm_pagemap (XNACK-on) Junhua Shen
2026-05-13  9:57 ` [PATCH v4 1/6] drm/amdgpu: add VRAM migration infrastructure for drm_pagemap Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-13  9:57 ` [PATCH v4 2/6] drm/amdgpu: implement drm_pagemap SDMA migration callbacks Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-13  9:57 ` [PATCH v4 3/6] drm/amdgpu: implement synchronous TTM eviction for SVM BOs Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-13  9:57 ` [PATCH v4 4/6] drm/amdgpu: add SVM range migration helpers for drm_pagemap Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-13  9:57 ` [PATCH v4 5/6] drm/amdgpu: hook up ZONE_DEVICE registration in device init and reset Junhua Shen
2026-05-13 13:47   ` Christian König
2026-05-14  7:33     ` Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-13  9:57 ` [PATCH v4 6/6] drm/amdgpu: integrate VRAM migration into SVM range map and fault paths Junhua Shen
2026-05-16  2:15   ` Claude review: " Claude Code Review Bot
2026-05-16  2:15 ` Claude review: drm/amdgpu: SVM VRAM migration via drm_pagemap (XNACK-on) Claude Code Review Bot
