public inbox for drm-ai-reviews@public-inbox.freedesktop.org
* [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
@ 2026-04-09 12:15 Eliot Courtney
  2026-04-09 22:56 ` John Hubbard
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Eliot Courtney @ 2026-04-09 12:15 UTC (permalink / raw)
  To: Danilo Krummrich, Alexandre Courbot, Alice Ryhl, David Airlie,
	Simona Vetter
  Cc: John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel, Eliot Courtney

Current `Gpu::new` will not unregister SysmemFlush if something fails
after it is created, since it needs manual unregistering. Add a `Drop`
implementation which will clean it up in that case. Maintain the manual
unregister path because it can stay infallible, unlike the Drop path
which depends on revocable access. In the case that `Gpu::new` fails the
access is guaranteed to succeed, however.

Fixes: 6554ad65b589 ("gpu: nova-core: register sysmem flush page")
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
---
 drivers/gpu/nova-core/fb.rs  | 29 ++++++++++++++++++++---------
 drivers/gpu/nova-core/gpu.rs |  7 ++++++-
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/nova-core/fb.rs b/drivers/gpu/nova-core/fb.rs
index bdd5eed760e1..edfbdc9a2512 100644
--- a/drivers/gpu/nova-core/fb.rs
+++ b/drivers/gpu/nova-core/fb.rs
@@ -7,6 +7,7 @@
 
 use kernel::{
     device,
+    devres::Devres,
     dma::CoherentHandle,
     fmt,
     io::Io,
@@ -16,7 +17,10 @@
         Alignment, //
     },
     sizes::*,
-    sync::aref::ARef, //
+    sync::{
+        aref::ARef,
+        Arc, //
+    },
 };
 
 use crate::{
@@ -46,12 +50,14 @@
 /// Because of this, the sysmem flush memory page must be registered as early as possible during
 /// driver initialization, and before any falcon is reset.
 ///
-/// Users are responsible for manually calling [`Self::unregister`] before dropping this object,
-/// otherwise the GPU might still use it even after it has been freed.
+/// Users should call [`Self::unregister`] before unloading to ensure unregistering is infallible.
+/// [`Drop`] performs a best-effort fallback using revocable BAR access.
 pub(crate) struct SysmemFlush {
     /// Chipset we are operating on.
     chipset: Chipset,
     device: ARef<device::Device>,
+    /// MMIO mapping of PCI BAR 0.
+    bar: Arc<Devres<Bar0>>,
     /// Keep the page alive as long as we need it.
     page: CoherentHandle,
 }
@@ -60,6 +66,7 @@ impl SysmemFlush {
     /// Allocate a memory page and register it as the sysmem flush page.
     pub(crate) fn register(
         dev: &device::Device<device::Bound>,
+        devres_bar: Arc<Devres<Bar0>>,
         bar: &Bar0,
         chipset: Chipset,
     ) -> Result<Self> {
@@ -70,18 +77,17 @@ pub(crate) fn register(
         Ok(Self {
             chipset,
             device: dev.into(),
+            bar: devres_bar,
             page,
         })
     }
 
     /// Unregister the managed sysmem flush page.
-    ///
-    /// In order to gracefully tear down the GPU, users must make sure to call this method before
-    /// dropping the object.
     pub(crate) fn unregister(&self, bar: &Bar0) {
         let hal = hal::fb_hal(self.chipset);
+        let registered_dma_handle = hal.read_sysmem_flush_page(bar);
 
-        if hal.read_sysmem_flush_page(bar) == self.page.dma_handle() {
+        if registered_dma_handle == self.page.dma_handle() {
             let _ = hal.write_sysmem_flush_page(bar, 0).inspect_err(|e| {
                 dev_warn!(
                     &self.device,
@@ -89,8 +95,7 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
                     e
                 )
             });
-        } else {
-            // Another page has been registered after us for some reason - warn as this is a bug.
+        } else if registered_dma_handle != 0 {
             dev_warn!(
                 &self.device,
                 "attempt to unregister a sysmem flush page that is not active\n"
@@ -99,6 +104,12 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
     }
 }
 
+impl Drop for SysmemFlush {
+    fn drop(&mut self) {
+        let _ = self.bar.try_access_with(|bar| self.unregister(bar));
+    }
+}
+
 pub(crate) struct FbRange(Range<u64>);
 
 impl FbRange {
diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
index 0f6fe9a1b955..5bad5a055b3b 100644
--- a/drivers/gpu/nova-core/gpu.rs
+++ b/drivers/gpu/nova-core/gpu.rs
@@ -257,7 +257,12 @@ pub(crate) fn new<'a>(
                     .inspect_err(|_| dev_err!(pdev, "GFW boot did not complete\n"))?;
             },
 
-            sysmem_flush: SysmemFlush::register(pdev.as_ref(), bar, spec.chipset)?,
+            sysmem_flush: SysmemFlush::register(
+                pdev.as_ref(),
+                devres_bar.clone(),
+                bar,
+                spec.chipset,
+            )?,
 
             gsp_falcon: Falcon::new(
                 pdev.as_ref(),

---
base-commit: a7a080bb4236ebe577b6776d940d1717912ff6dd
change-id: 20260409-fix-systemflush-de66dc90378a

Best regards,
--  
Eliot Courtney <ecourtney@nvidia.com>



* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
@ 2026-04-09 22:56 ` John Hubbard
  2026-04-10 15:57 ` Gary Guo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: John Hubbard @ 2026-04-09 22:56 UTC (permalink / raw)
  To: Eliot Courtney, Danilo Krummrich, Alexandre Courbot, Alice Ryhl,
	David Airlie, Simona Vetter
  Cc: Alistair Popple, Joel Fernandes, Timur Tabi, rust-for-linux,
	dri-devel, linux-kernel

On 4/9/26 5:15 AM, Eliot Courtney wrote:
> Current `Gpu::new` will not unregister SysmemFlush if something fails
> after it is created, since it needs manual unregistering. Add a `Drop`
> implementation which will clean it up in that case. Maintain the manual
> unregister path because it can stay infallible, unlike the Drop path
> which depends on revocable access. In the case that `Gpu::new` fails the
> access is guaranteed to succeed, however.

Hi Eliot,

The code looks exactly correct to me. I just have some tiny commit
message suggestions:

1. The subject line could be tightened up slightly, to:

    gpu: nova-core: fb: unregister SysmemFlush on boot failure

2. And I'd like to rewrite the commit message body, to approximately this:

If Gpu::new fails after SysmemFlush::register succeeds, the registered
sysmem flush page is never unregistered because SysmemFlush has no Drop
impl and try_pin_init! only drops already-initialized fields on failure.

Add a Drop impl that unregisters through revocable BAR access, which
covers the init-failure path. The manual unregister in Gpu::unbind is
still needed because by the time Drop runs during normal teardown,
devres has already revoked the BAR.

With that,

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard



* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
  2026-04-09 22:56 ` John Hubbard
@ 2026-04-10 15:57 ` Gary Guo
  2026-04-12  1:22 ` Claude review: " Claude Code Review Bot
  2026-04-12  1:22 ` Claude Code Review Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Gary Guo @ 2026-04-10 15:57 UTC (permalink / raw)
  To: Eliot Courtney, Danilo Krummrich, Alexandre Courbot, Alice Ryhl,
	David Airlie, Simona Vetter
  Cc: John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel

On Thu Apr 9, 2026 at 1:15 PM BST, Eliot Courtney wrote:
> Current `Gpu::new` will not unregister SysmemFlush if something fails
> after it is created, since it needs manual unregistering. Add a `Drop`
> implementation which will clean it up in that case. Maintain the manual
> unregister path because it can stay infallible, unlike the Drop path
> which depends on revocable access. In the case that `Gpu::new` fails the
> access is guaranteed to succeed, however.
>
> Fixes: 6554ad65b589 ("gpu: nova-core: register sysmem flush page")
> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> ---
>  drivers/gpu/nova-core/fb.rs  | 29 ++++++++++++++++++++---------
>  drivers/gpu/nova-core/gpu.rs |  7 ++++++-
>  2 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/fb.rs b/drivers/gpu/nova-core/fb.rs
> index bdd5eed760e1..edfbdc9a2512 100644
> --- a/drivers/gpu/nova-core/fb.rs
> +++ b/drivers/gpu/nova-core/fb.rs
> @@ -7,6 +7,7 @@
>  
>  use kernel::{
>      device,
> +    devres::Devres,
>      dma::CoherentHandle,
>      fmt,
>      io::Io,
> @@ -16,7 +17,10 @@
>          Alignment, //
>      },
>      sizes::*,
> -    sync::aref::ARef, //
> +    sync::{
> +        aref::ARef,
> +        Arc, //
> +    },
>  };
>  
>  use crate::{
> @@ -46,12 +50,14 @@
>  /// Because of this, the sysmem flush memory page must be registered as early as possible during
>  /// driver initialization, and before any falcon is reset.
>  ///
> -/// Users are responsible for manually calling [`Self::unregister`] before dropping this object,
> -/// otherwise the GPU might still use it even after it has been freed.
> +/// Users should call [`Self::unregister`] before unloading to ensure unregistering is infallible.
> +/// [`Drop`] performs a best-effort fallback using revocable BAR access.
>  pub(crate) struct SysmemFlush {
>      /// Chipset we are operating on.
>      chipset: Chipset,
>      device: ARef<device::Device>,
> +    /// MMIO mapping of PCI BAR 0.
> +    bar: Arc<Devres<Bar0>>,
>      /// Keep the page alive as long as we need it.
>      page: CoherentHandle,
>  }
> @@ -60,6 +66,7 @@ impl SysmemFlush {
>      /// Allocate a memory page and register it as the sysmem flush page.
>      pub(crate) fn register(
>          dev: &device::Device<device::Bound>,
> +        devres_bar: Arc<Devres<Bar0>>,
>          bar: &Bar0,
>          chipset: Chipset,
>      ) -> Result<Self> {
> @@ -70,18 +77,17 @@ pub(crate) fn register(
>          Ok(Self {
>              chipset,
>              device: dev.into(),
> +            bar: devres_bar,
>              page,
>          })
>      }
>  
>      /// Unregister the managed sysmem flush page.
> -    ///
> -    /// In order to gracefully tear down the GPU, users must make sure to call this method before
> -    /// dropping the object.
>      pub(crate) fn unregister(&self, bar: &Bar0) {
>          let hal = hal::fb_hal(self.chipset);
> +        let registered_dma_handle = hal.read_sysmem_flush_page(bar);
>  
> -        if hal.read_sysmem_flush_page(bar) == self.page.dma_handle() {
> +        if registered_dma_handle == self.page.dma_handle() {
>              let _ = hal.write_sysmem_flush_page(bar, 0).inspect_err(|e| {
>                  dev_warn!(
>                      &self.device,
> @@ -89,8 +95,7 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
>                      e
>                  )
>              });
> -        } else {
> -            // Another page has been registered after us for some reason - warn as this is a bug.
> +        } else if registered_dma_handle != 0 {
>              dev_warn!(
>                  &self.device,
>                  "attempt to unregister a sysmem flush page that is not active\n"
> @@ -99,6 +104,12 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
>      }
>  }
>  
> +impl Drop for SysmemFlush {
> +    fn drop(&mut self) {
> +        let _ = self.bar.try_access_with(|bar| self.unregister(bar));

I feel that this is the wrong solution to the problem.

What we want is to *ensure* that the `Drop` of `SysmemFlush` runs while the
device is still bound.

It's not yet fully clear to me how we'd want to guarantee that, but one API that
might make sense is a `Devres` API that allows you to reference an existing
`Devres` and has driver-core make sure that the teardown happens in reverse
order. Then, inside `Drop`, the `bar` can still be accessed unconditionally.

Best,
Gary

> +    }
> +}
> +
>  pub(crate) struct FbRange(Range<u64>);
>  
>  impl FbRange {
> diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
> index 0f6fe9a1b955..5bad5a055b3b 100644
> --- a/drivers/gpu/nova-core/gpu.rs
> +++ b/drivers/gpu/nova-core/gpu.rs
> @@ -257,7 +257,12 @@ pub(crate) fn new<'a>(
>                      .inspect_err(|_| dev_err!(pdev, "GFW boot did not complete\n"))?;
>              },
>  
> -            sysmem_flush: SysmemFlush::register(pdev.as_ref(), bar, spec.chipset)?,
> +            sysmem_flush: SysmemFlush::register(
> +                pdev.as_ref(),
> +                devres_bar.clone(),
> +                bar,
> +                spec.chipset,
> +            )?,
>  
>              gsp_falcon: Falcon::new(
>                  pdev.as_ref(),
>
> ---
> base-commit: a7a080bb4236ebe577b6776d940d1717912ff6dd
> change-id: 20260409-fix-systemflush-de66dc90378a
>
> Best regards,
> --  
> Eliot Courtney <ecourtney@nvidia.com>



* Claude review: gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
  2026-04-09 22:56 ` John Hubbard
  2026-04-10 15:57 ` Gary Guo
@ 2026-04-12  1:22 ` Claude Code Review Bot
  2026-04-12  1:22 ` Claude Code Review Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  1:22 UTC (permalink / raw)
  To: dri-devel-reviews

Overall Series Review

Subject: gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
Author: Eliot Courtney <ecourtney@nvidia.com>
Patches: 1
Reviewed: 2026-04-12T11:22:40.453075

---

This is a single-patch fix for a real bug in `nova-core`: a stale hardware registration. When `Gpu::new` fails after `SysmemFlush::register()` succeeds (e.g., during falcon creation or GSP boot), the sysmem flush page itself is freed by Rust's automatic drop of `CoherentHandle`, but the GPU hardware register still points to the now-freed DMA address. The GPU could then write to stale or reused memory.

The approach is sound: add a `Drop` impl that performs best-effort cleanup via revocable BAR access, while preserving the existing infallible manual `unregister()` path for normal teardown. The guard against double-unregister warnings (`registered_dma_handle != 0`) is a necessary and correct consequence of having both manual and automatic cleanup paths.

**Verdict: The patch is correct and should be accepted.** I have one minor observation below but no blocking issues.

---

---
Generated by Claude Code Patch Reviewer


* Claude review: gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
                   ` (2 preceding siblings ...)
  2026-04-12  1:22 ` Claude review: " Claude Code Review Bot
@ 2026-04-12  1:22 ` Claude Code Review Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Claude Code Review Bot @ 2026-04-12  1:22 UTC (permalink / raw)
  To: dri-devel-reviews

Patch Review

**The bug being fixed is real and important.** Looking at `Gpu::new` (gpu.rs:249-275), the `try_pin_init!` macro initializes fields sequentially. If any field after `sysmem_flush` fails (e.g., `Falcon::new` at line 262, `Gsp::boot` at line 272), the already-initialized `SysmemFlush` is dropped. Without a `Drop` impl, the DMA page is freed but the GPU register `NV_PFB_NISO_FLUSH_SYSMEM_ADDR` still points to it.
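To make the failure mode concrete, here is a minimal userspace sketch in plain Rust. It does not use `try_pin_init!` or any kernel API; a fallible constructor with an early `?` return stands in for the sequential field initialization, and `FlushReg`, `Flush`, `failing_step` and the address `0x1000` are all hypothetical stand-ins rather than names from the driver. It only illustrates that a value created before the failing step is dropped on the error path, and that the "register" keeps the stale address unless that drop clears it.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for the hardware register holding the flush page address.
type FlushReg = Rc<Cell<u64>>;

/// Hypothetical stand-in for `SysmemFlush`.
struct Flush {
    reg: FlushReg,
    addr: u64,
    /// Models "has a cleaning `Drop` impl" (true) versus "no `Drop` impl" (false).
    cleanup_on_drop: bool,
}

impl Flush {
    /// Writes `addr` into the register, as registration would.
    fn register(reg: FlushReg, addr: u64, cleanup_on_drop: bool) -> Self {
        reg.set(addr);
        Flush { reg, addr, cleanup_on_drop }
    }
}

impl Drop for Flush {
    fn drop(&mut self) {
        // Only clear the register if cleanup is enabled and it still holds our address.
        if self.cleanup_on_drop && self.reg.get() == self.addr {
            self.reg.set(0);
        }
    }
}

struct Gpu {
    _flush: Flush,
    _falcon: u32,
}

/// Stand-in for a later initialization step that fails.
fn failing_step() -> Result<u32, &'static str> {
    Err("later init step failed")
}

fn try_new_gpu(reg: FlushReg, cleanup_on_drop: bool) -> Result<Gpu, &'static str> {
    let flush = Flush::register(reg, 0x1000, cleanup_on_drop);
    // On error, `?` returns early and `flush` is dropped right here.
    let falcon = failing_step()?;
    Ok(Gpu { _flush: flush, _falcon: falcon })
}

/// Returns the register value left behind after a failed initialization.
fn stale_address_after_failed_init(cleanup_on_drop: bool) -> u64 {
    let reg: FlushReg = Rc::new(Cell::new(0));
    let _ = try_new_gpu(reg.clone(), cleanup_on_drop);
    reg.get()
}
```

Under these assumptions, calling `stale_address_after_failed_init(false)` leaves the registered address in place, while `stale_address_after_failed_init(true)` leaves the register cleared.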

**Adding `Arc<Devres<Bar0>>` to `SysmemFlush`:**

```rust
pub(crate) struct SysmemFlush {
    chipset: Chipset,
    device: ARef<device::Device>,
+   bar: Arc<Devres<Bar0>>,
    page: CoherentHandle,
}
```

This is the right approach. `SysmemFlush` needs independent access to the BAR for its `Drop` impl, since it can't rely on receiving a `&Bar0` reference as a parameter during drop. The `Arc` clone is cheap and guarantees the `Devres<Bar0>` outlives the `SysmemFlush` even if `Gpu`'s own `bar` field is dropped first (Rust drops struct fields in declaration order, and `bar` is declared before `sysmem_flush` in the `Gpu` struct).
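The drop-order rule mentioned above can be checked with a small standalone sketch. The struct and field names below merely mirror the ones discussed in this review; `Noisy` and `Log` are hypothetical helpers, and nothing here is driver code. The fields are deliberately constructed in the opposite order to show that only the declaration order matters.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log recording the order in which values are dropped.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical helper that records its name when dropped.
struct Noisy {
    name: &'static str,
    log: Log,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Field names mirror the review text; `bar` is declared first.
#[allow(dead_code)]
struct Gpu {
    bar: Noisy,
    sysmem_flush: Noisy,
}

/// Builds a `Gpu`, drops it, and returns the observed drop order.
fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let gpu = Gpu {
        // Constructed in reverse on purpose; this does not affect drop order.
        sysmem_flush: Noisy { name: "sysmem_flush", log: log.clone() },
        bar: Noisy { name: "bar", log: log.clone() },
    };
    drop(gpu);
    let order = log.borrow().clone();
    order
}
```

Because `bar` would be dropped first, a `SysmemFlush` that borrowed the BAR from its parent could not use it in its own `Drop`; holding its own reference-counted handle avoids that dependency.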

**The `Drop` implementation:**

```rust
impl Drop for SysmemFlush {
    fn drop(&mut self) {
        let _ = self.bar.try_access_with(|bar| self.unregister(bar));
    }
}
```

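This is correct. `try_access_with` returns `None` if the `Devres` has been revoked (device unbound). In that case the BAR can no longer be reached from `Drop`, which is why the manual `unregister()` call at unbind time has to stay. On the `Gpu::new` failure path the device is still bound, so the access is expected to succeed. The `let _ = ...` appropriately discards the `Option<()>` result.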
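The revocable-access pattern can be illustrated with a toy userspace model. The `Revocable` type below is a hypothetical stand-in written for this sketch; it is not the kernel's `Devres` or `Revocable` implementation and only borrows the method name `try_access_with` to show the shape of the call: the closure runs while the resource is live, and `None` comes back once it has been revoked.

```rust
use std::sync::Mutex;

/// Toy stand-in for a revocable resource (an assumption, not the kernel type).
struct Revocable<T> {
    inner: Mutex<Option<T>>,
}

impl<T> Revocable<T> {
    fn new(value: T) -> Self {
        Revocable { inner: Mutex::new(Some(value)) }
    }

    /// Runs `f` on the resource if it is still live, otherwise returns `None`.
    fn try_access_with<R>(&self, f: impl FnOnce(&T) -> R) -> Option<R> {
        let guard = self.inner.lock().unwrap();
        (*guard).as_ref().map(f)
    }

    /// Revokes the resource; later accesses observe `None`.
    fn revoke(&self) {
        *self.inner.lock().unwrap() = None;
    }
}

/// Accesses the resource once before and once after revocation.
fn access_before_and_after_revoke() -> (Option<u32>, Option<u32>) {
    let bar = Revocable::new(0xabcd_u32);
    let before = bar.try_access_with(|v| *v);
    bar.revoke();
    let after = bar.try_access_with(|v| *v);
    (before, after)
}
```

In this model a `Drop` impl that calls `try_access_with` simply does nothing once the resource is gone, which matches the "best-effort" wording in the patch's doc comment.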

**The `registered_dma_handle != 0` guard:**

```rust
-        } else {
-            // Another page has been registered after us for some reason - warn as this is a bug.
+        } else if registered_dma_handle != 0 {
```

This is a necessary change to avoid a spurious warning when `unregister()` runs a second time. After `unbind()` calls `unregister()` (writing 0 to the register), a subsequent `Drop` that can still access the BAR would read 0, which doesn't match `self.page.dma_handle()`. Without this guard, it would incorrectly warn "attempt to unregister a sysmem flush page that is not active." The new behavior correctly distinguishes three cases:
1. Register matches our page → unregister (write 0)
2. Register holds a different non-zero address → warn (genuine bug)
3. Register is 0 → silently do nothing (already cleaned up)
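The three cases above can be written as a small pure function. This is only a sketch of the decision logic; `UnregisterAction` and `unregister_action` are hypothetical names introduced here, and the real code performs the register read and write through the HAL instead of returning an enum.

```rust
/// What `unregister` should do, given the register contents (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum UnregisterAction {
    /// The register holds our page: write 0 to clear it.
    Clear,
    /// The register holds some other non-zero address: warn, this is a bug.
    Warn,
    /// The register is already 0: nothing to do.
    Nothing,
}

/// Decides the action from the registered address and our own page address.
fn unregister_action(registered: u64, ours: u64) -> UnregisterAction {
    if registered == ours {
        UnregisterAction::Clear
    } else if registered != 0 {
        UnregisterAction::Warn
    } else {
        UnregisterAction::Nothing
    }
}
```

A manual unregister followed by a second attempt then maps to `Clear` followed by `Nothing`, so the second attempt stays silent.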

**Minor observation (not blocking):** The doc comment update is good but could be slightly more precise:

```rust
+/// Users should call [`Self::unregister`] before unloading to ensure unregistering is infallible.
+/// [`Drop`] performs a best-effort fallback using revocable BAR access.
```

The comment says "before unloading" but the actual call site is in `Gpu::unbind`, which happens at unbind time. "Before unloading" could be confused with module unload. A more precise wording might be "during unbind" but this is a very minor documentation nit.

**No issues with the `Gpu::new` call site change:**

```rust
-            sysmem_flush: SysmemFlush::register(pdev.as_ref(), bar, spec.chipset)?,
+            sysmem_flush: SysmemFlush::register(
+                pdev.as_ref(),
+                devres_bar.clone(),
+                bar,
+                spec.chipset,
+            )?,
```

The `devres_bar.clone()` correctly creates a new `Arc` reference for `SysmemFlush` to own. The original `devres_bar` is later moved into `bar: devres_bar` (line 274). If initialization fails before that point, both `Arc`s are dropped independently and the `Devres<Bar0>` refcount goes to zero, properly cleaning up.
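The shared-ownership argument can be sketched with the standard library's `Arc`. Note the assumption: the driver uses the kernel's own `Arc` type, which is a different implementation, so this only illustrates the general reference-counting behavior. `Mapping` is a hypothetical stand-in for the BAR mapping that flips a flag when it is actually released.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the BAR mapping; sets a flag when released.
struct Mapping(Arc<AtomicBool>);

impl Drop for Mapping {
    fn drop(&mut self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// Returns whether the mapping was released after the first owner went away,
/// and after the last owner went away.
fn released_after_each_owner() -> (bool, bool) {
    let released = Arc::new(AtomicBool::new(false));

    // The original handle, plus the clone handed to the flush object.
    let devres_bar = Arc::new(Mapping(released.clone()));
    let for_flush = Arc::clone(&devres_bar);

    // Dropping one owner only decrements the count.
    drop(devres_bar);
    let after_first = released.load(Ordering::SeqCst);

    // Dropping the last owner releases the mapping.
    drop(for_flush);
    let after_last = released.load(Ordering::SeqCst);

    (after_first, after_last)
}
```

The order in which the two handles are dropped does not matter; the mapping is released exactly once, when the last handle goes away.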

**Overall: Clean, well-reasoned fix with no correctness issues.**

---
Generated by Claude Code Patch Reviewer

