* [PATCH 0/2] gpu: host1x: syncpt_wait micro-optimizations
@ 2026-05-14 10:31 Tanmay Patil
2026-05-14 10:31 ` [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait() Tanmay Patil
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Tanmay Patil @ 2026-05-14 10:31 UTC (permalink / raw)
To: Thierry Reding, Mikko Perttunen
Cc: David Airlie, Simona Vetter, dri-devel, linux-tegra, linux-kernel,
Tanmay Patil
This series reduces the latency in host1x syncpoint wait path.
Patch 1 removes redundant MMIO reads in host1x_syncpt_wait().
Patch 2 skips the host1x_intr_update_hw_state() call in the ISR
when no fences remain.
Measured syncpoint wait latency (50000 samples):
Average latency: 12.2 us -> 9.4 us
99.99 pct latency: 62.96 us -> 36.58 us
Tanmay Patil (2):
gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait()
gpu: host1x: skip redundant HW state update
drivers/gpu/host1x/intr.c | 8 ++++++--
drivers/gpu/host1x/syncpt.c | 23 ++++++++++++++---------
2 files changed, 20 insertions(+), 11 deletions(-)
--
2.54.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait()
2026-05-14 10:31 [PATCH 0/2] gpu: host1x: syncpt_wait micro-optimizations Tanmay Patil
@ 2026-05-14 10:31 ` Tanmay Patil
2026-05-16 1:02 ` Claude review: " Claude Code Review Bot
2026-05-14 10:31 ` [PATCH 2/2] gpu: host1x: skip redundant HW state update Tanmay Patil
2026-05-16 1:02 ` Claude review: gpu: host1x: syncpt_wait micro-optimizations Claude Code Review Bot
2 siblings, 1 reply; 6+ messages in thread
From: Tanmay Patil @ 2026-05-14 10:31 UTC (permalink / raw)
To: Thierry Reding, Mikko Perttunen
Cc: David Airlie, Simona Vetter, dri-devel, linux-tegra, linux-kernel,
Tanmay Patil
In host1x_syncpt_wait(), the hardware syncpoint value was loaded
initially for expiry check, and then loaded a second time to
populate the caller's value pointer. Reuse a single load for
both purposes.
After dma_fence_wait_timeout(), the previous code reloaded the syncpoint
value for the expiry check, which is only required in the timeout case.
On success (i.e., return value > 0, or return value == 0 with zero
jiffies remaining), the ISR has already cached the value before
signaling the fence. The value pointer can therefore be populated using
the cached value using host1x_syncpt_read_min() without MMIO access.
Only the timeout path requires a fresh load, move host1x_syncpt_load()
under that path.
Measured Syncpoint wait latency (50000 samples):
Average latency: 12.2 us -> 10.6 us
99.99 pct latency: 62.96 us -> 51.90 us
Signed-off-by: Tanmay Patil <tanmayp@nvidia.com>
---
drivers/gpu/host1x/syncpt.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/host1x/syncpt.c b/drivers/gpu/host1x/syncpt.c
index acc7d82e0585..807c74fc6a0a 100644
--- a/drivers/gpu/host1x/syncpt.c
+++ b/drivers/gpu/host1x/syncpt.c
@@ -222,11 +222,12 @@ int host1x_syncpt_wait(struct host1x_syncpt *sp, u32 thresh, long timeout,
{
struct dma_fence *fence;
long wait_err;
+ u32 curr;
- host1x_hw_syncpt_load(sp->host, sp);
+ curr = host1x_syncpt_load(sp);
if (value)
- *value = host1x_syncpt_load(sp);
+ *value = curr;
if (host1x_syncpt_is_expired(sp, thresh))
return 0;
@@ -245,21 +246,25 @@ int host1x_syncpt_wait(struct host1x_syncpt *sp, u32 thresh, long timeout,
host1x_fence_cancel(fence);
dma_fence_put(fence);
- if (value)
- *value = host1x_syncpt_load(sp);
-
/*
* Don't rely on dma_fence_wait_timeout return value,
* since it returns zero both on timeout and if the
* wait completed with 0 jiffies left.
*/
- host1x_hw_syncpt_load(sp->host, sp);
- if (wait_err == 0 && !host1x_syncpt_is_expired(sp, thresh))
+ if (wait_err == 0 && !host1x_syncpt_is_expired(sp, thresh)) {
+ if (value)
+ *value = host1x_syncpt_load(sp);
+
return -EAGAIN;
- else if (wait_err < 0)
+ } else if (wait_err < 0) {
return wait_err;
- else
+ } else {
+ /* Success, read the value cached by ISR */
+ if (value)
+ *value = host1x_syncpt_read_min(sp);
+
return 0;
+ }
}
EXPORT_SYMBOL(host1x_syncpt_wait);
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/2] gpu: host1x: skip redundant HW state update
2026-05-14 10:31 [PATCH 0/2] gpu: host1x: syncpt_wait micro-optimizations Tanmay Patil
2026-05-14 10:31 ` [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait() Tanmay Patil
@ 2026-05-14 10:31 ` Tanmay Patil
2026-05-16 1:02 ` Claude review: " Claude Code Review Bot
2026-05-16 1:02 ` Claude review: gpu: host1x: syncpt_wait micro-optimizations Claude Code Review Bot
2 siblings, 1 reply; 6+ messages in thread
From: Tanmay Patil @ 2026-05-14 10:31 UTC (permalink / raw)
To: Thierry Reding, Mikko Perttunen
Cc: David Airlie, Simona Vetter, dri-devel, linux-tegra, linux-kernel,
Tanmay Patil
When the fence list is empty, host1x_intr_update_hw_state()
falls through to host1x_intr_disable_syncpt_intr()
which does two MMIO writes to disable the syncpoint
interrupt and clear its status.
The ISR has already disabled and acked the interrupt
before calling host1x_intr_handle_interrupt(), making
these two writes redundant. Skip the update_hw_state()
call if no fences remain.
Measured Syncpoint wait latency (50000 samples):
Average latency: 10.6 us -> 9.4 us
99.99 pct latency: 51.90 us -> 36.58 us
Signed-off-by: Tanmay Patil <tanmayp@nvidia.com>
---
drivers/gpu/host1x/intr.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/host1x/intr.c b/drivers/gpu/host1x/intr.c
index f77a678949e9..723297250768 100644
--- a/drivers/gpu/host1x/intr.c
+++ b/drivers/gpu/host1x/intr.c
@@ -92,8 +92,12 @@ void host1x_intr_handle_interrupt(struct host1x *host, unsigned int id)
host1x_fence_signal(fence);
}
- /* Re-enable interrupt if necessary */
- host1x_intr_update_hw_state(host, sp);
+ /*
+ * Re-enable interrupt if necessary. The ISR already disabled the interrupt,
+ * so if no fences remain, no update is needed.
+ */
+ if (!list_empty(&sp->fences.list))
+ host1x_intr_update_hw_state(host, sp);
spin_unlock(&sp->fences.lock);
}
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Claude review: gpu: host1x: syncpt_wait micro-optimizations
2026-05-14 10:31 [PATCH 0/2] gpu: host1x: syncpt_wait micro-optimizations Tanmay Patil
2026-05-14 10:31 ` [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait() Tanmay Patil
2026-05-14 10:31 ` [PATCH 2/2] gpu: host1x: skip redundant HW state update Tanmay Patil
@ 2026-05-16 1:02 ` Claude Code Review Bot
2 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 1:02 UTC (permalink / raw)
To: dri-devel-reviews
Overall Series Review
Subject: gpu: host1x: syncpt_wait micro-optimizations
Author: Tanmay Patil <tanmayp@nvidia.com>
Patches: 3
Reviewed: 2026-05-16T11:02:36.722039
---
This is a clean, well-motivated 2-patch micro-optimization series for the host1x syncpoint wait path. The measured latency improvements are solid (12.2us→9.4us average, ~42% reduction in p99.99) and the patches are logically ordered — Patch 1 eliminates redundant MMIO reads in `host1x_syncpt_wait()`, Patch 2 skips an unnecessary HW state update in the ISR when no fences remain.
Both patches are correct. The reasoning in the commit messages is sound and matches what the code does. The series is small, focused, and unlikely to introduce regressions.
**Verdict: Series looks good to merge.**
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 6+ messages in thread
* Claude review: gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait()
2026-05-14 10:31 ` [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait() Tanmay Patil
@ 2026-05-16 1:02 ` Claude Code Review Bot
0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 1:02 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Summary:** Eliminates redundant MMIO register reads in the syncpoint wait path by (a) reusing the initial `host1x_syncpt_load()` result for both the expiry check and the caller's value pointer, and (b) after `dma_fence_wait_timeout()`, only issuing an MMIO load in the timeout path, using the ISR-cached `host1x_syncpt_read_min()` on success.
**Analysis:**
The original code called `host1x_hw_syncpt_load()` to update the cache, then `host1x_syncpt_load()` to read it again — two MMIO reads where one suffices. The patch replaces this with:
```c
curr = host1x_syncpt_load(sp);
if (value)
*value = curr;
```
This is correct. `host1x_syncpt_load()` already calls `host1x_hw_syncpt_load()` internally (`syncpt.c:187`), so the original `host1x_hw_syncpt_load` + `host1x_syncpt_load` pair was indeed doing two MMIO reads.
For the post-wait path, the key insight is sound: on success, the ISR in `host1x_intr_handle_interrupt()` calls `host1x_syncpt_load()` at line 81 of `intr.c`, which updates `min_val` via `atomic_cmpxchg`. So `host1x_syncpt_read_min()` (which just does `atomic_read(&sp->min_val)` with an `smp_rmb()`) returns the value the ISR already read from hardware.
The timeout disambiguation path (`wait_err == 0 && !host1x_syncpt_is_expired`) still does an MMIO load, which is correct — we need a fresh hardware read to distinguish real timeout from zero-jiffies-left completion. Note that `host1x_syncpt_is_expired()` reads `min_val` (line 280), so the `host1x_syncpt_load()` that follows in the `if (value)` block does perform a fresh MMIO read for the caller. However, there's a subtlety: `host1x_syncpt_is_expired()` uses the cached `min_val` — it does not do an MMIO read itself. The original code called `host1x_hw_syncpt_load()` *before* `host1x_syncpt_is_expired()` to refresh the cache. The new code calls `host1x_syncpt_is_expired()` without a preceding MMIO load.
**Potential concern:** After `dma_fence_wait_timeout()` returns 0, the cached `min_val` may have been updated by the ISR, or it may still hold the pre-wait value. If the ISR fired and signaled the fence but `dma_fence_wait_timeout` returned 0 (because jiffies ran out), then `min_val` was updated by the ISR and `host1x_syncpt_is_expired()` will correctly see it as expired. If the ISR didn't fire (true timeout), `min_val` was not updated. But the syncpoint may have been incremented by HW without triggering the interrupt (since the threshold might not have been reached for a different fence, or the increment may have happened after the interrupt was serviced). In the original code, `host1x_hw_syncpt_load()` refreshed the cache before the expiry check, catching this case.
With the patch, if `wait_err == 0` and the syncpoint actually did reach `thresh` but `min_val` wasn't updated (because the ISR didn't run for this specific event after the fence was cancelled), `host1x_syncpt_is_expired()` would return false and we'd incorrectly return `-EAGAIN`. However, looking more carefully: if the fence was not signaled by the ISR, then the syncpoint truly did not reach the threshold during the wait window — that's a legitimate timeout. The fence monitors exactly this threshold, and the ISR calls `host1x_syncpt_load()` to refresh `min_val` before checking. If the ISR didn't signal the fence, the value hasn't been reached. The `host1x_fence_cancel()` call on line 246 removes the fence from the list but doesn't refresh `min_val`. So the question is: can the syncpoint reach `thresh` after the ISR fires but before `host1x_fence_cancel`? If so, we'd miss it.
Actually, reconsidering: the old code's `host1x_hw_syncpt_load()` was a safety net for exactly this race — a last-chance MMIO read to catch increments that arrived after the ISR's snapshot. The new code relies solely on the ISR's cached value. In the `wait_err == 0` path, if the fence timed out, there's a small window where the syncpoint could have been incremented after the ISR last ran. The old code would catch this; the new code would miss it and return `-EAGAIN` instead of 0.
**However**, this is arguably not a correctness bug — it's a timeout, and the caller gets `-EAGAIN`, which is the expected timeout behavior. The syncpoint value they get (`host1x_syncpt_load(sp)` in the EAGAIN path) would be correct. This edge case existed in the original code too — it's inherently racy. The original code just had a slightly wider correctness window. In practice the difference is negligible: if the syncpoint reaches the threshold during the narrow window between the ISR snapshot and the `host1x_syncpt_is_expired` check, reporting timeout is acceptable since we're already at the timeout boundary.
**Minor nit:** The `*value` assignment in the `-EAGAIN` path does an MMIO read even though we're about to return an error. This is fine for debugging purposes, but the caller might not even look at `*value` on error. Not worth changing though.
**Verdict: Patch is correct and well-reasoned. The behavioral change in the timeout-race corner case is acceptable.**
---
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 6+ messages in thread
* Claude review: gpu: host1x: skip redundant HW state update
2026-05-14 10:31 ` [PATCH 2/2] gpu: host1x: skip redundant HW state update Tanmay Patil
@ 2026-05-16 1:02 ` Claude Code Review Bot
0 siblings, 0 replies; 6+ messages in thread
From: Claude Code Review Bot @ 2026-05-16 1:02 UTC (permalink / raw)
To: dri-devel-reviews
Patch Review
**Summary:** In `host1x_intr_handle_interrupt()`, skip the call to `host1x_intr_update_hw_state()` when no fences remain, since the ISR has already disabled and acknowledged the interrupt.
**Analysis:**
Looking at `process_32_syncpts()` in `intr_hw.c:14-26`:
```c
host1x_sync_writel(host, val, HOST1X_SYNC_SYNCPT_THRESH_INT_DISABLE(reg_offset));
host1x_sync_writel(host, val, HOST1X_SYNC_SYNCPT_THRESH_CPU0_INT_STATUS(reg_offset));
for_each_set_bit(id, &val, 32)
host1x_intr_handle_interrupt(host, reg_offset * 32 + id);
```
The ISR writes to `INT_DISABLE` and `INT_STATUS` before calling `host1x_intr_handle_interrupt()`. When the fence list is empty after processing, `host1x_intr_update_hw_state()` would call `host1x_intr_disable_syncpt_intr()` (`intr_hw.c:138-145`), which does exactly the same two writes:
```c
host1x_sync_writel(host, BIT(id % 32), HOST1X_SYNC_SYNCPT_THRESH_INT_DISABLE(id / 32));
host1x_sync_writel(host, BIT(id % 32), HOST1X_SYNC_SYNCPT_THRESH_CPU0_INT_STATUS(id / 32));
```
So the optimization is correct: when the list is empty, `host1x_intr_update_hw_state` would just repeat the disable+ack that already happened. The patch skips this redundant work:
```c
if (!list_empty(&sp->fences.list))
host1x_intr_update_hw_state(host, sp);
```
When fences remain, `host1x_intr_update_hw_state` re-programs the threshold to the next fence's value and re-enables the interrupt — this path is still taken.
**One consideration:** If a new fence is added between the ISR's disable/ack and the `list_empty` check, could we miss the re-enable? No — `host1x_intr_add_fence_locked()` calls `host1x_intr_update_hw_state()` itself, and it takes `sp->fences.lock`, which `host1x_intr_handle_interrupt` holds during this check. So there's no race.
**Verdict: Clean, correct optimization. No issues.**
---
Generated by Claude Code Patch Reviewer
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-05-16 1:02 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 10:31 [PATCH 0/2] gpu: host1x: syncpt_wait micro-optimizations Tanmay Patil
2026-05-14 10:31 ` [PATCH 1/2] gpu: host1x: skip redundant syncpoint loads in host1x_syncpt_wait() Tanmay Patil
2026-05-16 1:02 ` Claude review: " Claude Code Review Bot
2026-05-14 10:31 ` [PATCH 2/2] gpu: host1x: skip redundant HW state update Tanmay Patil
2026-05-16 1:02 ` Claude review: " Claude Code Review Bot
2026-05-16 1:02 ` Claude review: gpu: host1x: syncpt_wait micro-optimizations Claude Code Review Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox