[PATCH v7 11/29] drm/sched: Favour interactive clients slightly

public inbox for drm-ai-reviews@public-inbox.freedesktop.org
 help / color / mirror / Atom feed

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
To: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: kernel-dev@igalia.com, intel-xe@lists.freedesktop.org,
	Danilo Krummrich <dakr@kernel.org>,
	Philipp Stanner <phasta@kernel.org>,
	Tvrtko Ursulin <tvrtko.ursulin@igalia.com>,
	Christian König <christian.koenig@amd.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Subject: [PATCH v7 11/29] drm/sched: Favour interactive clients slightly
Date: Fri,  6 Mar 2026 16:34:27 +0000	[thread overview]
Message-ID: <20260306163445.97243-12-tvrtko.ursulin@igalia.com> (raw)
In-Reply-To: <20260306163445.97243-1-tvrtko.ursulin@igalia.com>

GPUs do not always implement preemption and DRM scheduler definitely
does not support it at the front end scheduling level. This means
execution quanta can be quite long and is controlled by userspace,
consequence of which is picking the "wrong" entity to run can have a
larger negative effect than it would have with a virtual runtime based CPU
scheduler.

Another important consideration is that rendering clients often have
shallow submission queues, meaning they will be entering and exiting the
scheduler's runnable queue often.

Relevant scenario here is what happens when an entity re-joins the
runnable queue with other entities already present. One cornerstone of the
virtual runtime algorithm is to let it re-join at the head and rely on the
virtual runtime accounting and timeslicing to sort it out.

However, as explained above, this may not work perfectly in the GPU world.
Entity could always get to overtake the existing entities, or not,
depending on the submission order and rbtree equal key insertion
behaviour.

Allow interactive jobs to overtake entities already queued up for the
limited case when interactive entity is re-joining the queue after
being idle.

This gives more opportunity for the compositors to have their rendering
executed before the GPU hogs even if they have been configured with the
same scheduling priority.

To classify a client as interactive we look at its average job duration
versus the average for the whole scheduler. We can track this easily by
plugging into the existing job runtime tracking and applying the
exponential moving average window on the past submissions. Then, all other
things being equal, we let the more interactive jobs go first.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c   | 13 ++++++++++---
 drivers/gpu/drm/scheduler/sched_internal.h |  5 ++++-
 drivers/gpu/drm/scheduler/sched_main.c     |  8 +++++++-
 drivers/gpu/drm/scheduler/sched_rq.c       | 22 ++++++++++++++++++++--
 include/drm/gpu_scheduler.h                |  5 +++++
 5 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 94438f80b00d..add99c609faf 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -60,6 +60,7 @@ static struct drm_sched_entity_stats *drm_sched_entity_stats_new(void)
 
 	kref_init(&stats->kref);
 	spin_lock_init(&stats->lock);
+	ewma_drm_sched_avgtime_init(&stats->avg_job_us);
 
 	return stats;
 }
@@ -69,19 +70,25 @@ static struct drm_sched_entity_stats *drm_sched_entity_stats_new(void)
  * @job: Scheduler job to account.
  *
  * Accounts the execution time of @job to its respective entity stats object.
+ *
+ * Return: Job's real duration in micro seconds.
  */
-void drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
+ktime_t drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
 {
 	struct drm_sched_entity_stats *stats = job->entity_stats;
 	struct drm_sched_fence *s_fence = job->s_fence;
-	ktime_t start, end;
+	ktime_t start, end, duration;
 
 	start = dma_fence_timestamp(&s_fence->scheduled);
 	end = dma_fence_timestamp(&s_fence->finished);
+	duration = ktime_sub(end, start);
 
 	spin_lock(&stats->lock);
-	stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
+	stats->runtime = ktime_add(stats->runtime, duration);
+	ewma_drm_sched_avgtime_add(&stats->avg_job_us, ktime_to_us(duration));
 	spin_unlock(&stats->lock);
+
+	return duration;
 }
 
 /**
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index a682e0cfbf25..ba5a9d9786e1 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -14,6 +14,7 @@
  * @runtime: time entity spent on the GPU.
  * @prev_runtime: previous @runtime used to get the runtime delta.
  * @vruntime: virtual runtime as accumulated by the fair algorithm.
+ * @avg_job_us: average job duration.
  *
  * Because jobs and entities have decoupled lifetimes, ie. we cannot access the
  * entity once the job has been de-queued, and we do need know how much GPU time
@@ -26,6 +27,8 @@ struct drm_sched_entity_stats {
 	ktime_t		runtime;
 	ktime_t		prev_runtime;
 	ktime_t		vruntime;
+
+	struct ewma_drm_sched_avgtime   avg_job_us;
 };
 
 /* Used to choose between FIFO and RR job-scheduling */
@@ -146,6 +149,6 @@ drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats)
 	kref_put(&stats->kref, drm_sched_entity_stats_release);
 }
 
-void drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job);
+ktime_t drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job);
 
 #endif
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 0aca41b4e334..337db7b1e688 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1004,7 +1004,12 @@ static void drm_sched_free_job_work(struct work_struct *w)
 	struct drm_sched_job *job;
 
 	while ((job = drm_sched_get_finished_job(sched))) {
-		drm_sched_entity_stats_job_add_gpu_time(job);
+		ktime_t duration = drm_sched_entity_stats_job_add_gpu_time(job);
+
+		/* Serialized by the worker. */
+		ewma_drm_sched_avgtime_add(&sched->avg_job_us,
+					   ktime_to_us(duration));
+
 		sched->ops->free_job(job);
 	}
 
@@ -1165,6 +1170,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
+	ewma_drm_sched_avgtime_init(&sched->avg_job_us);
 
 	sched->ready = true;
 	return 0;
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index 2fde309d02a6..f8bc60026a73 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -161,13 +161,21 @@ drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
 				  enum drm_sched_priority rq_prio)
 {
 	struct drm_sched_entity_stats *stats = entity->stats;
+	struct drm_gpu_scheduler *sched = entity->rq->sched;
 	enum drm_sched_priority prio = entity->priority;
+	unsigned long avg_us, sched_avg_us;
 	ktime_t vruntime;
 
 	BUILD_BUG_ON(DRM_SCHED_PRIORITY_NORMAL < DRM_SCHED_PRIORITY_HIGH);
 
 	spin_lock(&stats->lock);
 	vruntime = stats->vruntime;
+	avg_us = ewma_drm_sched_avgtime_read(&stats->avg_job_us);
+	/*
+	 * Unlocked read of the scheduler average is fine since it is just
+	 * heuristics and data type is a natural word size.
+	 */
+	sched_avg_us = ewma_drm_sched_avgtime_read(&sched->avg_job_us);
 
 	/*
 	 * Special handling for entities which were picked from the top of the
@@ -177,14 +185,24 @@ drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
 		if (prio > rq_prio) {
 			/*
 			 * Lower priority should not overtake higher when re-
-			 * joining at the top of the queue.
+			 * joining at the top of the queue so push it back
+			 * somewhere behind the "middle" of the run-queue,
+			 * proportional to the scheduler and entity average job
+			 * durations.
 			 */
-			vruntime = ns_to_ktime(prio - rq_prio);
+			vruntime = us_to_ktime((1 + avg_us + sched_avg_us) <<
+					       vruntime_shift[prio]);
 		} else if (prio < rq_prio) {
 			/*
 			 * Higher priority can go first.
 			 */
 			vruntime = -ns_to_ktime(rq_prio - prio);
+		} else {
+			/* Favour entity with shorter jobs (interactivity). */
+			if (avg_us <= sched_avg_us)
+				vruntime = -ns_to_ktime(1);
+			else
+				vruntime = ns_to_ktime(1);
 		}
 	}
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index dd514a413156..826de4cbbf2d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -25,11 +25,14 @@
 #define _DRM_GPU_SCHEDULER_H_
 
 #include <drm/spsc_queue.h>
+#include <linux/average.h>
 #include <linux/dma-fence.h>
 #include <linux/completion.h>
 #include <linux/xarray.h>
 #include <linux/workqueue.h>
 
+DECLARE_EWMA(drm_sched_avgtime, 6, 4);
+
 #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
 
 /**
@@ -582,6 +585,7 @@ struct drm_sched_backend_ops {
  * @job_id_count: used to assign unique id to the each job.
  * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
  * @timeout_wq: workqueue used to queue @work_tdr
+ * @avg_job_us: Average job duration.
  * @work_run_job: work which calls run_job op of each scheduler.
  * @work_free_job: work which calls free_job op of each scheduler.
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
@@ -613,6 +617,7 @@ struct drm_gpu_scheduler {
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*submit_wq;
 	struct workqueue_struct		*timeout_wq;
+	struct ewma_drm_sched_avgtime   avg_job_us;
 	struct work_struct		work_run_job;
 	struct work_struct		work_free_job;
 	struct delayed_work		work_tdr;
-- 
2.52.0

next prev parent reply	other threads:[~2026-03-06 16:35 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-06 16:34 [PATCH v7 00/29] Fair(er) DRM scheduler Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 01/29] drm/sched: Disallow initializing entities with no schedulers Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 02/29] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 03/29] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 04/29] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
2026-03-08 10:10   ` kernel test robot
2026-03-06 16:34 ` [PATCH v7 05/29] drm/sched: Add some more " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 06/29] drm/sched: Implement RR via FIFO Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 07/29] drm/sched: Free all finished jobs at once Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 08/29] drm/sched: Account entity GPU time Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 09/29] drm/sched: Remove idle entity from tree Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 10/29] drm/sched: Add fair scheduling policy Tvrtko Ursulin
2026-03-06 16:34 ` Tvrtko Ursulin [this message]
2026-03-06 16:34 ` [PATCH v7 12/29] drm/sched: Switch default policy to fair Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 13/29] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 14/29] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 15/29] accel/amdxdna: Remove drm_sched_init_args->num_rqs usage Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 16/29] accel/rocket: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 17/29] accel/ethosu: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 18/29] drm/amdgpu: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 19/29] drm/etnaviv: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 20/29] drm/imagination: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 21/29] drm/lima: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 22/29] drm/msm: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 23/29] drm/nouveau: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 24/29] drm/panfrost: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 25/29] drm/panthor: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 26/29] drm/sched: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 27/29] drm/v3d: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 28/29] drm/xe: " Tvrtko Ursulin
2026-03-06 16:34 ` [PATCH v7 29/29] drm/sched: Remove drm_sched_init_args->num_rqs Tvrtko Ursulin
2026-03-08 22:37 ` Claude review: Fair(er) DRM scheduler Claude Code Review Bot

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:94438f80b00 dfblob:add99c609fa dfblob:a682e0cfbf2
dfblob:ba5a9d9786e dfblob:0aca41b4e33 dfblob:337db7b1e68
dfblob:2fde309d02a dfblob:f8bc60026a7 dfblob:dd514a41315
dfblob:826de4cbbf2 )
 OR (
bs:"[PATCH v7 11/29] drm/sched: Favour interactive clients slightly" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260306163445.97243-12-tvrtko.ursulin@igalia.com \
    --to=tvrtko.ursulin@igalia.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=kernel-dev@igalia.com \
    --cc=matthew.brost@intel.com \
    --cc=phasta@kernel.org \
    --cc=pierre-eric.pelloux-prayer@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox