From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 6C138CD6E5D for ; Tue, 2 Jun 2026 11:11:04 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CD88D10EF32; Tue, 2 Jun 2026 11:11:03 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=igalia.com header.i=@igalia.com header.b="jgb+fxYg"; dkim-atps=neutral Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4F69B10E046 for ; Tue, 2 Jun 2026 11:11:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=MIME-Version:Content-Transfer-Encoding:Content-Type:References: In-Reply-To:Date:Cc:To:From:Subject:Message-ID:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=7RgJfVQx32/rYLUVjSWUv2ZbHkzbURK8tDCps/DhjpU=; b=jgb+fxYgPPBFYDLQQSZO8rZAhP biVEEcJFEXtwDCWzczu9O1Trae2GdYkBGwNd1nOC3Yzi5v4DedRbO+MCs09A2FBHFHZQDYh396uQu XssLFATCSkg2FoJUMloh0F/z7/7U+Za2fvG23R3axJ10ge/w5YSk6lYcrsP4n/LvPi/wHKJ2di8pI T9YKMw+Tsul4sGiVzfnL1keQyShuZOdLkiZKHOQwPF5qU34hIAq8a8pJ1LeCG/CSwrEZPD53kqrGA 03tiJUYlKqK+MDMzuOLJbMqkhEP05ZmQD8GIEVDulCMXMH1U4E4d6hR92XGN7wbj8+w9VRCm94o8Z AzOr42GA==; Received: from static-234-112-85-188.ipcom.comunitel.net ([188.85.112.234] helo=[192.168.0.17]) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1wUN1O-00BfPi-EM; Tue, 02 Jun 2026 13:10:58 +0200 Message-ID: Subject: Re: [PATCH v2 3/4] drm/v3d: Serialize jobs across queues when a perfmon is attached From: Iago Toral To: =?ISO-8859-1?Q?Ma=EDra?= Canal , Melissa Wen , Tvrtko Ursulin , David Airlie , Simona Vetter Cc: kernel-dev@igalia.com, dri-devel@lists.freedesktop.org Date: Tue, 02 Jun 2026 13:10:47 +0200 In-Reply-To: <44cbf6ec-3134-44d5-b37c-0c99ac55f678@igalia.com> References: <20260531-v3d-perfmon-lifetime-v2-0-60ed4485a203@igalia.com> <20260531-v3d-perfmon-lifetime-v2-3-60ed4485a203@igalia.com> <7a9d503de81b403585962f1c25dd0b506a6eb3f2.camel@igalia.com> <44cbf6ec-3134-44d5-b37c-0c99ac55f678@igalia.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.52.3-0ubuntu1.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Hi Ma=C3=ADra, thanks for clarifying this. For the series: Reviewed-by: Iago Toral Quiroga Iago El mar, 02-06-2026 a las 07:49 -0300, Ma=C3=ADra Canal escribi=C3=B3: > Hi Iago, >=20 > On 02/06/26 04:35, Iago Toral wrote: > > Hi Ma=C3=ADra, this pactch looks good to me but I am wondering if we > > should > > do anything to clean up the perfmon state when a gpu reset happens: > > i.e. what would happen if a perfmon job resets? Would we continue > > to > > schedule after-reset jobs to wait on the fence of the reset job? >=20 > If the job resets, the design around the DRM scheduler will protect > us. >=20 > The DRM scheduler guarantees that a finished fence is always signaled > eventually. On a reset, drm_sched_start() signals the pending jobs' > finished fences (with -ECANCELED for the guilty job). So if a > perfmon-carrying job is the one that times out, the done_fence we > stored for serialization still gets signaled when resubmitted (check > the > first if-clause in the run_job() callbacks). >=20 > drm_sched_job_add_dependency() releases the dependent job as soon as > the > fence signals, regardless of error, so any job submitted behind the > reset perfmon job is released and runs. The serialization window > simply > closes when the job dies. >=20 > Best regards, > - Ma=C3=ADra >=20 > >=20 > > Iago > >=20 > > El dom, 31-05-2026 a las 17:18 -0300, Ma=C3=ADra Canal escribi=C3=B3: > > > A non-global perfmon is meant to count events generated by a > > > specific > > > submission, but the scheduler can run jobs from different queues > > > concurrently on the same V3D core. Without explicit > > > serialization, an > > > unrelated job running in parallel with a perfmon-carrying job > > > pollutes > > > the counters and generates unusable results. > > >=20 > > > To address such issue, we must enforce cross-queue serialization > > > when > > > we > > > detect a perfmon-carrying submission. It's possible to implement > > > serialization by enforcing two rules: > > >=20 > > > =C2=A0=C2=A0 1. A job that carries a non-global perfmon must wait for= every > > > job > > > =C2=A0=C2=A0=C2=A0=C2=A0 currently in-flight across all HW queues to = finish. > > >=20 > > > =C2=A0=C2=A0 2. While a perfmon-carrying job is still in-flight, all > > > subsequently > > > =C2=A0=C2=A0=C2=A0=C2=A0 submitted jobs must wait for it. > > >=20 > > > Note that serialization is not needed in the global perfmon case, > > > as > > > the > > > global perfmon tracks activity from all jobs, so concurrency is > > > desirable. > > >=20 > > > Therefore, check if serialization is needed during job submission > > > and > > > if > > > so, attach fence dependences to enforce cross-queue > > > serialization. > > >=20 > > > Signed-off-by: Ma=C3=ADra Canal > > > --- > > > =C2=A0=C2=A0drivers/gpu/drm/v3d/v3d_drv.h=C2=A0=C2=A0=C2=A0=C2=A0 | 3= 5 ++++++++++++++++++---- > > > =C2=A0=C2=A0drivers/gpu/drm/v3d/v3d_gem.c=C2=A0=C2=A0=C2=A0=C2=A0 |= =C2=A0 3 ++ > > > =C2=A0=C2=A0drivers/gpu/drm/v3d/v3d_perfmon.c |=C2=A0 6 ++++ > > > =C2=A0=C2=A0drivers/gpu/drm/v3d/v3d_submit.c=C2=A0 | 63 > > > +++++++++++++++++++++++++++++++++++++++ > > > =C2=A0=C2=A04 files changed, 101 insertions(+), 6 deletions(-) > > >=20 > > > diff --git a/drivers/gpu/drm/v3d/v3d_drv.h > > > b/drivers/gpu/drm/v3d/v3d_drv.h > > > index 51486af68cf4..cdf4926d51f2 100644 > > > --- a/drivers/gpu/drm/v3d/v3d_drv.h > > > +++ b/drivers/gpu/drm/v3d/v3d_drv.h > > > @@ -74,11 +74,13 @@ struct v3d_queue_state { > > > =C2=A0=C2=A0 spinlock_t queue_lock; > > > =C2=A0=C2=A0}; > > > =C2=A0=20 > > > -/* Performance monitor object. The perform lifetime is > > > controlled by > > > userspace > > > - * using perfmon related ioctls. A perfmon can be attached to a > > > submit_cl > > > - * request, and when this is the case, HW perf counters will be > > > activated just > > > - * before the submit_cl is submitted to the GPU and disabled > > > when > > > the job is > > > - * done. This way, only events related to a specific job will be > > > counted. > > > +/* Performance monitor object > > > + * > > > + * The performance monitor (perfmon) lifetime is controlled by > > > userspace using > > > + * perfmon related ioctls. A perfmon can be attached to a CL or > > > CSD > > > submission > > > + * request, and when it is, HW performance counters will be > > > activated just > > > + * before the job is submitted to the GPU and disabled when the > > > job > > > is done. > > > + * This way, only events related to a specific submission will > > > be > > > counted. > > > =C2=A0=C2=A0 */ > > > =C2=A0=C2=A0struct v3d_perfmon { > > > =C2=A0=C2=A0 /* Tracks the number of users of the perfmon, when this > > > counter reaches > > > @@ -167,13 +169,31 @@ struct v3d_dev { > > > =C2=A0=20 > > > =C2=A0=C2=A0 struct v3d_queue_state queue[V3D_MAX_QUEUES]; > > > =C2=A0=20 > > > - /* Tracks the performance monitor state. */ > > > + /* > > > + * Tracks the performance monitor state and consistency. > > > + * > > > + * When a non-global perfmon is attached to a job, the > > > scheduler must > > > + * not run any other job on the HW concurrently > > > (otherwise, > > > the > > > + * counters would be polluted by unrelated work). > > > + */ > > > =C2=A0=C2=A0 struct { > > > =C2=A0=C2=A0 /* Protects @active. */ > > > =C2=A0=C2=A0 spinlock_t lock; > > > =C2=A0=20 > > > =C2=A0=C2=A0 /* Perfmon currently programmed in HW (or NULL > > > if > > > none). */ > > > =C2=A0=C2=A0 struct v3d_perfmon *active; > > > + > > > + /* Finished fence of the most recently submitted > > > job > > > that > > > + * opened a serialization window (i.e. a job > > > with a > > > non-global > > > + * perfmon attached). > > > + */ > > > + struct dma_fence *fence; > > > + > > > + /* Finished fence of the most recently submitted > > > job > > > on each HW > > > + * queue. Used so that a new perfmon-carrying > > > job > > > can depend on > > > + * every job currently in-flight across all > > > queues. > > > + */ > > > + struct dma_fence *last_hw_fence[V3D_MAX_QUEUES]; > > > =C2=A0=C2=A0 } perfmon_state; > > > =C2=A0=20 > > > =C2=A0=C2=A0 /* Protects bo_stats */ > > > @@ -319,6 +339,9 @@ struct v3d_job { > > > =C2=A0=20 > > > =C2=A0=C2=A0 struct v3d_dev *v3d; > > > =C2=A0=20 > > > + /* The queue that the job was submitted on. */ > > > + enum v3d_queue queue; > > > + > > > =C2=A0=C2=A0 /* This is the array of BOs that were looked up at the > > > start > > > =C2=A0=C2=A0 * of submission. > > > =C2=A0=C2=A0 */ > > > diff --git a/drivers/gpu/drm/v3d/v3d_gem.c > > > b/drivers/gpu/drm/v3d/v3d_gem.c > > > index 9487ab7acd03..0b415486f5b4 100644 > > > --- a/drivers/gpu/drm/v3d/v3d_gem.c > > > +++ b/drivers/gpu/drm/v3d/v3d_gem.c > > > @@ -363,8 +363,11 @@ v3d_gem_destroy(struct drm_device *dev) > > > =C2=A0=C2=A0 for (q =3D 0; q < V3D_MAX_QUEUES; q++) { > > > =C2=A0=C2=A0 WARN_ON(v3d->queue[q].active_job); > > > =C2=A0=C2=A0 v3d_stats_put(v3d->queue[q].stats); > > > + dma_fence_put(v3d- > > > >perfmon_state.last_hw_fence[q]); > > > =C2=A0=C2=A0 } > > > =C2=A0=20 > > > + dma_fence_put(v3d->perfmon_state.fence); > > > + > > > =C2=A0=C2=A0 drm_mm_takedown(&v3d->mm); > > > =C2=A0=20 > > > =C2=A0=C2=A0 dma_free_coherent(v3d->drm.dev, 4096 * 1024, (void > > > *)v3d- > > > > pt, > > > diff --git a/drivers/gpu/drm/v3d/v3d_perfmon.c > > > b/drivers/gpu/drm/v3d/v3d_perfmon.c > > > index 3ad0f022753c..07dab7fb3060 100644 > > > --- a/drivers/gpu/drm/v3d/v3d_perfmon.c > > > +++ b/drivers/gpu/drm/v3d/v3d_perfmon.c > > > @@ -275,6 +275,12 @@ void v3d_perfmon_start(struct v3d_dev *v3d, > > > struct v3d_perfmon *perfmon) > > > =C2=A0=C2=A0 if (!perfmon || v3d->global_perfmon) > > > =C2=A0=C2=A0 return; > > > =C2=A0=20 > > > + /* Cross-queue serialization should have drained any > > > previous perfmon > > > + * job before this one runs. > > > + */ > > > + if (WARN_ON_ONCE(v3d->perfmon_state.active)) > > > + return; > > > + > > > =C2=A0=C2=A0 if (!pm_runtime_get_if_active(v3d->drm.dev)) > > > =C2=A0=C2=A0 return; > > > =C2=A0=20 > > > diff --git a/drivers/gpu/drm/v3d/v3d_submit.c > > > b/drivers/gpu/drm/v3d/v3d_submit.c > > > index 4c526aafc4e0..46fc88bacc95 100644 > > > --- a/drivers/gpu/drm/v3d/v3d_submit.c > > > +++ b/drivers/gpu/drm/v3d/v3d_submit.c > > > @@ -234,6 +234,7 @@ v3d_submit_add_job(struct v3d_submit *submit, > > > void **container, size_t size, > > > =C2=A0=C2=A0 job->v3d =3D v3d; > > > =C2=A0=C2=A0 job->free =3D free; > > > =C2=A0=C2=A0 job->file_priv =3D v3d_priv; > > > + job->queue =3D queue; > > > =C2=A0=20 > > > =C2=A0=C2=A0 ret =3D drm_sched_job_init(&job->base, &v3d_priv- > > > > sched_entity[queue], > > > =C2=A0=C2=A0 1, v3d_priv, submit->file_priv- > > > > client_id); > > > @@ -293,6 +294,62 @@ v3d_attach_perfmon_to_jobs(struct v3d_submit > > > *submit, u32 perfmon_id) > > > =C2=A0=C2=A0 return 0; > > > =C2=A0=C2=A0} > > > =C2=A0=20 > > > +/* > > > + * Prepare fences to enforce job serialization when a perfmon is > > > active. A job > > > + * that carries a non-global perfmon must wait for every job > > > currently in-flight > > > + * across all HW queues to finish, otherwise concurrent > > > unrelated > > > work on the > > > + * same core would pollute the performance counters. > > > Symmetrically, > > > while such a > > > + * job is still in-flight, all subsequently submitted jobs must > > > wait > > > for it. > > > + * > > > + * We don't serialize the jobs when using a global perfmon as > > > it's > > > expected to > > > + * track concurrent activity from all jobs. > > > + */ > > > +static int > > > +v3d_serialize_for_perfmon(struct v3d_job *job) > > > +{ > > > + struct v3d_dev *v3d =3D job->v3d; > > > + bool is_global_perfmon; > > > + int ret; > > > + > > > + lockdep_assert_held(&v3d->sched_lock); > > > + > > > + scoped_guard(spinlock_irqsave, &v3d->perfmon_state.lock) > > > + is_global_perfmon =3D !!v3d->global_perfmon; > > > + > > > + if (is_global_perfmon) > > > + goto publish; > > > + > > > + if (job->perfmon) { > > > + for (enum v3d_queue q =3D 0; q < V3D_MAX_QUEUES; > > > q++) > > > { > > > + struct dma_fence *f =3D v3d- > > > > perfmon_state.last_hw_fence[q]; > > > + > > > + if (!f || dma_fence_is_signaled(f)) > > > + continue; > > > + > > > + ret =3D drm_sched_job_add_dependency(&job- > > > > base, dma_fence_get(f)); > > > + if (ret) > > > + return ret; > > > + } > > > + } else if (v3d->perfmon_state.fence && > > > + =C2=A0=C2=A0 !dma_fence_is_signaled(v3d- > > > >perfmon_state.fence)) > > > { > > > + ret =3D drm_sched_job_add_dependency(&job->base, > > > + =09 > > > dma_fence_get(v3d->perfmon_state.fence)); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > +publish: > > > + dma_fence_put(v3d->perfmon_state.last_hw_fence[job- > > > >queue]); > > > + v3d->perfmon_state.last_hw_fence[job->queue] =3D > > > dma_fence_get(job->done_fence); > > > + > > > + if (job->perfmon && !is_global_perfmon) { > > > + dma_fence_put(v3d->perfmon_state.fence); > > > + v3d->perfmon_state.fence =3D dma_fence_get(job- > > > > done_fence); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > =C2=A0=C2=A0static void > > > =C2=A0=C2=A0v3d_submit_attach_object_fences(struct v3d_submit *submit= ) > > > =C2=A0=C2=A0{ > > > @@ -337,6 +394,12 @@ v3d_submit_jobs(struct v3d_submit *submit) > > > =C2=A0=C2=A0 goto err; > > > =C2=A0=C2=A0 } > > > =C2=A0=20 > > > + for (int i =3D 0; i < submit->job_count; i++) { > > > + ret =3D v3d_serialize_for_perfmon(submit- > > > >jobs[i]); > > > + if (ret) > > > + goto err; > > > + } > > > + > > > =C2=A0=C2=A0 for (int i =3D 0; i < submit->job_count; i++) > > > =C2=A0=C2=A0 drm_sched_entity_push_job(&submit->jobs[i]- > > > >base); > > > =C2=A0=20 > > >=20 > >=20 >=20 >=20