From: Riana Tauro <riana.tauro@intel.com>
To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
rodrigo.vivi@intel.com, joonas.lahtinen@linux.intel.com,
simona.vetter@ffwll.ch, airlied@gmail.com, pratik.bari@intel.com,
joshua.santosh.ranjan@intel.com, ashwin.kumar.kulkarni@intel.com,
shubham.kumar@intel.com, ravi.kishore.koppuravuri@intel.com,
raag.jadav@intel.com, anvesh.bakwad@intel.com,
Riana Tauro <riana.tauro@intel.com>
Subject: [PATCH v8 0/5] Introduce DRM_RAS using generic netlink for RAS
Date: Mon, 23 Feb 2026 11:35:40 +0530
Message-ID: <20260223060541.526397-7-riana.tauro@intel.com>
This series continues the great work started by Aravind ([1] and [2]) to
fulfill the RAS requirements and the proposal that were discussed and agreed
upon at the Linux Plumbers accelerators BoF in 2022 [3].
[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
During the last review round, Lukas pointed out that netlink had evolved
in parallel over these years and that any new netlink family is now required
to use the YAML description and scripts.
With this new requirement in place, the family name is hardcoded in the YAML file,
so we are forced to have a single family name for all of drm, which in turn
forces us to have a registration step.
So, while adding the registration, we created the concept of a drm-ras-node.
For now, the only supported node type is the agreed error-counter, but this could
be expanded to other cases such as telemetry, as requested by Zack for the Qualcomm
accel driver.
In this first version, only querying counters is supported. The design can be
extended later with multicast notifications and with clearing of the counters.
This design, with multiple nodes per device, is already flexible enough for a driver
to decide whether it wants to handle errors per device, per IP block, or per error
category. I believe this fully addresses the AMD feedback from the earlier
reviews.
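
As a rough illustration of the multi-node idea (a userspace toy, not the kernel
API), the sketch below models a single shared registry in which each device
registers as many nodes as it likes. The names RasNode, RasRegistry, register()
and list_nodes() are hypothetical, as are the second device, the "gt-errors" and
"soc-errors" node names, and the sequential assignment of node ids.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class RasNode:
    """Hypothetical stand-in for one drm-ras node; not the kernel structure."""
    device_name: str
    node_name: str
    node_type: str = "error-counter"  # the only node type in this series
    node_id: int = -1                 # filled in at registration time


class RasRegistry:
    """Toy registry: one shared family, many nodes, ids assigned on register.

    Sequential ids are an assumption made for this sketch.
    """

    def __init__(self):
        self._nodes = {}
        self._ids = count()

    def register(self, node):
        node.node_id = next(self._ids)
        self._nodes[node.node_id] = node
        return node.node_id

    def unregister(self, node_id):
        self._nodes.pop(node_id, None)

    def list_nodes(self):
        # Same keys as the list-nodes reply shown in the examples.
        return [
            {"device-name": n.device_name, "node-id": n.node_id,
             "node-name": n.node_name, "node-type": n.node_type}
            for n in self._nodes.values()
        ]


registry = RasRegistry()

# Per error category, matching the two Xe nodes shown in the examples.
for name in ("correctable-errors", "uncorrectable-errors"):
    registry.register(RasNode("0000:03:00.0", name))

# Per IP block: a made-up second device choosing a different granularity.
for name in ("gt-errors", "soc-errors"):
    registry.register(RasNode("0000:04:00.0", name))

for entry in registry.list_nodes():
    print(entry)
```

The point of the sketch is only that the granularity is a driver decision: the
registry does not care whether a node stands for a device, an IP block, or an
error category.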
So, my proposal is to start simple with this case as is, and then iterate
with drm-ras in tree so that we evolve together according to the various drivers'
RAS needs.
I have provided documentation and the first Xe implementation of the counter
as a reference.
Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that fully
exercises this new API, so I hope it can serve as the reference code for uAPI
usage while we continue with the plan of introducing IGT tests and tools for this,
and of opening up the internal vendor tools alongside open-source development and
changing them to support these flows.
Example:
1) List Nodes:
$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
2) Get Error counters:
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
3) Get specific Error counter:
$ sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
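
For scripting around the examples above, the replies as printed are valid Python
literals, so a script can parse captured output without a custom parser. The
following is a minimal sketch that assumes the output has already been captured
as text (here it is pasted from the examples); the helper names parse_reply(),
node_ids_by_name() and nonzero_counters() are hypothetical and not part of any
tool.

```python
import ast

# Replies copied from the "List Nodes" and "Get Error counters" examples.
LIST_NODES_OUTPUT = """
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
"""

COUNTERS_OUTPUT = """
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
"""


def parse_reply(text):
    """Turn a printed reply into Python objects; literal_eval never runs code."""
    return ast.literal_eval(text.strip())


def node_ids_by_name(nodes):
    """Map node-name to node-id so a later query can pass the right node-id."""
    return {n["node-name"]: n["node-id"] for n in nodes}


def nonzero_counters(counters):
    """Return the names of counters that have recorded at least one error."""
    return [c["error-name"] for c in counters if c["error-value"] > 0]


nodes = parse_reply(LIST_NODES_OUTPUT)
counters = parse_reply(COUNTERS_OUTPUT)

print(node_ids_by_name(nodes))
print(nonzero_counters(counters))
```

ast.literal_eval() is used rather than eval() so that only literals are accepted
from the captured text.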
IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3
Rev2: Fix review comments
      Add support for GT and SOC errors
Rev3: Add uAPI for errors and nodes
      Update documentation
Rev4: Use only correctable and uncorrectable error nodes
      use REG_BIT
      remove redundant error strings
Rev5: Split patch 2
      use atomic_t
      fix memory leaks
      fix logs
      fix hook failure
      change component and severity uAPI
Rev6: fix alignment
      fix comparison in CSC error
      add severity string to csc error
      rename soc error handler base register variables
      deallocate info if drm ras registration fails
      rename init function to xe_drm_ras_init()
      fix htmldocs errors
      Add 'depends on NET' for drm ras netlink
Rev7: add macro for gt vector length and master local registers
      print errors on failure
Rev8: use single command for both do/dump operations
      fix yamllint errors
      regenerate files
      fix kernel-doc
Riana Tauro (4):
  drm/xe/xe_drm_ras: Add support for XE DRM RAS
  drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  drm/xe/xe_hw_error: Add support for Core-Compute errors
  drm/xe/xe_hw_error: Add support for PVC SoC errors
Rodrigo Vivi (1):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink
Documentation/gpu/drm-ras.rst | 103 +++++
Documentation/gpu/index.rst | 1 +
Documentation/netlink/specs/drm_ras.yaml | 118 ++++++
drivers/gpu/drm/Kconfig | 10 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/drm_drv.c | 6 +
drivers/gpu/drm/drm_ras.c | 352 ++++++++++++++++
drivers/gpu/drm/drm_ras_genl_family.c | 42 ++
drivers/gpu/drm/drm_ras_nl.c | 55 +++
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 86 +++-
drivers/gpu/drm/xe/xe_device_types.h | 4 +
drivers/gpu/drm/xe/xe_drm_ras.c | 186 +++++++++
drivers/gpu/drm/xe/xe_drm_ras.h | 15 +
drivers/gpu/drm/xe/xe_drm_ras_types.h | 48 +++
drivers/gpu/drm/xe/xe_hw_error.c | 451 +++++++++++++++++++--
include/drm/drm_ras.h | 75 ++++
include/drm/drm_ras_genl_family.h | 17 +
include/drm/drm_ras_nl.h | 25 ++
include/uapi/drm/drm_ras.h | 49 +++
include/uapi/drm/xe_drm.h | 79 ++++
21 files changed, 1682 insertions(+), 42 deletions(-)
create mode 100644 Documentation/gpu/drm-ras.rst
create mode 100644 Documentation/netlink/specs/drm_ras.yaml
create mode 100644 drivers/gpu/drm/drm_ras.c
create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
create mode 100644 drivers/gpu/drm/drm_ras_nl.c
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
create mode 100644 include/drm/drm_ras.h
create mode 100644 include/drm/drm_ras_genl_family.h
create mode 100644 include/drm/drm_ras_nl.h
create mode 100644 include/uapi/drm/drm_ras.h
--
2.47.1