From: Riana Tauro
To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	rodrigo.vivi@intel.com, joonas.lahtinen@linux.intel.com,
	simona.vetter@ffwll.ch, airlied@gmail.com, pratik.bari@intel.com,
	joshua.santosh.ranjan@intel.com, ashwin.kumar.kulkarni@intel.com,
	shubham.kumar@intel.com, ravi.kishore.koppuravuri@intel.com,
	raag.jadav@intel.com, anvesh.bakwad@intel.com, Riana Tauro
Subject: [PATCH v8 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
Date: Mon, 23 Feb 2026 11:35:43 +0530
Message-ID: <20260223060541.526397-10-riana.tauro@intel.com>
In-Reply-To: <20260223060541.526397-7-riana.tauro@intel.com>
References: <20260223060541.526397-7-riana.tauro@intel.com>

Initialize DRM RAS in hw error init. Map the hardware error severities
to the UAPI error severities and refactor the file.

Signed-off-by: Riana Tauro
Reviewed-by: Raag Jadav
---
v2: Fix hardware error enum
    add severity_str in csc handler
    simplify hw_error_info_init() function
    use drm_err if initialization fails (Raag)

v3: print error on failure (Raag)

v4: use const (Raag)
---
 drivers/gpu/drm/xe/xe_drm_ras_types.h |  8 ++++
 drivers/gpu/drm/xe/xe_hw_error.c      | 62 +++++++++++++++------------
 2 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
index 7acc5e7377b2..8d729ad6a264 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -11,6 +11,14 @@
 
 struct drm_ras_node;
 
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL,
+	HARDWARE_ERROR_FATAL,
+	HARDWARE_ERROR_MAX
+};
+
 /**
  * struct xe_drm_ras_counter - XE RAS counter
  *
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..baae050163df 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -10,20 +10,16 @@
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
+#include "xe_drm_ras.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
 #define HEC_UNCORR_FW_ERR_BITS 4
+
 extern struct fault_attr inject_csc_hw_error;
 
-/* Error categories reported by hardware */
-enum hardware_error {
-	HARDWARE_ERROR_CORRECTABLE = 0,
-	HARDWARE_ERROR_NONFATAL = 1,
-	HARDWARE_ERROR_FATAL = 2,
-	HARDWARE_ERROR_MAX,
-};
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
 static const char * const hec_uncorrected_fw_errors[] = {
 	"Fatal",
@@ -32,23 +28,18 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static const char *hw_error_to_str(const enum hardware_error hw_err)
+static bool fault_inject_csc_hw_error(void)
 {
-	switch (hw_err) {
-	case HARDWARE_ERROR_CORRECTABLE:
-		return "CORRECTABLE";
-	case HARDWARE_ERROR_NONFATAL:
-		return "NONFATAL";
-	case HARDWARE_ERROR_FATAL:
-		return "FATAL";
-	default:
-		return "UNKNOWN";
-	}
+	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
 }
 
-static bool fault_inject_csc_hw_error(void)
+static enum drm_xe_ras_error_severity hw_err_to_severity(const enum hardware_error hw_err)
 {
-	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return DRM_XE_RAS_ERR_SEV_CORRECTABLE;
+
+	/* Uncorrectable errors comprise both fatal and non-fatal errors */
+	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
 }
 
 static void csc_hw_error_work(struct work_struct *work)
@@ -64,7 +55,8 @@ static void csc_hw_error_work(struct work_struct *work)
 
 static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_mmio *mmio = &tile->mmio;
 	u32 base, err_bit, err_src;
@@ -77,8 +69,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	lockdep_assert_held(&xe->irq.lock);
 	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s HEC_ERR_STATUS register blank\n",
+				    tile->id, severity_str);
 		return;
 	}
 
@@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
 		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
 			drm_err_ratelimited(&xe->drm, HW_ERR
-					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
-					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					    "HEC FW %s %s reported, bit[%d] is set\n",
+					     hec_uncorrected_fw_errors[err_bit], severity_str,
 					     err_bit);
 
 			schedule_work(&tile->csc_hw_error_work);
@@ -99,7 +91,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	unsigned long flags;
 	u32 err_src;
@@ -110,8 +103,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 	spin_lock_irqsave(&xe->irq.lock, flags);
 	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s DEV_ERR_STAT register blank!\n",
+				    tile->id, severity_str);
 		goto unlock;
 	}
 
@@ -146,6 +139,14 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 			hw_error_source_handler(tile, hw_err);
 }
 
+static int hw_error_info_init(struct xe_device *xe)
+{
+	if (xe->info.platform != XE_PVC)
+		return 0;
+
+	return xe_drm_ras_init(xe);
+}
+
 /*
  * Process hardware errors during boot
  */
@@ -172,11 +173,16 @@ static void process_hw_errors(struct xe_device *xe)
 void xe_hw_error_init(struct xe_device *xe)
 {
 	struct xe_tile *tile = xe_device_get_root_tile(xe);
+	int ret;
 
 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
 		return;
 
 	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
 
+	ret = hw_error_info_init(xe);
+	if (ret)
+		drm_err(&xe->drm, "Failed to initialize XE DRM RAS (%pe)\n", ERR_PTR(ret));
+
 	process_hw_errors(xe);
 }
-- 
2.47.1
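
For readers following the severity mapping outside the kernel tree, here is
a minimal userspace sketch of how hw_err_to_severity() and the
error_severity[] lookup fit together: three hardware categories collapse
into two reported severities, and the severity then indexes a name table.
The enum ras_severity type, its values, and the strings in severity_names[]
are hypothetical stand-ins for enum drm_xe_ras_error_severity and
DRM_XE_RAS_ERROR_SEVERITY_NAMES, whose definitions are not part of this
patch. The severity_str() helper is also an illustration only; the patch
does the lookup inline in each handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hardware-side categories, mirroring enum hardware_error in the patch. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
	HARDWARE_ERROR_MAX
};

/* Hypothetical stand-in for enum drm_xe_ras_error_severity (values assumed). */
enum ras_severity {
	RAS_SEV_CORRECTABLE = 0,
	RAS_SEV_UNCORRECTABLE,
	RAS_SEV_MAX
};

/* Hypothetical stand-in for DRM_XE_RAS_ERROR_SEVERITY_NAMES (strings assumed). */
static const char * const severity_names[RAS_SEV_MAX] = {
	[RAS_SEV_CORRECTABLE] = "correctable",
	[RAS_SEV_UNCORRECTABLE] = "uncorrectable",
};

/* Same shape as the patch: only correctable errors keep their own severity. */
static enum ras_severity hw_err_to_severity(enum hardware_error hw_err)
{
	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
		return RAS_SEV_CORRECTABLE;

	/* Fatal and non-fatal errors are both reported as uncorrectable. */
	return RAS_SEV_UNCORRECTABLE;
}

/* Illustrative helper: map a hardware category straight to its printable name. */
static const char *severity_str(enum hardware_error hw_err)
{
	enum ras_severity sev = hw_err_to_severity(hw_err);

	/* The mapping only returns valid table indices; check it anyway. */
	assert(sev < RAS_SEV_MAX);
	return severity_names[sev];
}
```

Because the mapping can only return one of the two table indices, the lookup
stays in bounds for any hardware_error value, which is presumably why the
patch can drop the "UNKNOWN" default that hw_error_to_str() carried.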