Reliability, Availability and Serviceability (RAS) in Xavier: triggering errors for verification

Hi, I’m working with the Nvidia Xavier Industrial and I want a program that captures RAS error reports and sends them to an external device during testing. Reading the RAS code for the Xavier, I think it mentions or implies at some point that there is a way to trigger a RAS error in order to test it. Does anyone know how I can manually trigger a RAS error to verify my software? Thanks.

As of 2019, we had to write our own code to capture the RAS codes. We basically wrote a kernel module that read the RAS registers and exposed a very simple polling interface to user space. IIRC, we were able to inject errors as well, again with direct kernel register programming from that same interface.

Again, this is a bit dated, but this is the macro to read RAS registers:

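/* Append "<register name> <value>" to msg_buffer; expects offset and msg_buffer in scope. */
#define regstr(name) ({ \
                offset += snprintf(msg_buffer+offset, MSG_BUFFER_LEN-offset, "%s %lld\n", __stringify(name), (long long) read_sysreg(name));\
})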

Used like:

regstr(ID_AA64PFR0_EL1          );

Reading the ERXMISC0 registers directly was a bit more complicated. With appropriate registers, it looks something like:

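        for(idx = 0; idx < sizeof(err_select_reg)/sizeof(int); idx++) {
                /* Select the error record through ERRSELR_EL1 ... */
                write_sysreg_s(err_select_reg[idx], sys_reg(3,0,5,3,1));
                isb(); /* make the selection take effect before the read */
                /* ... then read ERXMISC0_EL1 for the selected record. */
                offset += snprintf(msg_buffer+offset, MSG_BUFFER_LEN-offset, "ERXMISC0.%03x %lld\n", (int) idx, (long long) read_sysreg_s(sys_reg(3,0,5,5,0)));
        }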

There are a couple of functions for reading ras_sec_errctr() and ras_ded_errctr() directly, in linux/arm64_ras.h.

As I look through our current code, we don’t exercise injection anymore. But I think the format is similar to the mechanism for reading ERRMISC0. First you write to a selection register, then you write the value.

Hope this helps…

Reading the node below prints a guide to error injection. Writing a value to this node injects an error.
cat /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
For code, please refer to the below files for more info about RAS registers.
File: drivers/platform/tegra/carmel_ras.c, drivers/ras/arm64_ras.c

Hi, thanks a lot, that will be very helpful for the capturing part of the RAS errors. Now I only need to manually trigger RAS errors for verification.

How am I supposed to use the values found in /sys/kernel/debug/carmel_ras/ to trigger those RAS errors?

Reading the node will provide info about how to inject and trigger a RAS error.
# cat /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
[ 107.764616] Please write data in below format to this node for injecting RAS error.
[ 107.764616] echo EEDDCCBBAA > RAS_MCA_ERR-trip
[ 107.764616] where:
[ 107.764616] EE[32-39] - L3_Bank_ID
[ 107.764616] DD[24-31] - Logical_Cluster_ID
[ 107.764616] CC[16-23] - Logical_CPU_ID
[ 107.764616] BB[08-15] - Error type(Corr is 0, UnCorr is 1)
[ 107.764616] AA[00-07] - Unit
[ 107.764616] Unit values are:
[ 107.764616] 1)IFU
[ 107.764616] 2)JSR RET
[ 107.764616] 3)JSR MTS
[ 107.764616] 4)LSD STQ
[ 107.764616] 5)LSD DCC
[ 107.764616] 6)LSD L1HPF
[ 107.764616] 7)L2
[ 107.764616] 8)Cluster Clocks
[ 107.764616] 9)MMU
[ 107.764616] 10)L3
[ 107.764616] 11)CCPMU
[ 107.764616] 12)SCF IOB
[ 107.764616] 13)SCF SNOC
[ 107.764616] 14)SCF CTU
[ 107.764616] 15)CMU Clocks

# echo EEDDCCBBAA > /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip /* EEDDCCBBAA is a value in the format above; this triggers a RAS error of the requested type */
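
For a test program, the EEDDCCBBAA layout from the help text can be packed with a few shifts. This is only a minimal sketch: the enum and function names are made up for illustration and are not part of the NVIDIA driver. The field positions are taken from the help output above, and the unit numbers in that list are assumed to be decimal (so CCPMU is 11, i.e. 0x0b).

```c
#include <stdint.h>

/* Error type field (bits 8-15), per the help text: Corr is 0, UnCorr is 1.
 * The enum name is hypothetical. */
enum ras_err_type {
    RAS_ERR_CORRECTABLE   = 0,
    RAS_ERR_UNCORRECTABLE = 1
};

/* Pack the five one-byte fields into the value written to RAS_MCA_ERR-trip.
 * Hypothetical helper; only the bit layout comes from the help text:
 *   bits 32-39  L3_Bank_ID
 *   bits 24-31  Logical_Cluster_ID
 *   bits 16-23  Logical_CPU_ID
 *   bits  8-15  Error type
 *   bits  0-7   Unit
 */
static uint64_t ras_trip_value(uint8_t l3_bank, uint8_t cluster, uint8_t cpu,
                               enum ras_err_type type, uint8_t unit)
{
    return ((uint64_t)l3_bank << 32) |
           ((uint64_t)cluster << 24) |
           ((uint64_t)cpu     << 16) |
           ((uint64_t)type    <<  8) |
            (uint64_t)unit;
}
```

With this layout, ras_trip_value(0, 0, 2, RAS_ERR_UNCORRECTABLE, 11) gives 0x2010b: an uncorrectable CCPMU error on logical CPU 2 of cluster 0.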

Hi, I just tested on both the Nvidia Xavier NX and the Nvidia Xavier Industrial, and when I run:

cat /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip

I get only the output:

and running:
echo > /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
returns the error:
echo: write error: Invalid argument

Both the Industrial and the NX are running the same kernel version: 4.9.253-tegra

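The help text is printed to the kernel log, not to the terminal. Please increase the console log level (for example, “dmesg -n 8”), or read the kernel log with “dmesg”, to see the output of “cat /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip”.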

Example of error injection:
Uncorrectable Error: # echo 0x2010b > /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
Correctable Error: # echo 0x2000b > /sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
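
If the injection has to be driven from the test program itself, the same write can be done from C. The following is a rough sketch, not NVIDIA code: the function and macro names are hypothetical, and it assumes debugfs is mounted at /sys/kernel/debug and the program runs as root. It writes the value as a hex string, matching the echo examples above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Node path as given in this thread. */
#define RAS_TRIP_NODE "/sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip"

/* Write `value` to `path` as "0x<hex>\n", like `echo 0x2010b > node`.
 * Hypothetical helper. Returns 0 on success, -1 on any failure. */
static int ras_write_trip(const char *path, uint64_t value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;

    ok = fprintf(f, "0x%" PRIx64 "\n", value) > 0;

    /* stdio buffers the write, so a rejection by the kernel
     * (e.g. "Invalid argument") may only show up at fclose(). */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling ras_write_trip(RAS_TRIP_NODE, 0x2010b) would then correspond to the uncorrectable-error example, and a return value of -1 to the "write error" seen with a plain echo.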
