RAS TEGRA_23x_SOC modification sources

I need to modify the RAS API on Jetson Orin for some tests. For the Xavier family of devices, these changes can be made in the file “/kernel/nvidia/drivers/ras/arm64_ras.c”, but now I’m using NVIDIA Jetson Linux 35.5.0, and I notice from the driver's Makefile that it is only enabled for TEGRA_19x_SOC. (File kernel/nvidia/drivers/ras/Makefile)

ccflags-y += -Werror
obj-$(CONFIG_ARM64_RAS) += arm64_ras.o

I know that the same errors also appear on the Orin under test, but I cannot find where the source for Orin is. The only thing I found is some text baked into the binary file “tos-optee_t234.img” that refers to the errors I recorded.

So my question is: where is the source that reports the RAS errors on TEGRA_23x_SOC devices?

hello ivanrodriguezferrandez,

may I know what’s the real use-case? this driver is specific to the Xavier series (i.e. t19x series).

Hi @JerryChang.
The way the driver works now, when a RAS error happens on either Xavier or Orin, the error is reported and printed to the system messages (dmesg). When an uncorrectable error happens, sometimes (this was more noticeable on JetPack 4) the driver issues a kernel panic and the system reboots. As part of my PhD thesis, I need to modify the drivers so that another Linux module can read and process the RAS errors as a way to mitigate some of them, like L2 or L3 uncorrectable errors. For Xavier it is clear that these changes can be made in the arm64_ras.c file, but it is not so clear to me where that driver is for the Orin devices.

hello ivanrodriguezferrandez,

is it possible to share some failure messages for reference?

Yes, sure, here are some of the examples that I have.
This is from the Orin NX:

RAS Uncorrectable Error in SCC, base=0xe018000:
    Status = 0xec001007
    SERR = Address/control value from associative memory: 0x7
    IERR = L2 Dir Parity Error: 0x10
    Overflow (there may be more errors) - Uncorrectable
    MISC0 = 0x260880
    MISC2 = 0x0
    ADDR = 0xe000000446cece80
sdei_dispatch_event returned -1
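For anyone who wants to post-process reports like this one from a captured UART log, here is a minimal Python sketch. It assumes the report always has the shape shown above (a header line with `base=`, then `NAME = [description: ]0xVALUE` lines); the function name `parse_ras_report` and the dictionary keys are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Sample text copied from the Orin NX report in this thread.
SAMPLE = """RAS Uncorrectable Error in SCC, base=0xe018000:
    Status = 0xec001007
    SERR = Address/control value from associative memory: 0x7
    IERR = L2 Dir Parity Error: 0x10
    Overflow (there may be more errors) - Uncorrectable
    MISC0 = 0x260880
    MISC2 = 0x0
    ADDR = 0xe000000446cece80
sdei_dispatch_event returned -1
"""

# Header line: severity word, reporting unit, and register base address.
HEADER = re.compile(
    r"RAS (?P<severity>\w+) Error in (?P<unit>\w+), base=(?P<base>0x[0-9a-fA-F]+):"
)
# Field line: NAME = 0xVALUE, optionally with "description: " before the value.
FIELD = re.compile(
    r"^\s*(?P<name>\w+) = (?:(?P<desc>.*): )?(?P<value>0x[0-9a-fA-F]+)\s*$"
)


def parse_ras_report(text):
    """Return a dict with the header info and every hex field found in text."""
    report = {}
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            report["severity"] = header["severity"]
            report["unit"] = header["unit"]
            report["base"] = int(header["base"], 16)
            continue
        field = FIELD.match(line)
        if field:
            report[field["name"]] = int(field["value"], 16)
            if field["desc"]:
                report[field["name"] + "_desc"] = field["desc"]
    return report
```

Lines that do not match either pattern (the "Overflow ..." line and the `sdei_dispatch_event` line) are simply skipped. The Xavier-style header uses `ERRSELR_EL1=` instead of `base=`, so it would need its own pattern.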

And this is from the Xavier NX platform

RAS Error in L2, ERRSELR_EL1=0x200:
Status = 0xcd005006
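As a reading aid for the two Status values, here is a small Python sketch that splits a value into the fields of the ERR<n>STATUS register as laid out in the Arm RAS extension (AV, V, UE, ER, OF, MV, CE, IERR, SERR). It assumes both SoCs follow that architectural layout; the meaning of IERR and of each SERR code is implementation specific and is not decoded here. The function name `decode_err_status` is made up for illustration.

```python
def decode_err_status(status):
    """Split a RAS status word into fields, assuming the Arm ERR<n>STATUS layout."""

    def bit(position):
        return bool((status >> position) & 1)

    return {
        "AV": bit(31),                # address register valid
        "V": bit(30),                 # status register valid
        "UE": bit(29),                # uncorrected error
        "ER": bit(28),                # error reported
        "OF": bit(27),                # overflow, more errors may have occurred
        "MV": bit(26),                # miscellaneous registers valid
        "CE": (status >> 24) & 0x3,   # corrected error field
        "IERR": (status >> 8) & 0xFF, # implementation-defined error code
        "SERR": status & 0xFF,        # architecturally defined error code
    }


orin = decode_err_status(0xEC001007)    # Orin NX value from this thread
xavier = decode_err_status(0xCD005006)  # Xavier NX value from this thread
```

For the Orin value this gives UE and OF set, IERR 0x10 and SERR 0x7, which agrees with the "Overflow ... Uncorrectable", "IERR ... 0x10" and "SERR ... 0x7" lines in the report. For the Xavier value UE is clear and the CE field is nonzero.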

hello ivanrodriguezferrandez,

how did you reproduce this? did you access the memory address directly (i.e. devmem)?

Hi @JerryChang
The output that I shared was generated in a real test in an extreme environment, in which the Jetson printed out these errors. Without external sources I don’t truly know how to generate those errors. In JetPack 4 it was much easier, since there was a driver for that. For JetPack 5 on the Orin NX, I know that in order to generate such errors you can use ERR0PFGCTL, the Error Pseudo Fault Generation Control Register of the A78AE, but I didn’t manage to generate an error this way.

could you please refer to the Jetson Orin NX Series Data Sheet to check the [Operating Requirements]?

The Operating Requirements described in the document were maintained during the test. But that does not change the question itself, which is: where is the source code that generates these error prints for the 23x_SoC?

hello ivanrodriguezferrandez,

it’s an error log printed by Arm Trusted Firmware on detecting a RAS uncorrectable error.
may I double confirm the platform and L4T release version you’re using?
for instance, is it Orin NX DevKit, with l4t-r35.5.0 (Jetpack-5.1.3)?

Hello JerryChang

Correct, the error is printed by Arm Trusted Firmware, also called RAS.

That printout happened during the test with an Orin NX module connected to an Orin Nano carrier board, with L4T 35.4.

hello ivanrodriguezferrandez,

please visit Jetson Linux 35.5.0 and download the [Driver Package (BSP) Sources] package.
you may dig into the ATF source code, which is available by extracting the atf_src.tbz2 package.
note, the build instructions for ATF are in the optee source package, nvidia-jetson-optee-source.tbz2,
the file name is… atf_and_optee_README.txt.
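Once atf_src.tbz2 is extracted, one way to locate the code behind a log line is to search the tree for part of the message text. The Python sketch below does that. The helper name `find_string_in_tree`, the directory name `atf_src` and the search word "Uncorrectable" are assumptions for illustration only; the real format string in the sources may be worded or split differently.

```python
import os


def find_string_in_tree(root, needle, extensions=(".c", ".h")):
    """Return (path, line number, line) for every source line containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, "r", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, lineno, line.strip()))
    return hits


# Hypothetical usage, assuming the package was extracted into ./atf_src:
# for path, lineno, line in find_string_in_tree("atf_src", "Uncorrectable"):
#     print(f"{path}:{lineno}: {line}")
```

If the search word gives no hits, trying a shorter fragment of the log line (for example a field name such as "MISC0") is a reasonable next step.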

the logs in your bug description aren’t enough to dig further.
please set up a serial console to gather the complete UART logs if you need further support.
you may also see https://www.youtube.com/watch?v=Kwpxhw41W50 for setting up a serial console.

Hi JerryChang

That is what I was looking for. I will mark it as the solution.

BTW, the “bug” was expected during the test, so the device performed as it should. And yes, that error came from the debug UART of the device.

Thanks a lot.
