RAS ECC error log in JetPack 5

shreeyam · July 17, 2023, 6:49pm

Hi all,

I am running an experiment where I would like to be able to look at ECC errors (both uncorrectable and correctable) on Orin. Previously in JetPack 4.6 on AGX Xavier these errors would be emitted as ras_ccplex_serr_callback in the kernel buffer. I see that in JP5 RAS has been moved to the ARM TrustedFirmware-A environment.

Some questions:

Now that carmel_ras in the debug file system is gone, is there a simple way to access RAS ECC errors? Does rasdaemon still work to log RAS errors, or is there something even simpler I’m missing?
I see a previous topic about inducing errors requiring changes to be made to the TrustedFirmware-A tests repository. Is there any way to do this through the kernel?

Thank you!

Bibek · July 18, 2023, 9:38am

DRAM ECC is not supported on AGX Orin. It will be enabled for

shreeyam · July 18, 2023, 4:18pm

We are using Orin Industrial modules. We just got them in.

ShaneCCC · July 19, 2023, 4:17am

Please refer to the topic below.

Enabling RAS drivers on NX dev kit (Reliability, Accessibility and Serviceability) - Jetson & Embedded Systems / Jetson Xavier NX - NVIDIA Developer Forums

Thanks

shreeyam · July 19, 2023, 7:28pm

I have seen that thread and it ends without a resolution, with the commenter’s last two questions not answered.

Is there simply no way to access these logs by default anymore? The functionality has been fully stripped from JP5?

Is there documentation on deploying the TF-A tests repo specifically to Jetson devices? It seems like the only documentation about it is the thread that you linked. I’m also unsure how it helps if it only runs on boot when I need to log ECC errors throughout device operation under different load conditions.

Bibek · July 20, 2023, 12:54pm

Can you share your board boot log?

shreeyam · July 20, 2023, 11:56pm

Yes, here is the boot log. Module is the Jetson AGX Orin Industrial on the devkit, flashed with Jetson Linux 35.3.1/JetPack 5.1.1 with the jetson-agx-orin-devkit-industrial target.

boot.log (91.8 KB)

shreeyam · July 24, 2023, 9:17pm

Hi NVIDIA team,

We’ve decided we can induce errors ourselves at least using a local proton beam facility.

Can you confirm that we can still read correctable/uncorrectable ECC errors through dmesg please?

Thanks!

Bibek · July 25, 2023, 12:08pm

Thanks for the dmesg log.
If you can collect the console UART log, you will see in the bootloader logs that ECC is enabled and which regions are being used for ECC.
The BSP that has been shipped has Single Bit Error Correction support. Its status can be read from the EMC registers, but I need to get them added to the TRM. Also, the CCPLEX doesn't have access to these registers. Once they are added, I will update.

shreeyam · July 25, 2023, 6:12pm

Hi Bibek,

Sorry, think I misunderstood you. Have attached the log from the UART console. ECC looks like it’s enabled for the entire memory region as expected.

boot_uart.txt (96.9 KB)

Good to know about the EMC. Is there any workaround to access the ECC status/errors for now? We’re looking to do some rad testing in August or so, it would be a great tool to have.

Thanks!

vtaksheyev · August 15, 2023, 11:12pm

Have you had any luck or workarounds with this @shreeyam?

shreeyam · August 15, 2023, 11:32pm

NVIDIA gave us a workaround but we have some open questions that maybe forum moderators can chime in with.

You can inject errors and keep reading the EMC status registers to see whether they are reported:

Disable SCR_CONFIG to allow access to the error injection carveout:

# Comment out (temporarily) the SCR_CONFIG variable in Linux_for_Tegra/p3701.conf.common
# SCR_CONFIG="tegra234-mb2-bct-scr-p3701-0000.dts";

Continue reading the status registers for testing. The value of one of these status registers will increment if an SBE or DBE is generated.

This is only for dev testing. Do not go to production with the SCR config disabled.

# emc-print-emc-ecc-status.sh

echo "EMC_ECC_STATUS_0 (EMC0) = $(sudo ./devmem2 0x02c70ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC1) = $(sudo ./devmem2 0x02c80ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC2) = $(sudo ./devmem2 0x02c90ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC3) = $(sudo ./devmem2 0x02ca0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC4) = $(sudo ./devmem2 0x02cb0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC5) = $(sudo ./devmem2 0x02cc0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC6) = $(sudo ./devmem2 0x02cd0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC7) = $(sudo ./devmem2 0x02ce0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC8) = $(sudo ./devmem2 0x01780ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC9) = $(sudo ./devmem2 0x01790ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC10) = $(sudo ./devmem2 0x017a0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC11) = $(sudo ./devmem2 0x017b0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC12) = $(sudo ./devmem2 0x017c0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC13) = $(sudo ./devmem2 0x017d0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC14) = $(sudo ./devmem2 0x017e0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC15) = $(sudo ./devmem2 0x017f0ac4 | awk '/Value at address/{print $6}')"
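
For logging throughout a test run rather than a one-off read, the same reads can be wrapped in a polling loop. The following is only a rough sketch, not something NVIDIA provided: the function names, the `DEVMEM` variable, and the `poll` argument are made up for illustration, and the 0x10000 per-channel stride is simply what the sixteen addresses above appear to follow (EMC0-7 from 0x02c70ac4, EMC8-15 from 0x01780ac4), not a documented layout. It assumes `devmem2` sits in the current directory as in the script above.

```shell
#!/usr/bin/env bash
# Sketch: poll the 16 EMC_ECC_STATUS_0 registers and log whenever a value changes.

# Assumption: devmem2 is in the current directory; override DEVMEM if it is elsewhere.
DEVMEM="${DEVMEM:-sudo ./devmem2}"

# Print the status-register address for EMC channel $1 (0..15).
# The stride is inferred from the address list above, not taken from the TRM.
emc_ecc_status_addr() {
    ch=$1
    if [ "$ch" -lt 8 ]; then
        addr=$(( 0x02c70ac4 + ch * 0x10000 ))
    else
        addr=$(( 0x01780ac4 + (ch - 8) * 0x10000 ))
    fi
    printf '0x%08x\n' "$addr"
}

# Read one channel's status value through devmem2.
read_emc_ecc_status() {
    $DEVMEM "$(emc_ecc_status_addr "$1")" | awk '/Value at address/{print $6}'
}

# Poll every $1 seconds (default 5) and print a timestamped snapshot
# of all channels each time any value differs from the previous snapshot.
poll_emc_ecc_status() {
    interval=${1:-5}
    prev=""
    while true; do
        snap=$(for ch in $(seq 0 15); do
            echo "EMC_ECC_STATUS_0 (EMC$ch) = $(read_emc_ecc_status "$ch")"
        done)
        if [ "$snap" != "$prev" ]; then
            date '+%Y-%m-%dT%H:%M:%S'
            echo "$snap"
            prev=$snap
        fi
        sleep "$interval"
    done
}

# Only start polling when explicitly asked, e.g.: ./emc-poll.sh poll 5
if [ "${1:-}" = "poll" ]; then
    poll_emc_ecc_status "${2:-5}"
fi
```

Redirecting the output to a file (for example `./emc-poll.sh poll 1 | tee ecc.log`, with the script name being hypothetical) gives a timestamped record that can be lined up against beam exposure or load conditions afterwards. As with the original script, this only works while SCR_CONFIG is disabled.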

Some questions I still have:

Does this work for CPU cache too? Intuitively, since it's for the EMC, I think it wouldn't, but I would like to confirm.
I see that the Orin TRM still needs updating with EMC documentation. I would also like to access ECC counts on JAXi as a comparison, so it would be great to get confirmation that the EMC memory addresses are the same there.
