RAS ECC error log in JetPack 5

Hi all,

I am running an experiment where I would like to be able to look at ECC errors (both uncorrectable and correctable) on Orin. Previously in JetPack 4.6 on AGX Xavier these errors would be emitted as ras_ccplex_serr_callback in the kernel buffer. I see that in JP5 RAS has been moved to the ARM TrustedFirmware-A environment.

Some questions:

  1. Now that carmel_ras in the debug file system is gone, is there a simple way to access RAS ECC errors? Does rasdaemon still work to log RAS errors, or is there something even simpler I’m missing?
  2. I see a previous topic about inducing errors requiring changes to be made to the TrustedFirmware-A tests repository. Is there any way to do this through the kernel?

Thank you!

DRAM ECC is not supported on AGX Orin. It will be enabled for

We are using Orin Industrial modules. We just got them in.

Please refer to the topic below.

Enabling RAS drivers on NX dev kit (Reliability, Accessibility and Serviceability) - Jetson & Embedded Systems / Jetson Xavier NX - NVIDIA Developer Forums

Thanks

I have seen that thread and it ends without a resolution, with the commenter’s last two questions not answered.

Is there simply no way to access these logs by default anymore? The functionality has been fully stripped from JP5?

Is there documentation on deploying the TF-A tests repo specifically to Jetson devices? It seems like the only documentation about it is the thread that you linked. I’m also unsure how it helps if it only runs on boot when I need to log ECC errors throughout device operation under different load conditions.

Can you share your board boot log?

Yes, here is the boot log. Module is the Jetson AGX Orin Industrial on the devkit, flashed with Jetson Linux 35.3.1/JetPack 5.1.1 with the jetson-agx-orin-devkit-industrial target.

boot.log (91.8 KB)

Hi NVIDIA team,

We’ve decided we can at least induce errors ourselves using a local proton beam facility.

Can you confirm that we can still read correctable/uncorrectable ECC errors through dmesg please?

Thanks!

Thanks for the dmesg log.
If you can collect the console UART log, you will see in the bootloader logs that ECC is enabled and which regions are being used for ECC.
The BSP that has been shipped supports Single Bit Error Correction. Its status can be read from the EMC registers, but I need to get them added to the TRM. Also, CCPLEX does not have access to these registers. Once they are added, I will update this thread.

Hi Bibek,

Sorry, think I misunderstood you. Have attached the log from the UART console. ECC looks like it’s enabled for the entire memory region as expected.

boot_uart.txt (96.9 KB)

Good to know about the EMC. Is there any workaround to access the ECC status/errors for now? We’re looking to do some rad testing in August or so, and it would be a great tool to have.

Thanks!

Have you had any luck or workarounds with this @shreeyam?

NVIDIA gave us a workaround but we have some open questions that maybe forum moderators can chime in with.

You can inject an error and keep reading the EMC status registers to see whether it is reported:

Disable SCR_CONFIG to allow access to the error injection carveout:

# Comment out (temporarily) the SCR_CONFIG variable in Linux_for_Tegra/p3701.conf.common
# SCR_CONFIG="tegra234-mb2-bct-scr-p3701-0000.dts";
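The edit above can also be scripted so that it is easy to undo after testing. This is only a sketch: the function names are hypothetical, and it assumes the active assignment starts at the beginning of a line in p3701.conf.common.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of the BSP.
# Assumes the active line starts with "SCR_CONFIG=" at column 0.
CONF="${CONF:-Linux_for_Tegra/p3701.conf.common}"

disable_scr_config() {
    # Prefix the SCR_CONFIG assignment with "# " and keep a .bak copy.
    sed -i.bak 's/^SCR_CONFIG=/# &/' "$1"
}

restore_scr_config() {
    # Put the original file back from the .bak copy.
    mv "$1.bak" "$1"
}

# Usage: sh scr-config.sh disable|restore
case "${1:-}" in
    disable) disable_scr_config "$CONF" ;;
    restore) restore_scr_config "$CONF" ;;
esac
```

Keeping the backup makes it harder to forget to re-enable the SCR config before moving on from dev testing.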

Continue reading the status registers during testing. The value of one of these status registers will increment if an SBE (single-bit error) or DBE (double-bit error) is generated.

This is only for dev testing. Do not go to production with the SCR config disabled.

# emc-print-emc-ecc-status.sh
echo "EMC_ECC_STATUS_0 (EMC0) = $(sudo ./devmem2 0x02c70ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC1) = $(sudo ./devmem2 0x02c80ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC2) = $(sudo ./devmem2 0x02c90ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC3) = $(sudo ./devmem2 0x02ca0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC4) = $(sudo ./devmem2 0x02cb0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC5) = $(sudo ./devmem2 0x02cc0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC6) = $(sudo ./devmem2 0x02cd0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC7) = $(sudo ./devmem2 0x02ce0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC8) = $(sudo ./devmem2 0x01780ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC9) = $(sudo ./devmem2 0x01790ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC10) = $(sudo ./devmem2 0x017a0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC11) = $(sudo ./devmem2 0x017b0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC12) = $(sudo ./devmem2 0x017c0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC13) = $(sudo ./devmem2 0x017d0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC14) = $(sudo ./devmem2 0x017e0ac4 | awk '/Value at address/{print $6}')"
echo "EMC_ECC_STATUS_0 (EMC15) = $(sudo ./devmem2 0x017f0ac4 | awk '/Value at address/{print $6}')"
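
For logging throughout a test run rather than a one-off read, the same reads can be wrapped in a polling loop that prints a timestamped line only when a register value changes. This is a rough sketch, not something NVIDIA provided: it reuses the addresses and the ./devmem2 call from the script above, the function names and the INTERVAL variable are hypothetical, and it makes no assumption about how the register bits are laid out (it only compares raw values).

```shell
#!/bin/sh
# Sketch only: log changes in the EMC_ECC_STATUS_0 registers over time.
# Assumes ./devmem2 and the addresses from the script above.
# Function names and INTERVAL are hypothetical.

# EMC0..EMC15, in order.
EMC_ADDRS="0x02c70ac4 0x02c80ac4 0x02c90ac4 0x02ca0ac4
0x02cb0ac4 0x02cc0ac4 0x02cd0ac4 0x02ce0ac4
0x01780ac4 0x01790ac4 0x017a0ac4 0x017b0ac4
0x017c0ac4 0x017d0ac4 0x017e0ac4 0x017f0ac4"

# Extract the register value from devmem2 output on stdin.
parse_value() {
    awk '/Value at address/{print $6}'
}

read_reg() {
    sudo ./devmem2 "$1" | parse_value
}

# Print all 16 values on one line; ERR marks a failed read.
snapshot() {
    out=""
    for addr in $EMC_ADDRS; do
        val=$(read_reg "$addr")
        out="$out ${val:-ERR}"
    done
    echo "${out# }"
}

# Compare two snapshots and print one line per channel that changed.
diff_snapshots() {
    old=$1
    new=$2
    i=0
    set -- $new
    for o in $old; do
        if [ "$o" != "$1" ]; then
            echo "EMC$i: $o -> $1"
        fi
        shift
        i=$((i + 1))
    done
}

# Usage: sh emc-poll.sh --poll   (INTERVAL=5 sh emc-poll.sh --poll)
if [ "${1:-}" = "--poll" ]; then
    prev=$(snapshot)
    while sleep "${INTERVAL:-1}"; do
        cur=$(snapshot)
        diff_snapshots "$prev" "$cur" | while read -r line; do
            echo "$(date '+%Y-%m-%dT%H:%M:%S') $line"
        done
        prev=$cur
    done
fi
```

Redirecting the output to a file gives a change log that can be lined up against beam or load conditions afterwards.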

Some questions I still have:

  1. Does this work for CPU cache too? Intuitively, since it’s for the EMC, I think it wouldn’t, but I would like to confirm.
  2. I see that the Orin TRM still needs updating with EMC documentation. I would like to access ECC counts on JAXi as well, as a comparison, so it would be great to get confirmation that the EMC memory addresses are the same there.