How can I obtain the number of correctable and uncorrectable errors from the AGX Orin Industrial?
Correctable errors are handled by the hardware itself. Although there are registers that hold the count, the CCPLEX doesn’t have access to them. Another R5, called FSI, and its firmware manage uncorrectable error handling by maintaining the list of bad pages. You will see ECC-related log messages at boot time while the bootloader is running.
Thanks for the reply.
Besides reading those registers (which seem inaccessible from application-space), are there any other interfaces for obtaining that count? On other computers with ECC I’ve been able to use Error Detection and Correction (EDAC) drivers that interface with the hardware and expose ECC errors via /sys/devices/system/edac.
My goal is to get access to correctable and uncorrectable ECC error counts from the application space. Similar to this thread: RAS ECC error log in JetPack 5 - #7 by Bibek
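For reference, on machines where an EDAC driver is loaded, the counters can be read from sysfs as sketched below. This is only an illustration of the interface meant above, not something that works on the Orin Industrial. The function name read_edac_counts is hypothetical; the mc*/ce_count and mc*/ue_count files are the standard Linux EDAC memory-controller attributes under /sys/devices/system/edac/mc.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: {"ce": int, "ue": int}} for each EDAC memory controller.

    Hypothetical helper: returns an empty dict when no EDAC driver is loaded
    (the directory is missing), which is the expected result on systems
    without EDAC support.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            # ce_count: correctable errors, ue_count: uncorrectable errors
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts
```

Calling read_edac_counts() periodically and logging the difference between successive readings would give the kind of running counter described above, assuming the platform exposes these files at all.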
The EDAC driver is not supported, because it would run from the same RAM that could incur these ECC errors while it is executing.
There is currently no way to get this data on the CCPLEX. We can consider giving the CCPLEX permission to read the registers, or transferring the data from the FSI to the CCPLEX so that an application can read it. What is the end goal or use case for reading this data?
The ultimate use case is to have a real-time ECC error counter during radiation testing of the Orin Industrial - specifically, to have correctable and uncorrectable ECC errors logged / transmitted in real time for the duration of the test.
I saw that in the current DRAM ECC support for the Orin Industrial, “when a double-bit error correction is detected, the system reboots”, so only getting the number of correctable ECC errors would be sufficient for now.
It seems that on the Xavier Industrial with JetPack 4.x, the RAS driver could be used to understand ECC performance, but as of JetPack 5.1 the RAS driver has been moved from the kernel to ARM Trusted Firmware (ATF). Does ATF have access to these registers, or is there currently no way at all to do what I’m describing?
Scanning the ATF docs (4.11. Reliability, Availability, and Serviceability (RAS) Extensions — Trusted Firmware-A 2.9.0 documentation), I haven’t been able to find any information on obtaining the uncorrectable / correctable ECC counts I described above.
If there is no way to get access to this information, how do I turn off DRAM/cache ECC on the Orin Industrial so that I can run a memory scan program and look for bitflips/memory disparity from the application space?
I saw this link, “To change the ECC state”, but it seems to apply only to GPU ECC.
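A minimal sketch of the kind of memory scan meant here: fill a buffer with a known pattern, wait, then compare every byte against the pattern and count the flipped bits. The names fill_pattern and count_bitflips are hypothetical. The sketch assumes ECC is actually disabled (otherwise the hardware corrects single-bit errors before software can see them), and it does nothing about CPU caches, so a real test would also need to make sure reads come from DRAM.

```python
def fill_pattern(size, pattern=0xAA):
    """Allocate a buffer of `size` bytes, each set to `pattern`."""
    return bytearray([pattern]) * size


def count_bitflips(buf, pattern=0xAA):
    """Return a list of (offset, flipped_bit_count) for bytes that differ from `pattern`."""
    # Fast path: nothing changed, so skip the per-byte scan.
    if buf == bytes([pattern]) * len(buf):
        return []
    flips = []
    for offset, value in enumerate(buf):
        diff = value ^ pattern  # bits set in diff are the flipped bits
        if diff:
            flips.append((offset, bin(diff).count("1")))
    return flips
```

In use, the buffer would be filled once before the test starts and count_bitflips called at intervals, with any non-empty result logged and the buffer rewritten.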