Hi NVIDIA community, I’m interested in using the Xavier NX’s built-in Reliability, Availability and Serviceability (RAS) drivers to understand the ECC performance of the Carmel CPUs in an extreme environment.
I have two goals:
(1) Report any RAS errors (both correctable and uncorrectable) to the serial UART2 debug interface, along with the origin of each error, i.e. which CPU unit raised it.
(2) “Spoof” or trigger a RAS error to mimic the conditions the Jetson will live in, and validate that software recovery from the error happens correctly.
Please correct me if I’m wrong, but from the older posts on this topic and a paper I read, it seemed that carmel_ras.c and arm64_ras.c were out-of-the-box implementations provided in JetPack 4.x to capture RAS errors in the UART debug logs (the same as any other errors thrown). Injecting errors via the RAS_MCA_ERR-trip node could also be done out of the box.
However, I’m not seeing any analogous code examples in the Arm Trusted Firmware (ATF) docs. I’m more interested in seeing the errors as debug logs after the run than in polling RAS registers in real time as errors happen.
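To make the post-runtime part concrete, here is a rough sketch of what I have in mind: capture the UART output to a file, then scan it afterwards for lines that look like RAS reports. The keywords (`RAS`, `ECC`, "correctable", "uncorrectable") and the function names are my own assumptions for illustration; I don't know the actual log format the drivers emit, so the patterns would need adjusting.

```python
import re
from collections import Counter

# Assumed keywords only: the real RAS driver log format is not confirmed here.
RAS_PATTERN = re.compile(r"\b(RAS|ECC)\b", re.IGNORECASE)
UNCORRECTABLE = re.compile(r"\buncorrect(able|ed)\b", re.IGNORECASE)
CORRECTABLE = re.compile(r"\bcorrect(able|ed)\b", re.IGNORECASE)


def classify(line):
    """Return 'uncorrectable', 'correctable', 'unclassified', or None.

    None means the line does not look like a RAS/ECC report at all.
    """
    if not RAS_PATTERN.search(line):
        return None
    # Check 'uncorrectable' first so it is never mistaken for 'correctable'.
    if UNCORRECTABLE.search(line):
        return "uncorrectable"
    if CORRECTABLE.search(line):
        return "correctable"
    return "unclassified"


def summarize(lines):
    """Count RAS-looking lines by kind in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return dict(counts)
```

I would run this over the saved serial capture with something like `summarize(open("uart2_capture.log"))`, where the file name is just a placeholder.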
Would I need to write my own C driver to capture and inject errors in order to replicate the functionality that was provided in JetPack 4.x?
If so, is there a way to port the old drivers over to JetPack 5.x to avoid redoing this from scratch?