Capturing RAS error reboot

While testing, we found that when an unrecoverable RAS error occurs, the board reboots itself. This happens on both boards we tested, the Nvidia Xavier Industrial and the NX. Is there any way to capture the error so that the board does not reboot? Instead of rebooting, we would like to launch a procedure to fix or mitigate the issue. Thanks for your time.

The issue seems similar to Reliability, Accessibility and Serviceability: RAS In Xavier triggering for verification - Jetson & Embedded Systems / Jetson AGX Xavier - NVIDIA Developer Forums

You could also refer to Reliability, Accessibility and Serviceability: RAS, ECC Error Detection and Correction - Jetson & Embedded Systems / Jetson AGX Xavier - NVIDIA Developer Forums

Hi, I posted that question as well, but I think it is a different topic. In the other question I am asking about reading the RAS error itself. In this post, my question is about the part of RAS that issues a reboot when an uncorrectable error happens. What I want is to capture the reboot signal and do something other than reboot, for example powering down the core that has the error, or flushing the cache, etc.

It is recommended to reboot in case of Uncorrectable errors as they are fatal. For correctable errors, a reboot is not required.
For test purposes, you can set “is_debug=1” in the function “ras_ccplex_serr_callback” in the file “drivers/platform/tegra/carmel_ras.c”. This avoids the immediate crash, but a crash may still occur later if the error is not completely handled.
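
To illustrate the idea only, here is a small user-space sketch of how a debug flag can gate the fatal path. It is not the code in carmel_ras.c: the enum, the function name handle_uncorrectable and the log messages are hypothetical, and only the flag name is_debug comes from the reply above. Check the real callback in your kernel source before changing anything.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome of handling one uncorrectable RAS event. */
enum ras_action { RAS_ACTION_PANIC, RAS_ACTION_LOG_ONLY };

/*
 * Hypothetical model of the decision, NOT the kernel implementation.
 * With is_debug unset the error is treated as fatal (the reboot case);
 * with is_debug set the error is only reported and execution continues.
 */
static enum ras_action handle_uncorrectable(bool is_debug)
{
    if (is_debug) {
        /* Debug path: report the error and keep running. */
        fprintf(stderr, "RAS: uncorrectable error logged, continuing\n");
        return RAS_ACTION_LOG_ONLY;
    }

    /* Default path: the error is fatal, which leads to the reboot. */
    fprintf(stderr, "RAS: uncorrectable error, treating as fatal\n");
    return RAS_ACTION_PANIC;
}
```

In this model, handle_uncorrectable(true) returns RAS_ACTION_LOG_ONLY and handle_uncorrectable(false) returns RAS_ACTION_PANIC. Skipping the fatal path does not repair the underlying fault, which is why a later crash is still possible.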
