RAS FHI errors on Xavier followed by sudden reboot

Hi NV team,
We are seeing sudden reboots of Xavier on our custom carrier board.
JetPack version: # R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020

Before the reboot, there are many RAS errors in the kernel log, as shown in the attached file below.
My questions are:

  1. What do these errors mean? Are they what caused the system to reboot?
  2. What causes Xavier to reboot in general?
  3. What can we do to debug this further?

Thanks.


kern.log (116.8 KB)
The kernel log file is attached.
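
As a starting point for question 3, the RAS-related lines can be pulled out of a saved kernel log so that the last ones before the reboot are easy to inspect. This is only a minimal sketch: it assumes the messages contain the uppercase token "RAS", which may not hold for every message in the attached log, and the names `find_ras_lines` and `scan_log` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: RAS-related kernel messages contain the uppercase token "RAS".
# The lookbehind skips words such as "ERASE" that merely contain those letters.
RAS_PATTERN = re.compile(r"(?<![A-Za-z])RAS")


def find_ras_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for every line that mentions RAS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if RAS_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path: str) -> List[Tuple[int, str]]:
    """Scan a saved kernel log file, tolerating undecodable bytes."""
    with open(path, "r", errors="replace") as handle:
        return find_ras_lines(handle)


# Example use (the file name is whatever the log was saved as):
#   matches = scan_log("kern.log")
#   for number, text in matches[-20:]:
#       print(f"{number}: {text}")
```

Printing only the last few matches keeps the focus on what was logged right before the reboot. The pattern is deliberately crude and should be adjusted once the real message format is known.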

How did you trigger the RAS FHI error? Was any test running on the device?
JetPack 4.4 is a bit old.
Also, since this issue happened on your custom board, can you reproduce it on a newer version, such as JetPack 4.6?

See also: Capturing RAS error reboot - Jetson & Embedded Systems / Jetson AGX Xavier - NVIDIA Developer Forums

Hi, these errors are the same as the ones seen when the memory is under radiation testing. I don’t think that is the case here.
But it could be that you are writing to the register that triggers RAS errors, or that your custom board is writing to that address.

Try to check whether you are writing to this address:
sys/kernel/debug/carmel_ras/RAS_MCA_ERR-trip
0xff00000000

