How to monitor ECC RAM correction stats on Jetson AGX?

We are evaluating Jetson AGX units for use in a moderately high radiation environment. We would like to monitor the ECC memory correction stats to determine when radiation-induced bit flips take place. However, we don’t know if/how this is possible on the Jetson. With desktop PCs, libnvidia-ml.so offers this capability, but that library is not supported on Jetson. I found this post, which may or may not point in a helpful direction:

Bottom line: my question is, how can we monitor the Jetson Industrial AGX’s ECC RAM correction statistics? Thanks very much!

On TX2i, DRAM ECC interrupt handling is done by the SCE, which is a Cortex-R5 running closed-source firmware. So you will not be able to see the error stats.

If you choose to have the Linux kernel handle the DRAM ECC error interrupt instead, you can see the errors in the console. However, the Linux driver does no mitigation for double-bit errors; single-bit errors are taken care of by the HW itself. More explanation on how to enable the Linux kernel driver is here: RAM ECC TX2i

Thanks Bibek. As I stated above, we have Jetson AGX Industrial units, not TX2i's. Does the information you provided also apply to the AGX and/or AGX Industrial family?

The DRAM ECC feature is supported on Jetson AGX Industrial units by default. It works similarly to TX2i: the SCE monitors the ECC errors and takes appropriate action. Please refer to NVIDIA Jetson Linux Driver Package Software Features : Hardware Setup | NVIDIA Docs for setup details.

If a double-bit error is detected in the HW, the system reboots and the bad page is blacklisted by the dram-ecc binary. You can check the boot log during the reboot for its prints listing the blacklisted pages.
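
As a starting point for checking those logs, here is a minimal Python sketch that filters kernel or saved boot log text for lines mentioning ECC or blacklisting. The match pattern and the helper names (`filter_log_lines`, `read_kernel_log`, `scan_file`) are my own assumptions for illustration. I don't know the exact wording of the dram-ecc or kernel driver prints, so adjust the pattern to what you actually see on your units.

```python
import re
import subprocess

# Assumption: relevant messages contain "ecc" or "blacklist..." as a word.
# The real log wording on a given Jetson Linux release may differ.
DEFAULT_PATTERN = re.compile(r"\b(ecc|blacklist\w*)\b", re.IGNORECASE)


def filter_log_lines(lines, pattern=DEFAULT_PATTERN):
    """Return the lines that match the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def read_kernel_log():
    """Read the kernel ring buffer via dmesg (may require root)."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        # dmesg is not available on this system.
        return []
    return result.stdout.splitlines()


def scan_file(path, pattern=DEFAULT_PATTERN):
    """Filter a saved log file, e.g. a captured serial console boot log."""
    with open(path, errors="replace") as fh:
        return filter_log_lines(fh, pattern)
```

Calling `filter_log_lines(read_kernel_log())` scans the running kernel's log, and `scan_file(...)` does the same for a captured serial console log. Note this only surfaces whatever is printed; per the replies above, single-bit corrections handled by the SCE will not show up this way.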
