We are evaluating Jetson AGX units for use in a moderately high radiation environment. We would like to monitor the ECC memory correction stats to determine when radiation-induced bit flips take place. However, we don’t know if/how this is possible on the Jetson. With desktop PCs, libnvidia-ml.so offers this capability, but that library is not supported on Jetson. I found this post, which may or may not point in a helpful direction:
Bottom line: my question is, how can we monitor the Jetson Industrial AGX’s ECC RAM correction statistics? Thanks very much!
On TX2i, DRAM ECC interrupt handling is done by the SCE, which is an R5 whose code is closed source. So, you will not be able to see the error stats.
If you choose to have the Linux kernel handle the DRAM ECC error interrupt, then you can see the errors in the console. However, the Linux driver provides no mitigation for double-bit errors; single-bit errors are taken care of by the HW itself. More explanation on how to enable the Linux kernel driver is here: RAM ECC TX2i
Thanks Bibek. As I stated above, we have Jetson AGX Industrial units, not TX2i's. Does the information you provided also apply to the AGX and/or AGX Industrial family?
In case of double-bit errors detected in the HW, the system reboots and the bad page is blacklisted by the dram-ecc binary. During reboot, you can check the boot log for the prints from that binary to see the blacklisted pages.
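If it helps, here is a minimal sketch of how captured console or boot log text could be scanned for ECC-related prints. The exact message format is not given in this thread, so the keywords in the pattern ("ecc", "blacklist") are assumptions, and the helper name `find_ecc_lines` is hypothetical. Adjust the pattern once you have seen what your unit actually prints.

```python
import re

# Assumed keywords: the real message text printed by the kernel driver or the
# dram-ecc binary is not shown in this thread, so tune this after inspecting
# an actual log from your unit.
ASSUMED_PATTERN = re.compile(r"ecc|blacklist", re.IGNORECASE)


def find_ecc_lines(log_text, pattern=ASSUMED_PATTERN):
    """Return (line_number, line) pairs for log lines that match the pattern."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

One way to use it would be to save the console or boot log to a file, read the file into a string, and pass that string to `find_ecc_lines`. Note that this only surfaces whatever the system already prints; it does not provide correction counters for single-bit errors handled in HW.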