Xavier - validating the Flash Image

The NVIDIA Xavier has 32GB of flash. Is there a tool or bare-metal application that can validate that the flash image is still good? I was thinking of running a flash memory integrity check once a week. Is there a recommended tool for this on Linux?

Thank you ahead of time.

Hi

Do you mean you want to check the health of the flashed rootfs, the system image (*.img) file, or the memory of a running system?

Hi,
I would like to check the flashed rootfs and check that /dev/mmcblk0p1 is still “healthy”. I also have a 2TB NVMe M.2 flash drive installed on my Xavier system, and I would like to check whether that is “healthy” as well. Does NVIDIA recommend a tool for running such tests while the Linux system is running?

Thanks ahead of time.

Are you looking for something like “SMART drive” hardware health monitoring? Or just filesystem integrity? Or are you looking for some sort of package checksum test?

I was thinking about an OS/image memory validation. Over time and with use, flash memory degrades, causing bit flips, which in turn can lead to unexpected behavior. One pretty common method is to run a CRC over the entire memory of the OS and another over the image. You then compare each calculated CRC with a known CRC value that was calculated when the device was first programmed.
The “SMART drive” link looks like the right direction. The tool would be similar to the “nvme” or “smartctl” command-line tools.
Has anyone used “smartctl” to check the “health” of both /dev/nvme0n1 and /dev/mmcblk0p1?
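
The CRC comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not an NVIDIA-recommended tool: the function names `crc32_of` and `matches_baseline` are made up for this example, and the path and baseline value are whatever you recorded when the device was first programmed. Note that a whole-partition CRC is only stable for content that never changes (an image file or a read-only partition); a rootfs mounted read-write changes with every write, so its CRC will not match a stored baseline.

```python
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file or block device, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running CRC as its second argument.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def matches_baseline(path, baseline_crc):
    """Compare the current CRC-32 with the value recorded at programming time."""
    return crc32_of(path) == baseline_crc
```

Reading a block device such as /dev/mmcblk0p1 this way normally requires root, and reading a multi-gigabyte partition takes a while, so a weekly schedule seems more realistic than a frequent one.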

You might find the “smartmontools” package of interest. This of course only applies to drives supporting SMART, but it has been very useful in the past. I am not sure what changed in the transition from mechanical hard drives to solid state, but I do think this should work with NVMe. smartctl normally detects the device type on its own; if it does not, “-d nvme” selects NVMe, whereas “-d sat” and “-d ata” are for SATA/ATA drives. “-i /dev/someDriveName” prints the drive's identity information.

eMMC is itself not SATA, and so I don’t know of any way to verify this. It would be interesting to find out if there is some sort of error statistic. Anyone know about error statistics checking for eMMC?

After some investigation, here is the info I gathered (see below). What do you guys think? So far, I could not find any CRC information that I can use, so this is the data I can probably use to determine the “health” of the devices.

For the NVMe, I can use the following commands:

o smartctl -x /dev/nvme0 | grep overall-health
to get an overall health test result. Here is an example output:

   SMART overall-health self-assessment test result: PASSED

o smartctl -x /dev/nvme0 | grep "Critical Warning"
to monitor any warnings. Here is an example output:

    Critical Warning:                   0x00
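
For a scheduled check, the two grep commands above could be folded into one small script. The following is a rough sketch that only relies on the two output lines shown above; the function names (`read_smartctl`, `parse_smartctl`, `nvme_looks_healthy`) are hypothetical, and the exact smartctl output may differ between smartmontools versions.

```python
import re
import subprocess


def read_smartctl(device="/dev/nvme0"):
    """Run 'smartctl -x' and return its text output (usually needs root)."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure to run here.
    result = subprocess.run(["smartctl", "-x", device],
                            capture_output=True, text=True)
    return result.stdout


def parse_smartctl(text):
    """Extract the overall-health verdict and the Critical Warning byte."""
    health = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", text)
    warning = re.search(r"Critical Warning:\s*(0x[0-9a-fA-F]+)", text)
    return {
        "health": health.group(1) if health else None,
        "critical_warning": int(warning.group(1), 16) if warning else None,
    }


def nvme_looks_healthy(text):
    """True only if the verdict is PASSED and no warning bits are set."""
    info = parse_smartctl(text)
    return info["health"] == "PASSED" and info["critical_warning"] == 0


# Sample text copied from the example outputs above.
sample = """SMART overall-health self-assessment test result: PASSED
Critical Warning:                   0x00
"""
```

On the device itself, `nvme_looks_healthy(read_smartctl("/dev/nvme0"))` would give a single pass/fail value to log or alert on.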

For the eMMC, there is a tool called “mmc” (from the mmc-utils package) that reads the current stats for the eMMC.
For example:

o mmc extcsd read /dev/mmcblk0 | egrep "LIFE|EOL"

    eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
    eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
    eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01

These are the definitions I gathered:

Device Life Time Estimation Type A:
    Health status in increments of 10%
    Refers to pSLC blocks in our eMMC

Device Life Time Estimation Type B:
    Health status in increments of 10%
    Refers to MLC blocks in our eMMC

Pre-EOL Information:
    Normal: up to 80% of reserved blocks consumed
    Warning: more than 80% consumed
    Urgent: more than 90% consumed
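
To turn the raw register values into something readable, the definitions above can be applied in a short script. This is a sketch under assumptions: the value-to-meaning mapping (0x01 to 0x0A for each 10% step of life used, 0x01/0x02/0x03 for Normal/Warning/Urgent) follows my reading of the JEDEC eMMC 5.x register descriptions and should be checked against the datasheet of the actual part, and the names `parse_extcsd`, `life_used_percent` and `summarize` are invented for this example.

```python
import re

# Assumed mapping for EXT_CSD_PRE_EOL_INFO.
PRE_EOL = {0x01: "Normal", 0x02: "Warning", 0x03: "Urgent"}


def parse_extcsd(text):
    """Collect '[EXT_CSD_<NAME>]: 0x..' fields from 'mmc extcsd read' output."""
    fields = {}
    pattern = r"\[EXT_CSD_(\w+)\]:\s*(0x[0-9a-fA-F]+)"
    for name, value in re.findall(pattern, text):
        fields[name] = int(value, 16)
    return fields


def life_used_percent(value):
    """Map 0x01..0x0A to a (low, high) percent-of-life-used range.

    Returns None for anything else (for example 0x0B, which is assumed
    to mean the estimated lifetime has been exceeded).
    """
    if 1 <= value <= 10:
        return ((value - 1) * 10, value * 10)
    return None


def summarize(text):
    """Decode the three health fields into a small report dictionary."""
    f = parse_extcsd(text)
    return {
        "life_a": life_used_percent(f.get("DEVICE_LIFE_TIME_EST_TYP_A", 0)),
        "life_b": life_used_percent(f.get("DEVICE_LIFE_TIME_EST_TYP_B", 0)),
        "pre_eol": PRE_EOL.get(f.get("PRE_EOL_INFO", 0), "Unknown"),
    }


# Sample text copied from the example output above.
sample = """eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""
```

With the sample values, both life-time estimates fall in the first 10% band and the Pre-EOL state decodes as Normal, which would match a nearly new device.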

I suspect the smartctl commands will do what you want. I don’t know whether the eMMC info is valid or not (someone from NVIDIA may know).

There wouldn’t be any kind of useful CRC information from directly accessing storage. Any such information would need to come from the controller, which corrects errors before a read/write completes, and I have never even glanced at how this might be made available. Perhaps this is the basis for how the eMMC information you obtained works (counts of internal CRC fixes/interventions)… I don’t know.