[Jetson-TK1] Bad flash (/dev/mmcblk0p1) and how to fix

Last Friday we had a power outage in our lab. When we booted our Jetson-TK1 kit afterwards, we discovered file system errors in the root fs (/dev/mmcblk0p1).

Following the instructions here:
[url]http://elinux.org/Jetson/Cloning[/url]
we read out the entire system.img partition, mounted it, ran e2fsck -c -f -p, and wrote it back to the Jetson-TK1.

While this temporarily fixed the issues we were seeing, the file system still becomes corrupted after 1-2 days of normal usage. This became evident today when we could no longer ssh into the Jetson-TK1: our user was gone, and the local display showed a lot of inode errors. Very interesting…

I think the flash memory is bad, but I was wondering if anyone can come up with suggestions or comments? I am currently running badblocks -w /dev/mmcblk0p1 directly on the Jetson-TK1 (booted from SDMMC), but I am open to input.

If you expect file system errors, you might want to create a “rescue SD card” ahead of time (you already have this, but others might find it useful).

You can unpack the sample rootfs onto an SD card, then run apply_binaries.sh with the “--root=some/sdcard/path” option from your host. Then edit the eMMC copy of “/boot/extlinux/extlinux.conf” (the SD card version of /boot can serve as a backup, but the eMMC /boot is the one actually used for boot) and add a second, non-default entry for the SD card. That entry should be identical to the original except that it names root as “/dev/mmcblk1p1”.
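To illustrate the extlinux.conf edit described above, here is a minimal Python sketch that copies an existing boot entry and points the copy at the SD card. The function name add_sd_entry, the labels “primary” and “sdcard”, and the shortened SAMPLE config are all assumptions made for illustration; your real extlinux.conf will have a longer APPEND line, which the sketch leaves untouched apart from root=.

```python
import re

# Hypothetical, heavily shortened stand-in for /boot/extlinux/extlinux.conf.
# A real file carries more kernel arguments on the APPEND line.
SAMPLE = """TIMEOUT 30
DEFAULT primary

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/zImage
      APPEND console=ttyS0,115200n8 root=/dev/mmcblk0p1 rw rootwait
"""


def add_sd_entry(conf_text, src_label="primary", new_label="sdcard",
                 sd_root="/dev/mmcblk1p1"):
    """Append a copy of the src_label entry whose root= names sd_root.

    The DEFAULT line is not modified, so the new entry stays non-default.
    """
    lines = conf_text.splitlines()

    # Collect the lines belonging to the source entry (LABEL up to next LABEL).
    entry, in_entry = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.upper().startswith("LABEL "):
            in_entry = stripped.split(None, 1)[1] == src_label
        if in_entry:
            entry.append(line)
    while entry and not entry[-1].strip():
        entry.pop()  # drop trailing blank lines of the entry
    if not entry:
        raise ValueError("no entry labelled %r found" % src_label)

    # Rewrite only the label lines and the root= argument in the copy.
    new_entry = []
    for line in entry:
        indent = line[:len(line) - len(line.lstrip())]
        upper = line.strip().upper()
        if upper.startswith("LABEL "):
            line = indent + "LABEL " + new_label
        elif upper.startswith("MENU LABEL"):
            line = indent + "MENU LABEL " + new_label
        elif upper.startswith("APPEND"):
            line = re.sub(r"root=\S+", "root=" + sd_root, line)
        new_entry.append(line)

    return "\n".join(lines).rstrip("\n") + "\n\n" + "\n".join(new_entry) + "\n"
```

Calling print(add_sd_entry(SAMPLE)) shows the original entry followed by the SD card copy. Keep a backup of the real file before writing anything back to it.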

From then on you can power down, insert the SD card, boot to the serial console prompt, pick the SD card entry, and then freely repair or inspect /dev/mmcblk0p1 without the clone hassle. You can even create a more-or-less exact copy of your running system (minus pseudo file systems like /sys and /proc, and with /etc/fstab changes) as a backup. Once booted from the SD card, tools like e2fsck can be run on the eMMC without restriction.

As far as testing goes, if you are concerned that the eMMC hardware is damaged, you could perform a series of clone and re-write cycles and compare each clone’s sha1sum. The checksums will fail to match if the journalling file system is ever mounted, as journal date stamps will change even when it is mounted read-only. The test therefore requires repeated clone/write cycles without the file system ever being mounted. In that case the checksum should be the same for every clone (it’s very important to emphasize that a file system mount of any kind will invalidate the checksums, even on loopback).
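The comparison step could be scripted along these lines. This is only a sketch: it assumes the clones have already been saved on the host as image files, and the names sha1_of_file and clones_match are made up for the example. It reads each image in chunks so a multi-gigabyte clone does not need to fit in memory.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunk_size pieces."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clones_match(paths):
    """Hash every clone image; return (all_equal, {path: digest})."""
    digests = {p: sha1_of_file(p) for p in paths}
    all_equal = len(set(digests.values())) == 1
    return all_equal, digests
```

For example, clones_match(["clone1.img", "clone2.img", "clone3.img"]) (hypothetical file names) returns True as the first element only if every clone hashed identically. A mismatch is only meaningful if none of the images, and not the eMMC partition itself, was mounted between cycles.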