Corrupted SD cards

Based on the inode numbers, the corrupted files are:

/usr/sbin/tee-supplicant
/usr/sbin/ethtool
/usr/sbin/nvfancontrol
/usr/sbin/modinfo
/usr/sbin/ip6tables-legacy

The inodes seem close to each other.

Hi @cozzmy13,

Just to clarify: did you do anything during the first boot shown in your UART log?

I mean, were you running any kind of application? This issue seems pretty easy to reproduce on your side.

Yep, opened qv4l2 and tested image capture. Didn’t save anything to disk myself, but I’m sure the OS saves some stuff to disk by itself.

I also copied a new kernel, modules, and dtb before rebooting.

Hi,

I know you think it is NVIDIA's responsibility to debug this issue, but I really need a way to reproduce it.

Could you share the exact steps for running this qv4l2 application? Any alternative way to reproduce it on your side is also fine; I just need step-by-step instructions.

Sorry, I don’t know much about qv4l2 since I don’t work on that part.

qv4l2 is just a frontend for v4l2, allowing you to capture frames from a camera through a UI. I don’t think it’s even related to the issue; that’s just the workflow I have.

Ok, so if I use any kind of v4l2 pipeline or application on a default image + devkit, you expect this situation to happen too, right?

I don’t think testing v4l2 is even necessary to hit this issue.

You can try copying /boot/Image and /usr/lib/modules/5.10.104-tegra/ and rebooting over and over again.

I use rsync with a few flags so that modules aren’t even copied all the time, only when they change.
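Roughly like this (the exact flags and paths here are an approximation, not my literal command):

# Copy the freshly built kernel, then let rsync skip modules that have not changed.
sudo cp Image /boot/Image
sudo rsync -a --delete lib/modules/5.10.104-tegra/ /usr/lib/modules/5.10.104-tegra/
sync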

Ok, I guess I just need to use dd to write a large file repeatedly and see what happens.
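For example, something along these lines in a loop (the size, count, and output path are just placeholders):

# Write ~512 MB of data, force it out to the SD card, then reboot and check dmesg.
dd if=/dev/urandom of=ddtest.bin bs=1M count=512 conv=fsync
sync
sudo reboot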

/boot/Image is only a 33 MB file, but I sync and then reboot each time after writing it.

Yes, I know. I replace the kernel image almost every day but haven’t hit such an issue… probably not frequently enough.

Incidentally, I did not see the first boot showing any sign of corruption. Shutdown was not logged verbosely enough to say whether something went wrong during shutdown, but sometimes a process hangs and the sync cannot occur (that’s just a guess, though). Then, on the next boot, it skips fsck, and yet when it turns the journal on (which is when journal repair of cached but not completely written content would occur), it does detect a lot of invalid checksums. I can’t be certain, but my guess is that this happens during the previous shutdown, at a stage where the serial console is not logging (maybe someone from NVIDIA knows how to make shutdown serial console logging more verbose?).

It sounds like you are already familiar with inodes. For the inodes which are mentioned, can you run “sudo find / -inum <inode number>” on each of them? Or is this already the way you found the files before?
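For example (the inode numbers below are placeholders; substitute the ones from your error messages):

for inum in 12345 12346 12347; do
    # -xdev keeps find on the root filesystem so /proc and other mounts are skipped
    sudo find / -xdev -inum "$inum"
done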

Yes, that’s how I found them.

Those particular files would not normally be changing; they can change if there is an update. The problem, though, is that this is a “tree” structure in which one or more nodes are owned by a parent node (think of files inside a directory, or bytes within a file; each is just data with a pointer to its owner). So if anything is written in “/usr/sbin”, or in any of its subdirectories, anything which might update a timestamp and so on, it is possible to corrupt the parent and the parent’s child nodes (metadata such as timestamps and permissions are still edits even if the bytes don’t change…something on the disk changes even if it is only metadata).

I think logging of shutdown would need to be more verbose to figure out what is hanging up. I think it is a hung process not letting the filesystem unmount. Maybe someone from NVIDIA could say how to get verbose serial console logging for more of shutdown.

The only thing that comes to mind is that the rootfs has access timestamps (atime) turned on and these files are accessed at boot.
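To check whether atime updates are actually enabled on the rootfs, and to turn them off as an experiment (noatime is just an idea here, not a confirmed fix):

findmnt -no OPTIONS /            # look for atime/relatime/noatime in the mount options
sudo mount -o remount,noatime /  # disable access-time updates until the next reboot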

Now it managed to corrupt the /mnt, /proc, and /sys directories, making the SD card unbootable even after running fsck. This is outrageous.

I am not surprised if usage shows access at boot. What I would be surprised at is the file being written during boot (the file itself having a write timestamp change). Right after a corrected corruption would be the wrong time for this, but while the filesystem is correct and not corrupted, you could take the sha1sum of each of those files. Then, after a reboot, you could compare the sha1sums again (even if the timestamps did not change). This would tell you whether content changed, which is different from a metadata change, especially if something is broken; timestamp and content change behavior is only guaranteed when there is no bug.
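A minimal way to do that comparison (the file list is from the earlier post; the manifest name is just an example):

# While the filesystem is known good, record checksums of the suspect files.
sha1sum /usr/sbin/tee-supplicant /usr/sbin/ethtool /usr/sbin/nvfancontrol \
    /usr/sbin/modinfo /usr/sbin/ip6tables-legacy > before.sha1
# After a reboot, verify them; any file whose content changed is reported as FAILED.
sha1sum -c before.sha1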

I want to differentiate between corrected failures and uncorrected failures. When corrected, parts of the system will go missing because the journal removes them. When issues exceed the journal’s storage size, the failure is uncorrected and results in messages about removing inodes or needing to repair before mounting. A pattern in which the same file or directory is changed each time implies there is a specific process causing the issue. A pattern in which different files and/or directories are hit on each occurrence implies a more widespread issue not necessarily tied to a single process (it could be, but it is much less likely; for example, if a process tied to a specific file hangs, then it is likely that file will corrupt each time…perhaps other content related to it in the parent will also corrupt, but “/mnt” is unrelated to anything in “/usr/sbin”, so it is unlikely to be a single hung process).

A bad SD card can do this. Failing solid state memory tends to behave like this. However, you’ve tried multiple brands of SD card, so this is less likely (not impossible, but unlikely).

On a desktop PC, if there is a brownout condition without a UPS that handles brownout (low but still sufficient voltage), memory writes tend to corrupt. One could be writing a huge file during a brownout and see no apparent error, but accessing the file later would show errors. This is neither a hardware nor a software error, yet it is hardware related, and it is quite difficult to tell sometimes. Jetsons tend to simply shut down if their voltage is not well regulated, so a brownout is unlikely. Still, if there is some reason why there might be sudden power consumption swings, there is a small possibility that this could cause corruption. I really doubt that is the case, though.

Related to the power issue, perhaps a power rail regulator is misbehaving. I think you have stumbled on a symptom via the scp and rsync, but that it was likely not caused directly by software. Other than having more detailed shutdown logs to see when the corruption occurs, there isn’t a lot I can suggest.

If NVIDIA can suggest how to increase serial console logging during shutdown, then this might show a hint.

@cozzmy13 I’m bringing up an Orin Nano dev kit with an SD card and experiencing very similar behavior. In my setup I can boot from the SD card or from a partition on NVMe to repair the corrupted filesystem. The corruption started small and then resulted in failed networking, which is what made me notice it in the first place. After a repair using fsck, the corruption returns and grows.

I found another thread on the forum that mentioned the QSPI needed to be flashed with the same JetPack version as what is on the SD card. I don’t know enough about the QSPI to understand why that would matter, but thought I would connect the threads.

In all my years of using the excellent Jetson boards I’ve never seen these filesystem corruptions on the SD card. It makes me very nervous to ship systems until this issue is resolved or the root cause is identified.

Below is the serial console output when copying from the corrupted SD card to NVMe to create the recovery image:

[  316.904597] EXT4-fs (mmcblk1p1): error count since last fsck: 77
[  316.910818] EXT4-fs (mmcblk1p1): initial error at time 1697612057: ext4_lookup:1706: inode 21240
[  316.919911] EXT4-fs (mmcblk1p1): last error at time 1697635080: ext4_lookup:1706: inode 21240
[  801.215104] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527105: comm cp: iget: checksum invalid
[  801.227148] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527106: comm cp: iget: checksum invalid
[  801.238894] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527107: comm cp: iget: checksum invalid
[  801.250598] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527109: comm cp: iget: checksum invalid
[  905.561904] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21242: comm systemd-tmpfile: iget: checksum invalid
[  906.667289] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21247: comm systemd-tmpfile: iget: checksum invalid
[  906.700142] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21244: comm systemd-tmpfile: iget: checksum invalid
[  997.890842] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527108: comm cp: iget: checksum invalid
[ 1041.428768] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21240: comm cp: iget: checksum invalid
[ 1049.892659] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21233: comm cp: iget: checksum invalid
[ 1049.904235] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21234: comm cp: iget: checksum invalid
[ 1049.915691] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21235: comm cp: iget: checksum invalid
[ 1049.927055] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21236: comm cp: iget: checksum invalid
[ 1049.938431] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21237: comm cp: iget: checksum invalid
[ 1049.949942] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21238: comm cp: iget: checksum invalid
[ 1049.961451] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21239: comm cp: iget: checksum invalid
[ 1050.221998] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21242: comm cp: iget: checksum invalid
[ 1050.236574] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21244: comm cp: iget: checksum invalid
[ 1050.388168] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21247: comm cp: iget: checksum invalid

I saw this in an earlier post, and just wanted to add a comment:

“/proc” and “/sys” are not real files, and it isn’t possible for them to corrupt. Those pseudo-files are drivers pretending to be files, and they exist in RAM whenever the driver loads. Something else is going on. If your SD card has those directories corrupted, then it is because files were copied there which should never be copied as files. These are filesystem types “proc” and “sysfs”, not “ext4”.
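You can confirm this by asking for the filesystem type at each mount point, for example:

findmnt -no FSTYPE /       # ext4
findmnt -no FSTYPE /proc   # proc
findmnt -no FSTYPE /sys    # sysfs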

The “/mnt” would normally be available for any number of purposes. There might be subdirectories created and then other media mounted there; or a mount directly on that location would temporarily make the content become what is on that other media.

Anyone copying from corrupted media to another medium will either get errors or preserve the corruption in the copy.

Jetsons do not have a BIOS. The setup of power rails and clocks and other devices occurs in software. A true full flash of a Jetson flashes not just operating system and bootloader, but also flashes the equivalent of BIOS. On eMMC models there are a lot of partitions, and those partitions contain that content; for SD card models, this content is in QSPI memory on the module itself. It isn’t unusual in the desktop PC world that a motherboard requires a CMOS BIOS flash for newer hardware, e.g., an older BIOS might need update to use faster RAM or a new CPU model. Jetsons are no different, and perhaps slightly touchier in those regards because the boot bring-up is such a “manual” process, and the environment inherited by the Linux kernel can depend on that setup; changing kernel content or configuration can trigger a need to change something in that environment (in the QSPI).

One of the reasons to clone a device either in read-only mode or unmounted is to avoid access errors. When you use dd on an unmounted partition or file, you know the content will not change during the copy. rsync is also good about this, but that’s because it understands details about the filesystem (it isn’t just a blind copy). Even so, rsync may not be as good, since it might have to adjust to something that changes during the backup. The downside of dd and clone is that any preexisting corruption is faithfully reproduced as well…dd and clone copy binary data in an exact bit-for-bit copy without any interpretation.
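As an illustration only (the device name is taken from the logs above; the output path is a placeholder, and the source partition must not be mounted while copying):

# Bit-for-bit clone of the unmounted SD card partition to an image file on other media.
sudo dd if=/dev/mmcblk1p1 of=/path/to/backup/sdcard_p1.img bs=4M status=progress conv=fsync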

I will say that 77 errors on an SD card (or any filesystem) is a lot of errors. One needs to find out why it was corrupted to start with; it isn’t the copy or clone which causes the corruption, but the copy or clone can give you what you started with.

Probably the leading cause of corruption is incorrect shutdown, e.g., pulling power instead of a proper shutdown command. Hung tasks which prevent umount during a proper shutdown can also cause this.

Are you sure? I’m pretty sure those directories actually exist in the rootfs; sysfs and the other pseudo-filesystems are just mounted on top of them. The directory entries themselves are real, and that is what was corrupted.

In the meantime we switched to NVMe and I haven’t had it occur since.