Corrupted SD cards

Ok, so if I use any kind of v4l2 pipeline or application on default image + devkit, you expect this situation would happen too, right?

I don’t think it’s even necessary to test v4l2 together with this issue.

You can try copying /boot/Image and /usr/lib/modules/5.10.104-tegra/ and rebooting over and over again.

I use rsync with a few flags so that modules aren’t even copied all the time, only when they change.

Ok, I guess I just need to use dd to write a large file repeatedly and see what happens.
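A rough sketch of what such a dd write test could look like. The function name `write_test`, the file names, and the pass count are made up for illustration; nothing here comes from NVIDIA tooling.

```shell
#!/bin/sh
# Hypothetical helper: write the same random data several times with dd,
# sync after each pass, and compare checksums against the reference data.
# Usage: write_test TARGET_DIR SIZE_MB PASSES
write_test() {
    dir=$1; size_mb=$2; passes=$3
    src=$(mktemp) || return 1
    # Generate the reference data once, in the default temp directory.
    dd if=/dev/urandom of="$src" bs=1M count="$size_mb" 2>/dev/null
    want=$(sha1sum < "$src" | cut -d' ' -f1)
    i=1
    while [ "$i" -le "$passes" ]; do
        out="$dir/write_test_$i.bin"
        # conv=fsync asks dd to flush the output file before it exits.
        dd if="$src" of="$out" bs=1M conv=fsync 2>/dev/null
        sync
        got=$(sha1sum < "$out" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "pass $i: MISMATCH"
            rm -f "$src"
            return 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
    rm -f "$src"
}
```

An immediate read-back is likely served from the page cache, so the check that matters is probably re-running sha1sum on the written files after a reboot.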

/boot/Image is only a 33MB file. But I sync and then reboot each time after writing it.

Yes, I know. I replace the kernel image almost every day but didn’t hit such an issue… probably not frequently enough.

Incidentally, I did not see the first boot showing any sign of corruption. Shutdown was not logging enough to say if during shutdown something was wrong, but sometimes a process hangs and sync cannot occur (that’s just a guess though). Then, on the next boot, it skips fsck, and yet when it turns on the journal (which is when journal repair of cached but not completely written content would occur), it does detect a lot of invalid checksums. I can’t be certain, but my guess is that this is from the previous shutdown at a stage where serial console is not logging (maybe someone from NVIDIA knows how to make shutdown serial console logging more verbose?).

It sounds like you are already familiar with inodes. For the inodes which are mentioned, can you run “sudo find / -inum <inode number>” on each of them? Or is this already the way you found the files before?

Yes, that’s how I found them.

Those particular files would not normally be changing. They can if there is an update. The problem though is that this is a “tree” structure, and one or more nodes are owned by a parent node (think of files inside of a directory, or bytes within a file; each is just data with a pointer to its owner). So if anything is written in “/usr/sbin”, or in any of its subdirectories, which might update a timestamp and so on, it is possible to corrupt the parent and the parent’s child nodes (metadata such as timestamps and permissions are still edits even if the bytes don’t change…something on the disk changes even if it is only metadata).

I think logging of shutdown would need to be more verbose to figure out what is hanging up. I think it is a hung process not letting the filesystem unmount. Maybe someone from NVIDIA could say how to get verbose serial console logging for more of shutdown.

Only thing that comes to mind is the fact that the rootfs has access timestamps turned on and these files are accessed at boot.

Now it managed to corrupt /mnt, /proc, /sys directories, making the sdcard unbootable even after running fsck. This is outrageous.

I am not surprised if usage shows access at boot. What I would be surprised at is if the file is written during boot (if the file itself has a write timestamp change). After a corrected corruption would be the wrong time for this, but when it is correct and not corrupted, you could take the sha1sum of each of those files. Then, after a reboot, if anything has changed (or even if timestamps did not change), you could compare the sha1sum again. This would tell you if content changed (different than metadata change, especially if something is broken; timestamp and content change behavior is only guaranteed when there is no bug).
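As a minimal sketch of that before/after comparison: the helper names `record_sums` and `verify_sums` are hypothetical, and the manifest is assumed to live somewhere other than the SD card being tested (for example the NVMe partition or a USB stick).

```shell
#!/bin/sh
# record_sums MANIFEST FILE...
# Store the sha1sum of each file while the filesystem is known to be good.
record_sums() {
    manifest=$1; shift
    sha1sum "$@" > "$manifest"
}

# verify_sums MANIFEST
# Re-check after a reboot; --quiet prints only the files that do not match.
verify_sums() {
    sha1sum --quiet -c "$1"
}
```

Run `record_sums` on the files that fsck or find reported earlier, reboot, then run `verify_sums`. A non-zero exit status means at least one file's content changed or could not be read.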

I want to differentiate between corrected failures and uncorrected failures. When corrected, parts of the system will go missing based on the journal removing those. When issues exceed journal storage size, this is uncorrected and will result in messages about removing inodes or needing to repair before mounting. Something which is specific in which the same file or directory is changed each time implies there is a process causing the issue. Something in which different files and/or directories are hit each occurrence implies it is a wider spread issue not necessarily tied to a single process (it could be, but it is much less likely; for example, if one hangs a process which is tied to a specific file, then it is likely that file will corrupt each time…perhaps other content related to it in the parent will also corrupt, but “/mnt” is unrelated to anything in “/usr/sbin”, so it is unlikely to be a single hung process).

A bad SD card can do this. Failing solid state memory tends to behave like this. However, you’ve tried multiple brands of SD card, so this is less likely (not impossible, but unlikely).

On a desktop PC, if there is a brownout condition (low but still sufficient voltage) without a UPS which handles brownout, memory writes tend to corrupt. One could be writing a huge file during a brownout, and there would be no apparent error, but then accessing the file later would show errors. This is neither a hardware nor software error, yet it is hardware related. It is quite difficult to tell sometimes. Jetsons tend to simply shut down if their voltage is not well regulated, so it is unlikely a brownout. Still, if there is some reason why there might be sudden power consumption swings, there is a small possibility that this could cause corruption. I really doubt that is the case though.

Related to the power issue, perhaps a power rail regulator is misbehaving. I think you have stumbled on a symptom from the scp and rsync, but that it is likely to not have been caused directly by software. Other than having more detailed shutdown logs to see when corruption occurs, there isn’t a lot I can suggest.

If NVIDIA can suggest how to increase serial console logging during shutdown, then this might show a hint.

@cozzmy13 I’m bringing up an Orin Nano dev kit with SD card and experiencing very similar behavior. In my setup I can boot from the SD or a partition on NVMe to repair the corrupted file system. The corruption started small and then resulted in failed networking, which is what made me notice in the first place. After a repair using fsck the corruption returns and grows.

I found another thread on the forum that mentioned that the QSPI needed to be flashed with the same JetPack as what is on the SD card. I don’t know enough about the QSPI to understand why that would matter but thought I would connect the threads.

In all my years of using the excellent Jetson boards I’ve never seen these filesystem corruptions on the SD card. Makes me very nervous to ship systems until this issue is resolved or the root cause identified.

Below is serial console output when copying from the corrupted SD to NVMe to create the recovery image:

[  316.904597] EXT4-fs (mmcblk1p1): error count since last fsck: 77
[  316.910818] EXT4-fs (mmcblk1p1): initial error at time 1697612057: ext4_lookup:1706: inode 21240
[  316.919911] EXT4-fs (mmcblk1p1): last error at time 1697635080: ext4_lookup:1706: inode 21240
[  801.215104] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527105: comm cp: iget: checksum invalid
[  801.227148] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527106: comm cp: iget: checksum invalid
[  801.238894] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527107: comm cp: iget: checksum invalid
[  801.250598] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527109: comm cp: iget: checksum invalid
[  905.561904] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21242: comm systemd-tmpfile: iget: checksum invalid
[  906.667289] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21247: comm systemd-tmpfile: iget: checksum invalid
[  906.700142] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21244: comm systemd-tmpfile: iget: checksum invalid
[  997.890842] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #527108: comm cp: iget: checksum invalid
[ 1041.428768] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21240: comm cp: iget: checksum invalid
[ 1049.892659] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21233: comm cp: iget: checksum invalid
[ 1049.904235] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21234: comm cp: iget: checksum invalid
[ 1049.915691] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21235: comm cp: iget: checksum invalid
[ 1049.927055] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21236: comm cp: iget: checksum invalid
[ 1049.938431] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21237: comm cp: iget: checksum invalid
[ 1049.949942] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21238: comm cp: iget: checksum invalid
[ 1049.961451] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21239: comm cp: iget: checksum invalid
[ 1050.221998] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21242: comm cp: iget: checksum invalid
[ 1050.236574] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21244: comm cp: iget: checksum invalid
[ 1050.388168] EXT4-fs error (device mmcblk1p1): ext4_lookup:1706: inode #21247: comm cp: iget: checksum invalid
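For anyone wanting to map a log like this back to paths, here is a small sketch that pulls the unique inode numbers out of the “inode #N:” error lines. The function name `inodes_from_log` is made up, and the pattern only assumes the message format shown above.

```shell
#!/bin/sh
# inodes_from_log: read kernel log text on stdin and print each inode number
# that appears as "inode #N:" exactly once, sorted numerically.
inodes_from_log() {
    sed -n 's/.*inode #\([0-9][0-9]*\):.*/\1/p' | sort -n -u
}

# Possible use on the affected system (inode numbers are per filesystem,
# so -xdev keeps find on the filesystem it starts from):
#   dmesg | inodes_from_log | while read -r n; do
#       sudo find / -xdev -inum "$n"
#   done
```

If the root filesystem is not the corrupted one (for example when booted from NVMe with the SD card mounted elsewhere), the `find` start point would need to be that mount point instead of `/`.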

I saw this in an earlier post, and just wanted to add a comment:

“/proc” and “/sys” are not real files, and it isn’t possible for them to corrupt. Those pseudo-files are drivers pretending to be files, and they exist in RAM whenever the driver loads. Something else is going on. If your SD card has those directories corrupted, then it is because files were copied there which should never be copied as files. These are filesystem types “proc” and “sysfs”, not “ext4”.

The “/mnt” would normally be available for any number of purposes. There might be subdirectories created and then other media mounted there; or a mount directly on that location would temporarily make the content become what is on that other media.

Anyone copying from a corrupted media to another media will get errors or corruption which is preserved.

Jetsons do not have a BIOS. The setup of power rails and clocks and other devices occurs in software. A true full flash of a Jetson flashes not just operating system and bootloader, but also flashes the equivalent of BIOS. On eMMC models there are a lot of partitions, and those partitions contain that content; for SD card models, this content is in QSPI memory on the module itself. It isn’t unusual in the desktop PC world that a motherboard requires a CMOS BIOS flash for newer hardware, e.g., an older BIOS might need update to use faster RAM or a new CPU model. Jetsons are no different, and perhaps slightly touchier in those regards because the boot bring-up is such a “manual” process, and the environment inherited by the Linux kernel can depend on that setup; changing kernel content or configuration can trigger a need to change something in that environment (in the QSPI).

One of the reasons to clone a device in either read-only mode, or unmounted, is to avoid access errors. When you use dd of an unmounted partition or file, then you know the content will not change during the copy. rsync is also good about this, but that’s because it understands details about the filesystem (it isn’t just a blind copy). Even so, rsync may not be as good since it might have to adjust to something that changes during backup. The down side of dd and clone is that if there is a preexisting corruption, then this too is faithfully created…dd and clone are copying binary data in an exact bit-for-bit copy without any interpretation.
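To illustrate the “exact bit-for-bit copy” point, a minimal sketch of a dd clone followed by a byte comparison. The helper name `clone_and_verify` and the example paths are hypothetical; the device name is only borrowed from the log earlier in this thread.

```shell
#!/bin/sh
# clone_and_verify SRC DST
# Copy SRC to DST with dd, then confirm with cmp that the two are identical.
# Returns 0 only if the copy succeeded and every byte matches.
clone_and_verify() {
    src=$1; dst=$2
    dd if="$src" of="$dst" bs=4M conv=fsync 2>/dev/null || return 1
    cmp -s "$src" "$dst"
}

# Example from a root shell, with the partition unmounted first:
#   umount /dev/mmcblk1p1
#   clone_and_verify /dev/mmcblk1p1 /path/on/nvme/sdcard_p1.img
```

A successful compare only says the image matches the source. If the source already holds corrupt metadata, so does the image.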

I will say that 77 errors on an SD card (or any filesystem) is a lot of errors. One needs to find out why it was corrupted to start with; it isn’t the copy or clone which causes the corruption, but the copy or clone can give you what you started with.

Probably the leading cause of corruption is incorrect shutdown, e.g., pulling power instead of a proper shutdown command. Hung tasks which prevent umount during a proper shutdown can also cause this.

Are you sure? I’m pretty sure that these directories actually exist in the rootfs, but the sysfs & others are mounted inside them. The directory entries themselves are real. Which is what was corrupted.

In the meantime we switched to NVMe and I haven’t had it occur.

This card has not been subjected to any power cycles without proper shutdown so I suspect that it’s a hung task. Is there a good way to check this?

Over the last 10 years I’ve seen Jetsons survive 10s if not 100s of hard power cycles without a single filesystem corruption on the SD card. So it seems odd that it suddenly happens frequently. Perhaps it’s a bad batch of SD cards. I’ll see if I can rule that out by testing some other batches.

If you go to your host PC where you flashed the content, you will have this directory:

Within that, go to subdirectory “rootfs/sys/”. Run ls. Is there content there? This is what creates any actual disk content during a flash. Anything not there should be dynamically generated at runtime. The runtime content is in RAM and does not exist on the disk.
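A small sketch of that emptiness check, so it can be scripted rather than eyeballed. The function name `mountpoint_is_empty` is made up, and the directory to pass in is whatever your flash directory's “rootfs/sys” path is.

```shell
#!/bin/sh
# mountpoint_is_empty DIR
# Returns 0 if DIR exists and has no entries (including hidden ones),
# 1 if it contains anything, 2 if DIR is not a directory.
mountpoint_is_empty() {
    [ -d "$1" ] || return 2
    # -quit stops at the first entry found; any output means "not empty".
    [ -z "$(find "$1" -mindepth 1 -maxdepth 1 -print -quit)" ]
}

# Example, run from the flash directory on the host PC:
#   mountpoint_is_empty rootfs/sys && echo "empty, as expected"
```

On a running Jetson, “/sys” already has sysfs mounted on top of it, so checking the underlying directory would presumably need a second view of the root filesystem, for example a bind mount of “/” onto a temporary directory.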

If for some reason there does happen to be a file in “/sys”, then if one mounts another filesystem there (regardless of whether it is real, e.g., ext4, or virtual, e.g., tmpfs), then the old content is hidden until umount of that filesystem which sits on top of it. Did your flash software on the host PC have content in the “rootfs/sys/”?

Please take a second to understand what I said. I know that sysfs is a virtual filesystem and is not backed on storage, but the /sys directory itself is. Not its contents. The /sys directory is what got corrupted.

If something were written to “/sys” before the pseudo filesystem was mounted there, then this could corrupt the SD card part of it. As soon as “/sys” mounts, that part of the SD card would become invisible. It isn’t possible for the pseudo “/sys” to become corrupted unless the actual kernel code is altered. The part which I have never considered before is that if (A) the underlying ext4 filesystem of the SD card is corrupted, and then (B) a pseudo filesystem is mounted over this, what would happen if “/sys” were mounted over a corrupt ext4 mount point? I could see the latter appearing as a corrupt “/sys” via the corrupt mount point, but otherwise the only way I could see “/sys” being corrupt is through altering of the kernel itself.

If you were to monitor “dmesg --follow” on a different Linux host PC, and then insert the SD card, does it show corruption? If so, does it show this as repairable? If repair is successful, does use as an SD card on the Jetson cause it to recorrupt (I’d get a full serial console boot log the first time a repaired SD card is used on the Jetson)?

In this case I’ll reiterate what is very important: The mount point of “/sys” exists on the ext4 filesystem, and is part of the sample rootfs (which on the host PC which flashes is “Linux_for_Tegra/rootfs/sys/”). This should be empty on any SD card. This does serve as a mount point for the pseudo filesystem. Somehow we have to distinguish between whether the empty mount point is corrupt, and thus causing “/sys” to inherit that corruption, versus whether something is saying “/sys” itself is corrupt. The pseudo filesystem is not ext4, and some of the concepts of ext4 corruption are alien to the sysfs filesystem type. In no case is anything from a sysfs filesystem ever saved to disk. I do not know if a corrupt mount point which a sysfs filesystem is mounted to would inherit corruption.
