Corrupted SD cards

Hello. I’ve been experiencing corruptions on multiple types and brands of SD cards on the Orin Nano developer kit. The corruptions happen after safe shutdowns or reboots, without unplugging the power cable. My workflow involves copying a new kernel and modules via SSH (scp and rsync), running the sync command, and then the reboot command. I get corruptions every so often and it’s very frustrating having to reflash the board each time.

I can see similar topics on this forum such as:

@linuxdev mentioned grabbing some logs in that topic, so here they are.

syslog (10.1 MB)
kern.log (3.9 MB)
dmesg (94.3 KB)

Hi,

You should use this method to dump the log.

syslog and dmesg cannot capture the error log after the system has been corrupted.

Also, could you reflash the whole board directly with just the pure image from SDK Manager, instead of your kernel and dtb?

You should use this method to dump the log.

If you look at the logs I sent you, they contain the corruption. My serial console shows exactly the same thing.

Also, could you reflash the whole board directly with just the pure image from SDK Manager, instead of your kernel and dtb?

That’s what I do each time it corrupts, and then I put my kernel on it, which is just the NVIDIA kernel with patches for our V4L2 I2C devices.

What do I use the board for if I don’t put my software on it?

Hi,

The point here is to clarify the error situation first:

  1. Check whether the pure image hits this issue or not.

  2. Confirm whether this issue is related to your patch.

The corruption is in the filesystem, but your patch is related to V4L2, which sounds a little unrelated.
That is why we need to clarify whether this is really related to your kernel, or whether even the pure image can hit the issue.

There are a few more topics that report this issue. Do you really think the problem is on my side? Checking with a pure image is something someone from NVIDIA could do too, since there have been multiple reports of this happening.

Hi,

Yes, I am aware of the other reports out there. The problem is that I have already tried multiple times but failed to reproduce the issue.

That is why I need to clarify the exact steps to reproduce it.

The issue probably happens only when there is some filesystem interaction. My guess is that you stress-tested reboots, but didn’t do any filesystem operations in between.

You could probably replicate this by writing a few hundred megabytes to the SD card in between reboots.
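A minimal version of that stress loop (purely a sketch; the sizes and paths are arbitrary placeholders, not something I have tested on the board) might look like:

```shell
#!/usr/bin/env bash
# Sketch of a reboot stress test that also exercises the filesystem.
# Sizes and paths are arbitrary placeholders; run on the Jetson itself.
set -euo pipefail

# Write ~300 MB of fresh data onto the SD-card rootfs.
for i in 1 2 3; do
    dd if=/dev/urandom of="/var/tmp/stress_$i.bin" bs=1M count=100 status=none
done

sync            # flush the page cache to the card
sudo reboot     # re-run this script after each boot (systemd unit, cron @reboot, ...)
```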

Hi,

Could you also share your test method based on the V4L2 patch?
We could try it again.

What do you mean?

I mean I need you to kindly share your exact use case.

We would try your case and also run other kinds of stress tests.

Copy kernel Image, copy modules, copy dtbs → reboot → open the qv4l2 app, check whether the camera has an image, rinse and repeat. All disk operations involved are just copies over SSH. If you want the exact source code I can link it, but I don’t think it makes a difference.
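For reference, a sketch of that loop might look like the following (the hostname and build paths are placeholders, not the exact ones I use):

```shell
#!/usr/bin/env bash
# Sketch of the deploy-and-reboot loop; hostname and paths are placeholders.
set -euo pipefail

JETSON=user@jetson.local          # assumed hostname
OUT=./kernel_out                  # assumed kernel build output directory

# Copy the rebuilt kernel, modules, and device trees over SSH.
scp "$OUT/arch/arm64/boot/Image"             "$JETSON:/tmp/Image"
rsync -a "$OUT/modules/lib/modules/"         "$JETSON:/tmp/modules/"
scp "$OUT"/arch/arm64/boot/dts/nvidia/*.dtb  "$JETSON:/tmp/dtb/"

# Install on the target, flush writes to the SD card, then reboot.
ssh "$JETSON" 'sudo cp /tmp/Image /boot/Image &&
               sudo rsync -a /tmp/modules/ /lib/modules/ &&
               sudo cp /tmp/dtb/*.dtb /boot/dtb/ &&
               sync && sudo reboot'
```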

So basically it is a process that captures frames and stores them on the SD card? Or do you not even save files to the SD card?

Some workarounds for now:

  1. At this moment, please also try other kinds of SD cards.

  2. Please run sudo fsck -y /dev/mmcblk1p1 before the system shutdown.

I don’t store it.


Can’t do that on the live system, because the partition is mounted (it is the root partition).
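Since the root filesystem can’t be checked while mounted, one common alternative (a sketch; on Jetson the kernel command line is the APPEND line in /boot/extlinux/extlinux.conf) is to force a check on the next boot instead:

```shell
# Option 1: systemd honours fsck.mode=force on the kernel command line.
# Append it to the APPEND line in /boot/extlinux/extlinux.conf:
sudo sed -i '/^[[:space:]]*APPEND/ s/$/ fsck.mode=force/' /boot/extlinux/extlinux.conf

# Option 2 (older mechanism, still honoured by systemd-fsck):
sudo touch /forcefsck

sudo reboot   # the check runs before the rootfs is remounted read-write
```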

I’ve tried various brands: SanDisk, Kingston, Samsung. It happened with all of them.


Here’s the result of running fsck on the SD card on my host machine.

sudo fsck -y /dev/sdb1
fsck from util-linux 2.39.2
e2fsck 1.47.0 (5-Feb-2023)
/dev/sdb1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 798929 seems to contain garbage.  Clear? yes
Inode 798930 seems to contain garbage.  Clear? yes
Inode 798931 seems to contain garbage.  Clear? yes
Inode 798932 seems to contain garbage.  Clear? yes
Inode 798933 seems to contain garbage.  Clear? yes
Inode 798934 seems to contain garbage.  Clear? yes
Inode 798935 seems to contain garbage.  Clear? yes
Inode 798936 seems to contain garbage.  Clear? yes
Inode 798937 seems to contain garbage.  Clear? yes
Inode 798938 seems to contain garbage.  Clear? yes
Inode 798939 seems to contain garbage.  Clear? yes
Inode 798940 seems to contain garbage.  Clear? yes
Inode 798941 seems to contain garbage.  Clear? yes
Inode 798942 seems to contain garbage.  Clear? yes
Inode 798943 seems to contain garbage.  Clear? yes
Inode 798944 seems to contain garbage.  Clear? yes
Pass 2: Checking directory structure
Entry 'nvphs' in /var/lib (786436) has deleted/unused inode 798936.  Clear? yes
Entry 'nvpmodel' in /var/lib (786436) has deleted/unused inode 798932.  Clear? yes
Entry 'NetworkManager-intern.conf' in /var/lib/NetworkManager (786920) has deleted/unused inode 798940.  Clear? yes
Entry 'dbfef1aa0b064bcf9d30ec3ad0886edb-device-volumes.tdb' in /var/lib/gdm3/.config/pulse (794656) has deleted/unused inode 798935.  Clear? yes
Entry 'config.dat-old' in /var/cache/debconf (798684) has deleted/unused inode 798931.  Clear? yes
Entry 'installer' in /var/log (798900) has deleted/unused inode 798942.  Clear? yes
Entry 'oem-config.log' in /var/log (798900) has deleted/unused inode 798944.  Clear? yes
Entry 'auth.log' in /var/log (798900) has deleted/unused inode 798930.  Clear? yes
Entry 'status' in /var/lib/nvfancontrol (798890) has deleted/unused inode 798937.  Clear? yes
Entry 'settings' in /var/lib/bluetooth/B4:8C:9D:34:D2:DA (798927) has deleted/unused inode 798929.  Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 786436 ref count is 72, should be 70.  Fix? yes
Inode 798900 ref count is 6, should be 5.  Fix? yes
Unattached inode 798945
Connect to /lost+found? yes
Inode 798945 ref count is 2, should be 1.  Fix? yes
Pass 5: Checking group summary information
Block bitmap differences:  -3154287 -3154422 -3154430 -3179011 -3179017 -(3179126--3179128) -3179148 -3179191 -3189283 -(3189666--3189667) -3189669 -3244156 -3244160 -3244279 -(3480896--3480942) -(3491515--3491564) -4230658 -(4230700--4230710) -(4230812--4230823) -(4259918--4259922) -(4260929--4260930) -4402688 -4405248 -14191693 -14191697
Fix? yes
Free blocks count wrong for group #96 (23993, counted=23996).
Fix? yes
Free blocks count wrong for group #97 (10316, counted=10327).
Fix? yes
Free blocks count wrong for group #99 (4877, counted=4880).
Fix? yes
Free blocks count wrong for group #106 (7908, counted=8005).
Fix? yes
Free blocks count wrong for group #129 (0, counted=24).
Fix? yes
Free blocks count wrong for group #130 (0, counted=7).
Fix? yes
Free blocks count wrong for group #134 (23670, counted=23672).
Fix? yes
Free blocks count wrong for group #433 (2194, counted=2196).
Fix? yes
Free blocks count wrong (26574840, counted=26574989).
Fix? yes
Inode bitmap differences:  -(798929--798937) -798940 -(798942--798944)
Fix? yes
Free inodes count wrong for group #97 (6578, counted=6591).
Fix? yes
Directories count wrong for group #97 (334, counted=331).
Fix? yes
Free inodes count wrong (7580445, counted=7580458).
Fix? yes


/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb1: 177366/7757824 files (0.1% non-contiguous), 4441971/31016960 blocks

In addition to a full serial console boot log (which is what @WayneWWW is asking for), I saw something of interest in one of the logs, and I have a question for you about it:

Aug 31 23:57:51 ubuntu kernel: [  171.394958] EXT4-fs (mmcblk1p1): resizing filesystem from 14417920 to 31016960 blocks
Aug 31 23:57:51 ubuntu dhcpd[9136]: DHCPDISCOVER from 9e:3b:1b:13:2d:3a via l4tbr0: network 192.168.55.0/24: no free leases
Aug 31 23:57:52 ubuntu kernel: [  172.325350] EXT4-fs (mmcblk1p1): resized filesystem to 31016960
Aug 31 23:57:53 ubuntu nv-late-init.sh[9529]: Filesystem at /dev/mmcblk1p1 is mounted on /; on-line resizing required
Aug 31 23:57:53 ubuntu nv-late-init.sh[9529]: old_desc_blocks = 7, new_desc_blocks = 15
Aug 31 23:57:53 ubuntu nv-late-init.sh[9529]: The filesystem on /dev/mmcblk1p1 is now 31016960 (4k) blocks long.

This log would occur upon first boot after an installation. The SD card will try to resize only on the first boot, when there is more space available on the card. Is the corruption always after a flash? Or have a few reboots occurred prior to the corruption? I’m thinking this log was just left over from some previous boot, in which case it isn’t relevant; but if it is from a recent flash, then it is relevant.

The other question is how the SD card was prepared. I’m assuming this is the rootfs (O/S) running on the SD card (there isn’t any eMMC on an Orin Nano dev kit), but you could have generated it with the flash software, or it could have been taken from a preexisting image which was written to the SD card. If the image it was taken from is itself corrupt, then the SD card and boot wouldn’t actually be the cause of the corruption (one can loopback test an image to see whether it is corrupt). If the image was generated, then it should still be on the Linux PC, and that can be loopback tested. Should the image this was created from show as not corrupt, then the corruption has to come from the Jetson. Should the Jetson be the cause, then it matters whether the corruption is tied to first boot (which is what the log excerpt is from, though it might not have been a recent boot), because that cause would be different than corruption from a later boot.
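For reference, loopback-testing an image might look like this (the image filename is a placeholder):

```shell
# Attach the image read-only (-r) with partition scanning (-P),
# then run a no-modify check (-n answers "no" to every prompt).
LOOPDEV=$(sudo losetup --find --show -r -P system.img)   # e.g. /dev/loop0
sudo fsck -n "${LOOPDEV}p1"
sudo losetup -d "$LOOPDEV"                               # detach when done
```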

The goal of a serial console log is to catch what causes the corruption as well as when the boot detects corruption. If for example there is a software failure during a normal shutdown, that means the serial console will have to contain at least the shutdown content from the previous boot. One could boot up normally, if there is no corruption yet, start the serial console log, and reboot (which should then catch the corruption in boot stages which dmesg won’t show).
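One way to capture such a log is to leave a terminal program logging the serial console across the reboot (the device node below is an assumption; check dmesg after plugging in the USB-serial adapter):

```shell
# picocom (>= 2.0) can log everything it receives to a file:
picocom -b 115200 --logfile uart.log /dev/ttyUSB0

# Alternatively, minicom has a built-in capture option:
# minicom -D /dev/ttyUSB0 -b 115200 -C uart.log
```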

Incidentally, when things are working, what is the output of “df -H -T” and “lsblk -f”? If resizing failed I would expect a different result than if resizing has succeeded.

The corruption happens many reboots after the first flash.

SDK Manager.

I’ll try to get the serial console logs when it happens again.

Resizing doesn’t fail.

Here’s the serial console log containing the previous boot and the boot at which the SD card appeared corrupted.

uart.log (219.9 KB)