Jetson TX2i boot-loop

I used JetPack 4.2 to flash my Jetson TX2i.
After a few minutes of normal operation, I unplugged the Jetson and then reconnected it.
The kernel boots but fails to mount the ext4 filesystem on /dev/mmcblk0p1 (I’m not using U-Boot).
I managed to clone this partition, and it did indeed contain a corrupted ext4 filesystem with a bad superblock CRC (which is what prevented the mount).
fsck.ext4 also indicates that:
Superblock needs_recovery flag is clear, but journal has data.
and when running fsck.ext4 on my PC with the -y flag I get a lot of:
Free blocks count wrong for group #XX
I could not reproduce this problem with other cards, and I don’t know why it happened or how to prevent it (my system will sometimes experience hard, non-soft shutdowns).
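
For anyone who wants to look at the needs_recovery state directly on a clone, here is a minimal Python sketch that decodes a few fields of the primary superblock. It assumes a raw (non-sparse) image with the superblock at byte offset 1024. The function names are made up for illustration, and the field offsets are taken from the standard ext4 on-disk layout as I understand it, so treat this as a rough aid rather than a replacement for dumpe2fs or fsck.ext4.

```python
import struct

SUPERBLOCK_OFFSET = 1024     # primary superblock starts 1024 bytes into the partition
SUPERBLOCK_SIZE = 1024
EXT4_MAGIC = 0xEF53
STATE_CLEAN = 0x0001         # s_state bit: filesystem was cleanly unmounted
STATE_ERRORS = 0x0002        # s_state bit: errors were detected
COMPAT_HAS_JOURNAL = 0x0004
INCOMPAT_RECOVER = 0x0004    # the "needs_recovery" flag
RO_COMPAT_METADATA_CSUM = 0x0400


def read_superblock(image_path):
    """Return the raw 1024-byte primary superblock of a raw image (hypothetical helper)."""
    with open(image_path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(SUPERBLOCK_SIZE)
    if len(sb) != SUPERBLOCK_SIZE:
        raise ValueError("image too small to contain a superblock")
    return sb


def parse_superblock_flags(sb):
    """Decode magic, state and feature flags from a raw superblock."""
    magic, state = struct.unpack_from("<HH", sb, 0x38)
    compat, incompat, ro_compat = struct.unpack_from("<III", sb, 0x5C)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock (magic 0x%04x)" % magic)
    return {
        "state_clean": bool(state & STATE_CLEAN),
        "state_errors": bool(state & STATE_ERRORS),
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
        "metadata_csum": bool(ro_compat & RO_COMPAT_METADATA_CSUM),
    }
```

Something like parse_superblock_flags(read_superblock("backup.img.raw")) would then show whether needs_recovery is set on the clone. The file name here is only an example.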

This is, of course, a bad way to shut down. Often, if there isn’t too much unwritten data, the journal will recover: you’ll be missing some content, but the ext4 filesystem won’t be corrupt. Then there is the case where there is more unwritten data than the journal can handle, or where the missing data was important for boot.

Are you running “fsck.ext4” against a loopback device covering the cloned partition? The first thing to realize is that if you have a backup file to repair via loopback, then you can make a copy of that file and do whatever testing you want on the copy without putting the original clone at risk. Do you have enough disk space on the host PC to create a copy of the clone?

If you are instead running this against an SD card or some other removable media, then that media might be going bad. A superblock that fails to write may imply that wear leveling has reached the end of its life on a solid state memory device. This applies only to actual solid state devices, not to loopback devices. The actual eMMC of the device the clone came from would not be at issue. More information is needed on how the image is stored.

I cloned the APP partition using:
sudo ./flash.sh -r -k APP -G backup.img jetson-tx2i mmcblk0p1
and ran fsck.ext4 on backup.img on my host PC.
I know I can avoid this problem in many ways, like mounting the ext4 as read-only and mapping UDA and working from there.
The thing is that I’m trying to understand what caused this specific problem, since this is a new Jetson TX2i (so I assume the eMMC wear leveling is nowhere near end of life) and no heavy writes were being done (the Jetson was more or less idle).
I want to understand why I got this bad CRC.
I know the superblock stores some metadata of the filesystem, yet I would guess that when the superblock changes (for example due to a change in the free blocks count), the ext4 implementation calculates the new superblock contents and CRC and writes the block out in its entirety (since it sits within a NAND erase block).
So this error is unclear to me.
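
To make the "bad CRC" concrete, here is a rough Python sketch of how the superblock checksum can be recomputed from a raw 1024-byte superblock. It rests on my understanding of the ext4 on-disk format when the metadata_csum feature is enabled: the last four bytes of the superblock (offset 0x3FC) hold a CRC32C over the preceding 1020 bytes, seeded with 0xFFFFFFFF and stored without a final inversion. The function names are hypothetical, and whether the JetPack 4.2 image actually has metadata_csum enabled is an assumption that the code checks rather than takes for granted.

```python
import struct

RO_COMPAT_METADATA_CSUM = 0x0400
CHECKSUM_OFFSET = 0x3FC      # s_checksum: last 4 bytes of the 1024-byte superblock


def _make_crc32c_table():
    """Build the byte-wise lookup table for reflected CRC32C (Castagnoli)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c_raw(data, seed=0xFFFFFFFF):
    """CRC32C with the given seed and NO final inversion."""
    crc = seed
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def superblock_checksum_ok(sb):
    """Compare the stored superblock checksum with a freshly computed one."""
    (ro_compat,) = struct.unpack_from("<I", sb, 0x64)
    if not ro_compat & RO_COMPAT_METADATA_CSUM:
        raise ValueError("metadata_csum is not enabled; no superblock checksum to verify")
    (stored,) = struct.unpack_from("<I", sb, CHECKSUM_OFFSET)
    return stored == crc32c_raw(sb[:CHECKSUM_OFFSET])
```

Because the checksum covers every byte before it, a single stale or partially written field anywhere in those 1020 bytes is enough to produce a mismatch. The check by itself does not say which field is wrong or how it got that way.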

I may just be confused by missing some steps, but you’d need to run fsck.ext4 on the loopback device covering the backup.img.raw (when I see “backup.img” I think of it as a sparse file). Can we verify if your fsck was against a loopback device covering the raw image instead of sparse image?
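
One quick way to tell which kind of file you are pointing fsck at is to look at the magic numbers. The Python sketch below is only an illustration: it assumes the sparse file uses the Android-style sparse image format (magic 0xED26FF3A in the first four bytes), which I have not verified for the file flash.sh produces, and that a raw ext4 image carries the ext magic 0xEF53 at byte offset 1080. The function name classify_image is made up.

```python
import struct

ANDROID_SPARSE_MAGIC = 0xED26FF3A    # assumed sparse format: first 4 bytes, little-endian
EXT4_MAGIC = 0xEF53
EXT4_MAGIC_OFFSET = 1024 + 0x38      # superblock offset plus s_magic offset


def classify_image(path):
    """Return 'sparse', 'raw-ext' or 'unknown' based on magic numbers only."""
    with open(path, "rb") as f:
        head = f.read(EXT4_MAGIC_OFFSET + 2)
    if len(head) >= 4:
        (magic32,) = struct.unpack_from("<I", head, 0)
        if magic32 == ANDROID_SPARSE_MAGIC:
            return "sparse"
    if len(head) >= EXT4_MAGIC_OFFSET + 2:
        (magic16,) = struct.unpack_from("<H", head, EXT4_MAGIC_OFFSET)
        if magic16 == EXT4_MAGIC:
            return "raw-ext"
    return "unknown"
```

If this reports "sparse" for the file that was given to fsck.ext4, the errors fsck printed say nothing reliable about the filesystem on the eMMC, and the check should be repeated on the raw image.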

There are limits to what a journal can recover. I suppose there are ways to increase the journal size, but this would in no way prevent data loss (only corruption of the ext4 filesystem would be prevented), and power off would still risk both data loss and corruption if too much unwritten data is outstanding at the moment of power loss.

I could not tell you about the specific case of the TX2i’s superblock, but if we are looking at an SD card, then there is an implication that the memory used for the superblock was unable to be written correctly, and thus the other CRC checksum issues could not be corrected. If wear leveling on an SD card has reached the end of its abilities, then this is one reason why an SD card would fail with this.

Someone from NVIDIA with knowledge of the internal workings of the TX2i’s eMMC would need to comment on why there might be an unrecoverable superblock issue.

Can you explain a bit more about that? Are you talking about an SD card?
We routinely pull power suddenly, and we have not seen eMMC filesystem corruption. We have only very rarely heard of filesystem corruption, but it is possible. It has nothing to do with the hardware, or with the TX2i in particular. I will check with the MMC experts whether there is anything more we can add to this.

I am talking about the eMMC in the TX2i. Since we did nothing special with the Jetson that got corrupted, we took another Jetson TX2i (flashed using the same JetPack) and tested it with many forced power pulls.
We pulled power a few seconds after power-up (so it would shut down during boot) and also after 30, 45, or 60 seconds (chosen randomly) so it would shut down after boot.
This was done over 20,000 times without reproducing the bug.

Corruption would depend on the amount of unwritten data exceeding journal size/playback. A system which has not written data for some time before loss of power might not lose anything other than say a slight bit of a log file tail. A system which was writing significant data at the time of failure is more likely to corrupt. So it just depends on the nature of the journal size and the unwritten data size…try purposely writing several very large files simultaneously and then pulling power…this might result in requiring fsck.ext4 even after journal recovery. I’ve never done it, but tune2fs could in theory create a larger journal which would be more resilient.
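
As a rough idea of what such a test load could look like, here is a small Python sketch that writes several large files at the same time from separate threads. The function names, file names, file count and sizes are all made up for illustration. It deliberately never calls fsync(), so a good part of the data is likely to still be sitting unwritten in the page cache when power is pulled.

```python
import os
import threading


def write_large_file(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of random data to path in chunks, without fsync()."""
    chunk = os.urandom(chunk_bytes)
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
    return written


def stress_writes(directory, file_count=4, bytes_per_file=1 << 30):
    """Write file_count large files concurrently and return their paths."""
    paths = [os.path.join(directory, "stress_%d.bin" % i) for i in range(file_count)]
    threads = [
        threading.Thread(target=write_large_file, args=(p, bytes_per_file))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths
```

On the Jetson this would be pointed at a directory on the eMMC root filesystem, for example stress_writes("/home/nvidia/stress") with that path being just a placeholder, and power would be cut while the threads are still running.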

I’ve run these power-down tests while the Jetson was continuously writing big files, with no success in reproducing the bug.
The Jetson that experienced the bug was not running anything like that anyway…

The gist of this is that corruption from incorrect power down is not actually a bug. The journal is a safety tool with limitations, and is doing what it should. When the ability of the journal is exceeded, then the operating system is intended to refuse writing to that filesystem to avoid further corruption. Incorrect shutdown is the actual bug.