Jetson TX2i boot-loop

I used JetPack 4.2 to flash my Jetson TX2i.
After a few minutes of normal operation, I unplugged the Jetson and then reconnected it.
The kernel boots but fails to mount the ext4 file system on /dev/mmcblk0p1 (I’m not using U-Boot).
I managed to clone this partition, and it did indeed contain a corrupted ext4 file system with a bad superblock CRC (which is what prevented the mount).
fsck.ext4 also indicates that:
Superblock needs_recovery flag is clear, but journal has data.
and when running fsck.ext4 on my PC with the -y flag I get a lot of:
Free blocks count wrong for group #XX
I could not reproduce this problem with other cards, and I don’t know why it happened or how to prevent it (my system will sometimes experience non-graceful shutdowns).

This is of course a bad way to shut down. Often, if there isn’t too much unwritten data, the journal will recover: you’ll be missing some content, but the ext4 filesystem won’t be corrupt. Then there is the case where there is more unwritten data than the journal can handle, or where the missing data was important for boot.

Are you running “fsck.ext4” against a loopback device covering the cloned partition? The first thing to realize is that if you have a backup file to repair via loopback, you can make a copy of that file and run whatever tests you want on the copy without putting the clone itself at risk. Do you have enough disk space on the host PC to create a copy of the clone?
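
A minimal sketch of that precaution in Python, assuming the clone is an ordinary file on the host. The names `sha256_of` and `make_scratch_copy` are hypothetical helpers, not part of any NVIDIA tool; the idea is only to copy the clone and confirm the copy is byte-identical before letting fsck modify it.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_scratch_copy(clone_path, scratch_path):
    """Copy the clone, then verify the copy matches before experimenting on it."""
    shutil.copyfile(clone_path, scratch_path)
    if sha256_of(clone_path) != sha256_of(scratch_path):
        raise IOError("scratch copy differs from the clone")
    return scratch_path
```

Any repair attempt would then be pointed at the scratch file, so the original clone stays available for a second try.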

If you are instead running this against an SD card or some other removable media, then that media might be going bad. A superblock that fails to write may imply that wear leveling on the solid state memory has reached the end of its life. This applies only to actual solid state devices, not to loopback devices. The eMMC of the device the clone came from would not be at issue. More information is needed on how the image is stored.

I cloned the APP partition using:
sudo ./flash.sh -r -k APP -G backup.img jetson-tx2i mmcblk0p1
and ran fsck.ext4 on backup.img on my host PC.
I know I can avoid this problem in many ways, like mounting the ext4 as read-only and mapping UDA and working from there.
The thing is that I’m trying to understand what caused this specific problem, since this is a new Jetson TX2i (so I assume the eMMC is nowhere near the end of its wear-leveling life) and no heavy writes were done (the Jetson was more or less idle).
I want to understand why I got this bad CRC.
I know the superblock stores some metadata of the file system, yet I would expect that when the superblock changes (for example, due to a change in the free blocks count), the ext4 implementation calculates the new block and its CRC and writes the block out in its entirety (since it sits in a NAND erase block).
So this error is unclear to me.
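
To narrow down which part of the superblock is off, here is a rough Python sketch that reads the primary superblock from a raw image and reports the magic, the needs_recovery flag, and whether the stored checksum matches a recomputed one. It rests on my understanding of the ext4 on-disk layout: superblock at byte 1024, `s_magic` at offset 0x38, `s_feature_incompat` at 0x60, `s_feature_ro_compat` at 0x64, `s_checksum` at 0x3FC, and the checksum being a CRC32C seeded with 0xFFFFFFFF over the first 0x3FC bytes with no final inversion. Treat those as assumptions to check against the ext4 documentation. The function names are hypothetical, and this only works on a raw image, not a sparse one.

```python
import struct

SUPERBLOCK_OFFSET = 1024          # primary superblock starts at byte 1024
SUPERBLOCK_SIZE = 1024
EXT4_MAGIC = 0xEF53
INCOMPAT_RECOVER = 0x0004         # "needs_recovery" feature flag
RO_COMPAT_METADATA_CSUM = 0x0400  # superblock checksum is only valid with this
CHECKSUM_OFFSET = 0x3FC           # s_checksum is the last 4 bytes


def _make_crc32c_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c_raw(data, seed=0xFFFFFFFF):
    """CRC32C (Castagnoli) update with no final inversion (assumed ext4 convention)."""
    crc = seed
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def read_superblock(path):
    """Read the 1024-byte primary superblock from a raw ext4 image."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return f.read(SUPERBLOCK_SIZE)


def inspect_superblock(sb):
    """Summarize magic, needs_recovery, and checksum state of a superblock."""
    if len(sb) < SUPERBLOCK_SIZE:
        raise ValueError("superblock is truncated")
    magic, _state = struct.unpack_from("<HH", sb, 0x38)
    incompat, ro_compat = struct.unpack_from("<II", sb, 0x60)
    (stored,) = struct.unpack_from("<I", sb, CHECKSUM_OFFSET)
    computed = crc32c_raw(sb[:CHECKSUM_OFFSET])
    has_csum = bool(ro_compat & RO_COMPAT_METADATA_CSUM)
    return {
        "magic_ok": magic == EXT4_MAGIC,
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
        "metadata_csum": has_csum,
        "stored_checksum": stored,
        "computed_checksum": computed,
        # None when the feature is off, since the field is then not meaningful
        "checksum_ok": (stored == computed) if has_csum else None,
    }
```

Calling `inspect_superblock(read_superblock(path))` on the raw clone would show whether the magic is intact and only the checksum disagrees, or whether the whole block looks damaged. The result describes what is on disk; it does not say why it ended up that way.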

I may just be confused by missing some steps, but you’d need to run fsck.ext4 on a loopback device covering backup.img.raw (when I see “backup.img” I think of it as a sparse file). Can we verify that your fsck was run against a loopback device covering the raw image rather than the sparse image?
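
One quick way to tell the two apart is to look at the first bytes of each file. The sketch below is in Python and assumes the sparse file uses the Android sparse image format, whose header starts with the little-endian magic 0xED26FF3A; I have not confirmed that flash.sh always produces that format, so treat it as an assumption. A raw ext4 image should instead show the ext4 magic 0xEF53 at byte 1024 + 0x38. The function names are hypothetical.

```python
import struct

ANDROID_SPARSE_MAGIC = 0xED26FF3A  # assumed format of the sparse "backup.img"
EXT4_MAGIC = 0xEF53
EXT4_MAGIC_OFFSET = 1024 + 0x38    # s_magic inside the primary superblock


def classify_header(head):
    """Classify the first 2 KiB of an image as 'sparse', 'raw-ext4', or 'unknown'."""
    if len(head) >= 4:
        (magic32,) = struct.unpack_from("<I", head, 0)
        if magic32 == ANDROID_SPARSE_MAGIC:
            return "sparse"
    if len(head) >= EXT4_MAGIC_OFFSET + 2:
        (magic16,) = struct.unpack_from("<H", head, EXT4_MAGIC_OFFSET)
        if magic16 == EXT4_MAGIC:
            return "raw-ext4"
    return "unknown"


def classify_image(path):
    """Read the start of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return classify_header(f.read(2048))
```

If the file that was passed to fsck.ext4 comes back as "sparse", the errors reported by fsck would say nothing about the state of the eMMC, because fsck was reading a container format rather than the filesystem.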

There are limits to what a journal can recover. I suppose there are ways to increase the journal size, but this would in no way prevent data loss (only corruption of the ext4 filesystem would be prevented), and power off would still risk both data loss and corruption if too much unwritten data is outstanding at the moment of power loss.

I could not tell you about the specific case of the TX2i’s superblock, but if we are looking at an SD card, then the implication is that the memory used for the superblock could not be written correctly, and thus the other CRC checksum issues could not be corrected. Wear leveling reaching the end of its abilities is one reason an SD card would fail this way.

Someone from NVIDIA with knowledge of the internal workings of the TX2i’s eMMC would need to comment on why there might be an unrecoverable superblock issue.