Jetson TX2i boot-loop

linus.s.gates · June 28, 2020, 10:59am

I used JetPack 4.2 to flash my Jetson TX2i.
After a few minutes of normal operation, I unplugged the Jetson and then reconnected it.
The kernel boots but fails to mount the ext4 filesystem on /dev/mmcblk0p1 (I’m not using U-Boot).
I managed to clone this partition, and it did contain a corrupted ext4 filesystem with a bad superblock CRC (which is what prevented the mount).
fsck.ext4 also indicates that:
Superblock needs_recovery flag is clear, but journal has data.
and when running fsck.ext4 on my PC with the -y flag I get a lot of:
Free blocks count wrong for group #XX
I could not reproduce this problem with other cards, and I don’t know why it happened or how to prevent it (my system will sometimes experience hard, non-soft shutdowns).
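
For anyone who wants to look at the same flags fsck is complaining about, here is a minimal Python sketch that decodes a few fields of the primary ext4 superblock from a raw partition image. It assumes the standard ext4 on-disk layout (superblock at byte 1024, little-endian fields); the function names and the file name in the usage note are only illustrative, not anything from JetPack.

```python
import struct

SUPERBLOCK_OFFSET = 1024          # primary superblock starts 1024 bytes into the partition
SUPERBLOCK_SIZE = 1024
EXT4_MAGIC = 0xEF53               # s_magic
STATE_CLEAN = 0x0001              # s_state bit: cleanly unmounted
COMPAT_HAS_JOURNAL = 0x0004       # s_feature_compat bit
INCOMPAT_RECOVER = 0x0004         # s_feature_incompat bit, shown by fsck as "needs_recovery"
RO_COMPAT_METADATA_CSUM = 0x0400  # s_feature_ro_compat bit


def read_superblock(path):
    """Return the 1024 raw bytes of the primary superblock of a raw image."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(SUPERBLOCK_SIZE)
    if len(sb) != SUPERBLOCK_SIZE:
        raise ValueError("image is too short to hold a superblock")
    return sb


def summarize(sb):
    """Decode the handful of superblock fields relevant to this thread."""
    magic, state = struct.unpack_from("<HH", sb, 0x38)
    compat, incompat, ro_compat = struct.unpack_from("<III", sb, 0x5C)
    return {
        "is_ext": magic == EXT4_MAGIC,
        "clean": bool(state & STATE_CLEAN),
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
        "metadata_csum": bool(ro_compat & RO_COMPAT_METADATA_CSUM),
    }
```

Usage would be something like `summarize(read_superblock("backup.img.raw"))`, run against a raw (non-sparse) copy of the clone. If the superblock itself is damaged, the decoded values may be garbage, so treat the result as a hint only.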

linuxdev · June 28, 2020, 9:43pm

This of course is a bad way to shut down. Often, if there isn’t too much unwritten data, the journal will recover: you’ll be missing some content, but the ext4 filesystem won’t be corrupt. Then there is the case where there is more unwritten data than the journal can handle, or where the missing data was important for boot.

Are you running “fsck.ext4” against a loopback device covering the cloned partition? The first thing to realize is that if you have a backup file to repair via loopback, you can make a copy of that file and do whatever testing you want on the copy without putting the clone at risk. Do you have enough disk space on the host PC to create a copy of the clone?
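
As a rough sketch of that "work on a copy" idea in Python: copy the raw clone first, then run a read-only check against the copy. The function names are hypothetical. `e2fsck -f -n` forces a check and answers "no" to every prompt, so it does not modify the file, and e2fsck will accept a regular image file as well as a loopback device.

```python
import shutil
import subprocess


def make_working_copy(src, dst):
    """Copy the raw clone so experiments never touch the original file."""
    shutil.copyfile(src, dst)
    return dst


def check_copy(path):
    """Run a forced, read-only e2fsck on the copy.

    Returns the e2fsck exit status, or None if e2fsck is not installed.
    """
    exe = shutil.which("e2fsck")
    if exe is None:
        return None
    result = subprocess.run([exe, "-f", "-n", path],
                            capture_output=True, text=True)
    print(result.stdout)
    return result.returncode
```

A typical call would be `check_copy(make_working_copy("backup.img.raw", "work.img.raw"))`, where both file names are placeholders. Only once the read-only pass has been looked over would it make sense to try `-y` on the copy.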

If you are instead running this against an SD card or some other removable media, then that media might be going bad. A superblock that fails to write may imply that wear leveling on the solid state memory has reached the end of its life. This does not apply to loopback devices; it only applies to actual solid state devices. The eMMC of the device the clone came from would not be at issue. More information is needed on how the image is stored.

linus.s.gates · June 29, 2020, 5:57am

I cloned the APP partition using:
sudo ./flash.sh -r -k APP -G backup.img jetson-tx2i mmcblk0p1
and ran fsck.ext4 on backup.img on my host PC.
I know I can avoid this problem in many ways, like mounting the ext4 as read-only and mapping UDA and working from there.
The thing is that I’m trying to understand what caused this specific problem, since this is a new Jetson TX2i (I assume the eMMC wear leveling is nowhere near end of life) and no heavy writes were done (the Jetson was more or less idle).
I want to understand why I got this bad CRC.
I know the superblock stores some metadata of the filesystem, yet I guess that on a superblock change (for example due to a change in the free blocks count) the ext4 implementation calculates the new block and CRC and writes the block out in full (since it is a NAND erase block).
So this error is unclear to me.
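
To make the "bad CRC" concrete, here is a small Python sketch that recomputes the superblock checksum from a raw image. It rests on my understanding of the ext4 layout when the metadata_csum feature is enabled: the checksum is CRC32C over the first 0x3FC bytes of the superblock, seeded with all ones and stored little-endian at offset 0x3FC without a final inversion. Please treat those details as assumptions to verify against the ext4 documentation; the function names are made up for illustration.

```python
import struct

CHECKSUM_OFFSET = 0x3FC  # s_checksum is the last 4 bytes of the 1024-byte superblock


def crc32c_raw(data, crc=0xFFFFFFFF):
    """Bitwise CRC32C (Castagnoli), seeded with all ones, no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78  # reflected Castagnoli polynomial
            else:
                crc >>= 1
    return crc


def superblock_checksum_ok(sb):
    """Compare the stored s_checksum with one recomputed from the superblock bytes."""
    (stored,) = struct.unpack_from("<I", sb, CHECKSUM_OFFSET)
    return stored == crc32c_raw(sb[:CHECKSUM_OFFSET])


def check_image(path):
    """Read the primary superblock (byte offset 1024) of a raw image and check it."""
    with open(path, "rb") as f:
        f.seek(1024)
        return superblock_checksum_ok(f.read(1024))
```

A mismatch only says that the bytes on disk and the stored checksum disagree. It does not say whether the write was torn, a field was changed without updating the checksum, or the medium returned bad data.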

linuxdev · June 29, 2020, 6:32pm

I may just be confused because I’m missing some steps, but you’d need to run fsck.ext4 on the loopback device covering backup.img.raw (when I see “backup.img” I think of it as a sparse file). Can we verify whether your fsck was run against a loopback device covering the raw image rather than the sparse image?
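
One quick way to tell the two apart is to look at the first bytes of the file. The Python sketch below assumes the sparse file uses the Android sparse image format (magic 0xED26FF3A in the first four bytes); I have not confirmed that this is what flash.sh produces, so that part is an assumption. A raw ext4 image instead carries the ext magic 0xEF53 at byte 1024 + 0x38. The function names are illustrative.

```python
import struct

ANDROID_SPARSE_MAGIC = 0xED26FF3A  # assumption: the sparse file is in Android sparse format
EXT4_MAGIC = 0xEF53
EXT4_MAGIC_OFFSET = 1024 + 0x38    # s_magic inside the primary superblock


def classify_image(header):
    """Classify an image from its first 2 KiB: 'sparse', 'raw-ext4' or 'unknown'."""
    if len(header) >= 4:
        (magic32,) = struct.unpack_from("<I", header, 0)
        if magic32 == ANDROID_SPARSE_MAGIC:
            return "sparse"
    if len(header) >= EXT4_MAGIC_OFFSET + 2:
        (magic16,) = struct.unpack_from("<H", header, EXT4_MAGIC_OFFSET)
        if magic16 == EXT4_MAGIC:
            return "raw-ext4"
    return "unknown"


def classify_file(path):
    """Read the first 2 KiB of a file and classify it."""
    with open(path, "rb") as f:
        return classify_image(f.read(2048))
```

Running fsck.ext4 directly on a sparse file would not be checking the real filesystem layout, so the classification is worth doing before trusting any fsck output.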

There are limits to what a journal can recover. I suppose there are ways to increase the journal size, but this would in no way prevent data loss (only corruption of the ext4 filesystem would be prevented), and power off would still risk both data loss and corruption if too much unwritten data is outstanding at the moment of power loss.

I could not tell you about the specific case of the TX2i’s superblock, but if we are looking at an SD card, then there is an implication that the memory used for the superblock was unable to be written correctly, and thus the other CRC checksum issues could not be corrected. If wear leveling on an SD card has reached the end of its abilities, then this is one reason why an SD card would fail with this.

Someone from NVIDIA with knowledge of the internal workings of the TX2i’s eMMC would need to comment on why there might be an unrecoverable superblock issue.

Bibek · July 9, 2020, 5:34am

Can you explain a bit more about that? Are you talking about an SD card?
We routinely pull power abruptly, and we have not seen eMMC filesystem corruption. We have only very rarely heard of filesystem corruption, but it is possible. It has nothing to do with the hardware, or with the TX2i in particular. I will check with the MMC experts to see if there is anything more we can add to this.

linus.s.gates · July 20, 2020, 6:37am

I am talking about the eMMC in the TX2i. Since we did nothing special with the Jetson that got corrupted, we took another Jetson TX2i (flashed using the same JetPack) and tested it with many forced power pulls.
We did this a few seconds after power-up (so it would shut down during boot) and also after 30, 45, or 60 seconds (chosen randomly) so it would shut down after boot.
This was done over 20,000 times without reproducing the bug.

linuxdev · July 20, 2020, 7:14pm

Corruption would depend on the amount of unwritten data exceeding journal size/playback. A system which has not written data for some time before loss of power might not lose anything other than say a slight bit of a log file tail. A system which was writing significant data at the time of failure is more likely to corrupt. So it just depends on the nature of the journal size and the unwritten data size…try purposely writing several very large files simultaneously and then pulling power…this might result in requiring fsck.ext4 even after journal recovery. I’ve never done it, but tune2fs could in theory create a larger journal which would be more resilient.
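
A minimal Python sketch of that kind of write load, in case it helps: several threads each stream a large file without calling fsync, so plenty of data is still unwritten when power is pulled. The function names, file names, and default sizes are arbitrary choices for illustration, not a tested reproduction recipe.

```python
import os
import threading


def write_large_file(path, total_bytes, chunk_bytes=1 << 20):
    """Stream total_bytes of random data to path, deliberately without fsync."""
    chunk = os.urandom(chunk_bytes)
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
    return written


def stress_write(directory, n_files=4, total_bytes=1 << 30, chunk_bytes=1 << 20):
    """Write n_files large files at the same time, one thread per file."""
    paths = [os.path.join(directory, f"stress_{i}.bin") for i in range(n_files)]
    threads = [
        threading.Thread(target=write_large_file,
                         args=(p, total_bytes, chunk_bytes))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths
```

On the target this would be started as, for example, `stress_write("/home/nvidia/stress")` (a placeholder directory) and power removed while it is still running.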

linus.s.gates · July 21, 2020, 6:14am

I’ve run these power-down tests while the Jetson was continuously writing big files, with no success in reproducing the bug.
The Jetson that experienced the bug was not running anything like that anyway…

linuxdev · July 21, 2020, 7:39pm

The gist of this is that corruption from incorrect power down is not actually a bug. The journal is a safety tool with limitations, and is doing what it should. When the ability of the journal is exceeded, then the operating system is intended to refuse writing to that filesystem to avoid further corruption. Incorrect shutdown is the actual bug.
