The AGX system often crashes spontaneously within 3-4 months of operation

20230718.txt (1.7 MB)
20230719082112_COM36.txt (92.6 KB)
The BSP version is 32.1. The system crash log is attached; please take a look at the cause.

Hi,

Are you using the devkit or a custom board for AGX Xavier?

This is quite an old release. Could you verify with the latest R32.7.4 or R35.3.1?

[   17.668125] VFS: Cannot open root device "mmcblk0p1" or unknown-block(179,1): error -5
[   17.669176] Please append a correct "root=" boot option; here are the available partitions:
[   18.047825] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(179,1)
[   18.127800] Kernel Offset: disabled
[   18.131030] Memory Limit: none
[   18.141958] Rebooting in 5 seconds..
????Shutdown state requested 1
Rebooting system ...

It seems your board could not mount the rootfs on mmcblk0p1 and triggered a kernel panic during boot.

Have you connected any other boot device, such as USB storage, an NVMe SSD, or an SD card?
Was there any error while flashing the board?

Okay, thank you very much for your reply. Every version has had this issue. Our other devices have low data throughput, so the phenomenon rarely occurs on them; this device's throughput is high, so it occurs very frequently.

Incidentally, these logs show filesystem corruption:

[   17.545010] EXT4-fs (mmcblk0p1): error loading journal
[   17.549972] VFS: Cannot open root device "mmcblk0p1" or unknown-block(179,1): error -5
[   17.551057] Please append a correct "root=" boot option; here are the available partitions:
[  100.333856] EXT4-fs error (device mmcblk0p1): ext4_iget:4591: inode #1456345: comm systemd-tmpfile: checksum invalid

A crash itself could cause corruption. However, it would not normally cause so much corruption that the journal could not recover. Is the system normally shut down via software, or by loss of power?

Previously it was indeed a direct power-off, but later it was changed to a software shutdown followed by power-down, and the problem remains the same.

It is possible that too much damage was done during the power outage. Any “repair” basically cuts content out of the filesystem. In most cases you can’t tell where the missing content came from, but if you can boot, or clone the device and examine the clone, looking in the “lost+found/” subdirectory might offer hints: when content is truncated during repair of corruption, that is where the excised content goes.

Note that filesystems are a tree structure. You might think that if a file is being written, then only that file would be corrupted. However, that file is content within a directory. The directory itself can be considered a “special type of file” that contains files, and that directory is in turn part of another directory (unless it is “/”). So you can’t predict with any certainty what will be corrupted or removed during repair. If there is enough unwritten and unflushed content, repair may be needed, and until it is done the kernel will refuse to mount the filesystem (knowing that further writes to a corrupt filesystem will corrupt it further). Often, at that stage, you must repair the filesystem manually, which is when excised content goes to “lost+found/”.

The filesystem might have nothing to do with the underlying problem. Perhaps a power spike caused damage, perhaps not. But I don’t think you can reliably recover once it is corrupt without flashing again. Note that you could clone the rootfs and examine the clone even if you can’t examine the Jetson directly. This also has the advantage that you can attach the clone to a loopback device and run repair tools against the loopback device to see what would happen.
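The clone-and-inspect workflow above can be sketched as follows. This is a minimal illustration, not an exact procedure for this board: it uses a scratch image file as a stand-in for a real clone of mmcblk0p1, and it assumes e2fsprogs (`mkfs.ext4`, `fsck.ext4`) is installed. The device paths in the comments are hypothetical examples.

```shell
# Create a small scratch image to stand in for a cloned rootfs partition.
dd if=/dev/zero of=clone.img bs=1M count=16 status=none
mkfs.ext4 -q -F clone.img      # -F: allow formatting a regular file

# Dry-run check: -n answers "no" to every prompt, so the image is
# examined but never modified. This shows what a repair WOULD do.
fsck.ext4 -n clone.img

# On a real clone of the Jetson rootfs the sequence would look like
# (run from a rescue environment, paths are examples only):
#   dd if=/dev/mmcblk0p1 of=clone.img bs=4M   # take the clone
#   fsck.ext4 -n clone.img                    # preview the damage
#   fsck.ext4 clone.img                       # actually repair the clone
#   mount -o loop clone.img /mnt              # loopback-mount the result
#   ls /mnt/lost+found/                       # excised content lands here
```

Running repairs on the clone rather than the original means a destructive repair costs you nothing; you can always re-clone and try a different approach.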

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.