Nvidia AGX Xavier boot up problem

The Xavier board was working perfectly fine (the SDK and everything else was present).

After I did a forced shutdown, the board no longer boots.

How can I check what is causing this?

Dump the log from the serial console.

I have attached the log:
JetsonAGX_log (39.4 KB)

It looks like the file system has been corrupted, so the bootloader is not able to read the kernel from it.

[0006.407] I> ########## Fixed storage boot ##########
[0006.412] I> Already published: 00010003
[0006.416] I> Look for boot partition
[0006.419] I> Fallback: assuming 0th partition is boot partition
[0006.425] I> Detect filesystem
[0006.452] I> Loading extlinux.conf 

[0006.452] I> rootfs path: /sdmmc_user/boot/extlinux/extlinux.conf
[0006.489] I> L4T boot options
[0006.489] I> [1]: “primary kernel”
[0006.490] I> Enter choice:
[0009.491] I> Continuing with default option: 1
[0009.491] I> Loading kernel sig file from rootfs 

[0009.491] I> rootfs path: /sdmmc_user/boot/Image.sig
[0009.510] I> Loading kernel binary from rootfs 

[0009.510] I> rootfs path: /sdmmc_user/boot/Image
[0015.746] I> lookup_linear_dir:441: Invalid file block num
[0015.746] I> ext2_walk:142: ‘Image’ lookup failed
[0015.747] I> ext4_open_file:647: ‘/boot/Image’ lookup failed
[0015.747] E> file /sdmmc_user/boot/Image open failed!!
[0015.747] W> Failed to load kernel binary from rootfs (err=20
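For anyone reading a long serial log like this, a small filter can pull out just the warning and error lines. This is only a sketch: the function name `boot_log_problems` is made up for illustration, and the pattern assumes the `[seconds] E>` / `[seconds] W>` prefix format seen in the excerpt above.

```shell
#!/bin/sh
# Print only the error (E>) and warning (W>) lines of a saved bootloader log.
# Assumes the "[seconds] X> message" prefix format shown in the excerpt above.
boot_log_problems() {
    grep -E '\[[0-9]+\.[0-9]+\] (E|W)> ' "$1"
}
```

Usage would be something like `boot_log_problems JetsonAGX_log`, which leaves out the informational `I>` lines.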

What should be the approach for this issue?

The only way now is to reflash your device.

Can I know what causes this issue, so that I can avoid it in future?

I don’t know either. Since you said you did a “forced shutdown”, I can only guess that it corrupted the file system.
Though I have actually not seen many such cases before.

If you have more detail about what you’ve tried and are able to reproduce this easily, then we can investigate.

I was having an issue connecting to the internet through USB, so I did a forced shutdown.

I would suggest you try the same thing after you re-flash the board and see if this error happens again.

Sure, thank you.

Will upgrading the bootloader fix this issue? Reflashing the device might take time.

No, the broken part is the kernel. A bootloader update does not update it.

Also, it is actually not only the kernel that is broken. We have a redundancy mechanism: when the kernel in the file system is broken, boot falls back to the kernel in the partition. Your kernel in that partition is fine.

But the ramdisk in the file system is broken too, and this one has no backup. If you just want a quick fix, you can try using flash.sh with the -I parameter to flash your initrd, though I don’t guarantee it will work.

    -I <initrd> ---------- initrd file. Null initrd is default.

[0016.655] I> rootfs path: /sdmmc_user/boot/initrd
[0022.869] I> lookup_linear_dir:441: Invalid file block num
[0022.869] I> ext2_walk:142: ‘initrd’ lookup failed
[0022.870] I> ext4_open_file:647: ‘/boot/initrd’ lookup failed
[0022.870] E> file /sdmmc_user/boot/initrd open failed!!
[0022.871] E> kernel boot failed

Thank you!

Note that a journaling filesystem keeps a record of what is written but not yet flushed. If you suddenly lose power or have a crash, the journal will back out the changes which were not yet flushed. You’d lose all content which was in the middle of a write at the time of loss or lockup, but the filesystem would not be corrupt. Unfortunately, the journal has a fixed size, and if more content is being written than it can keep a record of, actual corruption occurs.

If the boot is unable to use a filesystem, then probably something invalid was actually written directly to the disk (e.g., formatting a disk of a running system would destroy it), or else there was a large amount of unflushed data at the time of failure.

In the case of no “corruption”, but “loss of data”, it implies that content was in the middle of write at the time of failure. In that case those files/directories would be missing or incomplete.

In the case of a boot failure when unable to load the Image file, if the root filesystem type is not one which the boot software understands, then this would be a failure. If the Image file was being updated at the time of a power loss, then it is possible that either the old kernel would still be in place, or that part of the new kernel would be lost (and thus corrupt, but still present). If enough of the kernel was being written at the time of failure, and the journal sees this, then the entire file might be erased during the journal recovery (versus just part of the file or versus rolling back to an old version).

If you need an emergency shutdown when something is locking you out, and you have a keyboard attached, there is a recommended way to shut down instead of cutting power. The magic sysrq can sometimes be used to first call sync (preferably twice), then remount the filesystems read-only, and then either reboot or let you cut power. Sysrq usually survives even when the rest of the system is locked up or otherwise failing.

If you have a keyboard attached, then you might try this once just to see how it works:

  1. ALT-SYSRQ-s # Calling sync twice. Watch “dmesg --follow” ahead of time if curious.
  2. ALT-SYSRQ-s
  3. ALT-SYSRQ-u # Calling for the filesystems to be remounted read-only.
  4. ALT-SYSRQ-b # Calling for forced reboot.

If you have a working serial console you cannot use key bindings for this, since they would go to the host PC instead, but you can “echo” the correct character into “/proc/sysrq-trigger” as root. Example, from serial console:
echo s | sudo tee /proc/sysrq-trigger # Would call sync (with "sudo echo s > file" the redirect runs without root and fails).
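The whole keyboard sequence from the list above (sync, sync, remount read-only, reboot) could be scripted for the serial console roughly like this. It is only a sketch and has to run as root. The names `sysrq_send` and `emergency_reboot`, and the `SYSRQ_TRIGGER` / `SYSRQ_DELAY` overrides, are made up here so the sequence can be rehearsed against /dev/null without actually rebooting.

```shell
#!/bin/sh
# Sketch: sync twice, remount read-only, then force a reboot via sysrq.
# Must run as root. SYSRQ_TRIGGER and SYSRQ_DELAY are hypothetical overrides
# added here so the sequence can be rehearsed against /dev/null.
sysrq_send() {
    # Write a single sysrq command character to the trigger file.
    printf '%s' "$1" > "${SYSRQ_TRIGGER:-/proc/sysrq-trigger}" || return 1
    echo "sent sysrq: $1"
}

emergency_reboot() {
    for key in s s u b; do
        sysrq_send "$key" || return 1
        # Give the kernel a moment to act before the next step.
        sleep "${SYSRQ_DELAY:-1}"
    done
}
```

A dry run would be `SYSRQ_TRIGGER=/dev/null SYSRQ_DELAY=0 emergency_reboot`, which only prints the four steps. Calling `emergency_reboot` as root without the overrides really does reboot the board, so try the dry run first.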

Jetsons are full computers, not little embedded devices without cache or buffer. Treat them as if they are full computers. If you wouldn’t turn your host PC off by yanking the power cord from the wall, then don’t do this with a Jetson. It might be tiny, but it is a full system and would suffer the same as a desktop PC. Obviously when you hit bugs or crashes there isn’t much you can do about it, but if you have magic-sysrq available, then it is much safer than just pulling power.

Incidentally, there is a mask used to determine how much of magic-sysrq is exposed to the user. Not all architectures allow all functions, but basically a mask of “1” enables everything the architecture supports. See https://www.kernel.org/doc/Documentation/sysrq.txt, and examine “kernel.sysrq” in “/etc/sysctl.conf”. To see the actual sysrq mask currently being honored run this command:
cat /proc/sys/kernel/sysrq
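To interpret the number that command prints, a small helper along these lines might be useful. It is a sketch: the function name `sysrq_allows` is made up, and it only covers the three functions used in the sequence earlier (sync, remount read-only, reboot). The bit values (16, 32, 128) are the ones listed in the kernel sysrq documentation linked above.

```shell
#!/bin/sh
# Report whether a kernel.sysrq value permits a given keyboard sysrq function.
# Bit values are taken from the kernel's sysrq documentation.
sysrq_allows() {
    mask=$1
    case $2 in
        sync)    bit=16  ;;   # ALT-SYSRQ-s
        remount) bit=32  ;;   # ALT-SYSRQ-u
        reboot)  bit=128 ;;   # ALT-SYSRQ-b
        *) echo "unknown function: $2" >&2; return 2 ;;
    esac
    # 1 means "everything enabled"; otherwise test the individual bit.
    if [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}
```

For example, `sysrq_allows "$(cat /proc/sys/kernel/sysrq)" sync` checks the running system, and a value of 176 (16 + 32 + 128) allows all three. Per the same kernel document, the mask only limits keyboard invocation; root can always write to /proc/sysrq-trigger.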

It is possible that someone setting up a commercial release would want to disable part of the sysrq, but I would suggest keeping at least the ability to shut down cleanly without yanking power.
