Nvidia AGX Xavier boot up problem

Note that a journaling filesystem keeps a record of what is being written but not yet flushed, and that if you suddenly lose power or have a crash, the journal will back out the changes which were not yet flushed. You’d lose whatever content was in the middle of being written at the time of the loss or lockup, but the filesystem itself would not be corrupt. Unfortunately, the journal has a fixed size, and if more content is being written than it can keep a record of, actual corruption can occur.

If the boot is unable to use a filesystem, then probably something invalid was actually written directly to the disk (e.g., formatting a disk of a running system would destroy it), or else there was a large amount of unflushed data at the time of failure.

The case of “loss of data” without “corruption” implies that content was in the middle of being written at the time of failure. In that case those files/directories would be missing or incomplete.

In the case of a boot failure where the Image file cannot be loaded: if the root filesystem type is not one which the boot software understands, then the load would fail. If the Image file was being updated at the time of a power loss, then it is possible that either the old kernel would still be in place, or that part of the new kernel would be lost (and thus the file would be corrupt, but still present). If enough of the kernel was being written at the time of failure, and the journal sees this, then the entire file might be erased during the journal recovery (versus just part of the file being lost, or versus rolling back to an old version).

If you need an emergency shutdown when something is locking you out, and you have a keyboard attached, there is a recommended way to shut down instead of cutting power. The magic sysrq can sometimes be used to first call sync (preferably twice), then remount the filesystems read-only, and then either cut power or tell the system to reboot. Sysrq usually survives even when the rest of the system is locked up or otherwise failing.

If you have a keyboard attached, then you might try this once just to see how it works:

  1. ALT-SYSRQ-s # Calling sync twice. Watch “dmesg --follow” ahead of time if curious.
  2. ALT-SYSRQ-s
  3. ALT-SYSRQ-u # Calling for the filesystems to be remounted read-only.
  4. ALT-SYSRQ-b # Calling for forced reboot.

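If you have a working serial console you cannot use key bindings for this, since the keystrokes would go to the host PC instead, but you can “echo” the correct character and redirect it to “/proc/sysrq-trigger”. The redirect itself has to be performed as root, so a plain “sudo echo” is not enough (the redirect would be done by your unprivileged shell); pipe through “sudo tee” instead. Example, from serial console:
echo s | sudo tee /proc/sysrq-trigger # Would call sync.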
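
The whole keyboard sequence above can also be driven from a serial console or ssh session. The following is only a minimal sketch, assuming a POSIX shell already running as root (for example via “sudo sh”); the names sysrq_send, emergency_reboot, SYSRQ_TRIGGER and SYSRQ_DELAY are made up for illustration and are not part of any Jetson tooling. The trigger path and the delay can be overridden so the sequence can be dry-run against an ordinary file first:

```shell
#!/bin/sh
# Hypothetical helper names; only /proc/sysrq-trigger itself comes from the kernel.
# Override SYSRQ_TRIGGER with a scratch file to dry-run without touching the system.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
# Seconds to wait between keys so each request has a chance to finish (assumed value).
SYSRQ_DELAY="${SYSRQ_DELAY:-2}"

# Send one sysrq command character to the trigger file.
sysrq_send() {
    echo "sysrq: $1"
    printf '%s\n' "$1" > "$SYSRQ_TRIGGER"
}

# Same order as the keyboard sequence: sync, sync, remount read-only, reboot.
# 'b' reboots immediately without syncing or unmounting, which is why 's' and 'u' go first.
emergency_reboot() {
    for key in s s u b; do
        sysrq_send "$key" || return 1
        sleep "$SYSRQ_DELAY"
    done
}
```

Calling emergency_reboot as root with the default path really does reboot the machine, so try it with SYSRQ_TRIGGER pointed at a scratch file before relying on it.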

Jetsons are full computers, not little embedded devices without cache or buffer. Treat them as if they are full computers. If you wouldn’t turn your host PC off by yanking the power cord from the wall, then don’t do this with a Jetson…it might be tiny, but it is a full system and would suffer the same as a desktop PC. Obviously when you hit bugs or crashes there isn’t much you can do about it, but if you have magic-sysrq available, then it is much safer than just pulling power.

Incidentally, there is a mask used to determine how much of magic-sysrq is exposed to the user. Not all architectures allow all functions, but basically a mask of “1” enables everything the architecture supports. See https://www.kernel.org/doc/Documentation/sysrq.txt, and examine “kernel.sysrq” in “/etc/sysctl.conf”. To see the actual sysrq mask currently being honored run this command:
cat /proc/sys/kernel/sysrq
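
As a rough illustration of how that mask is interpreted: per the kernel sysrq documentation, 0 disables sysrq, 1 enables everything, and larger values are a bitmask in which 16 allows sync, 32 allows remount read-only, and 128 allows reboot/poweroff. The sketch below checks whether a given mask permits the s/u/b sequence from the keyboard; the function name sysrq_allows_clean_reboot is hypothetical. The same documentation notes that the mask only restricts keyboard invocation, and that root writing to /proc/sysrq-trigger is always allowed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given sysrq mask permits
# sync (16), remount read-only (32) and reboot/poweroff (128) from the keyboard.
sysrq_allows_clean_reboot() {
    mask="$1"
    need=$(( 16 + 32 + 128 ))   # 176
    # A mask of exactly 1 means "enable all functions".
    if [ "$mask" -eq 1 ]; then
        return 0
    fi
    # Otherwise every required bit must be set.
    [ $(( mask & need )) -eq "$need" ]
}

# Example use against the live value:
#   sysrq_allows_clean_reboot "$(cat /proc/sys/kernel/sysrq)" && echo "clean reboot keys enabled"
```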

It is possible that someone setting up a commercial release would want to disable part of the sysrq, but I would suggest keeping at least the ability to shut down cleanly without yanking power.
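
For example, a mask which keeps only sync, remount read-only, and reboot would be 16 + 32 + 128 = 176. A sketch of the corresponding line for “/etc/sysctl.conf”, assuming the bit values from the kernel sysrq documentation and assuming nothing else on the system overrides the setting:

```conf
# Assumed example value: allow only sync (16), remount read-only (32), reboot/poweroff (128).
kernel.sysrq = 176
```

The file is read at boot; “sudo sysctl -p” reloads it without a reboot.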
