Jetson Orin NX not booting after multiple reboots

We performed some custom stress tests on the Jetson that required rebooting the device frequently in succession (roughly 10-15 times, with about a minute between reboots). Twice now while doing this, the device has stopped booting properly and gets stuck with: bash: cannot set terminal process group (-1): Inappropriate ioctl for device.

From this similar issue, the only working solution appears to be reflashing, but we would like to know whether there are better solutions, and ideally ways to prevent it from ever happening. We do not intend to reboot the device as often as during the stress test, but from other posts it seems this can happen with far fewer reboots.

We’re using an Orin NX 16GB with the Seeed Studio reComputer Industrial J4012, flashed with a custom basic rootfs following this guide.

Here are the serial console boot logs (for some reason the capture missed the last part, which is in the “missing” log):

serial-debug.log (63.8 KB)
serial-debug-missing.log (10.5 KB)

Hi,

The same issue has been asked many times.
UEFI has a protective mechanism that puts the device into recovery boot if it fails to boot three times. Even if a boot looks successful, it is not counted as a complete boot until this system service has finished:

sudo systemctl status nv-l4t-bootloader-config

So please check the state of this service each time before you reboot or shut down the device.
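A minimal sketch of such a check, assuming the service reports “activating” while still mid-run (confirm the states systemctl shows on your L4T release before relying on this):

state=$(systemctl is-active nv-l4t-bootloader-config)
if [ "$state" = "activating" ]; then
    echo "nv-l4t-bootloader-config has not finished; not rebooting yet" >&2
else
    sudo reboot
fi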


We noticed that the nv-l4t-bootloader-config service was waiting on rc-local.service. We had previously added a 60 s sleep to /etc/rc.local, following the instructions here

to run jetson_clocks on startup. In hindsight that seems like very dangerous advice, given that applying this sleep severely increases the chances of bricking your device.

/etc/rc.local just runs commands on startup if it is enabled. A sleep will never brick a Jetson. In fact, actual bricking is nearly impossible since Jetsons don’t have a BIOS. The risk is needing to flash and install again, but you can clone a device for backup before doing anything risky.

A command you put in rc.local is either blocking or non-blocking. If either kind outright fails, boot continues. If a non-blocking command never completes, boot continues. If a blocking command blocks without failing, it holds up the rest of rc.local; if user login (or perhaps ssh) still comes up normally, that is not a risk. A command which brings down the system is of course a problem, but it doesn’t matter whether such a command sits in rc.local or anywhere else; running it manually would bring the system down just the same. What matters here is whether you get an opportunity to interrupt a failed command, and that depends on the command and the issue.

sleep only blocks the remainder of that script file; content after the sleep simply runs later. Note that this can never be related to “inappropriate ioctl for device”. That error indicates an incorrect driver for the ioctl command being used to communicate with it.
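As a purely hypothetical illustration (the command name below is a placeholder, not something from your setup):

some_long_command      # placeholder; blocking: lines below wait for it to exit
some_long_command &    # backgrounded: non-blocking, the script continues immediately
sleep 60               # blocks only the remainder of this script, nothing else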

You’ve likely heard many times that everything is a file in *NIX, and it is true that many drivers interface through device special files for file-like operations such as read or write to pass data. Things which cannot be passed as ordinary data require an I/O control command, or ioctl. These are numerically indexed lists of commands which are custom to each driver. If someone issues a setup command via ioctl to a driver which does not understand that ioctl call, then you get “inappropriate ioctl for device” (the device being whatever the driver is connected to).
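You can trigger the same errno harmlessly from a shell by aiming a terminal ioctl at something which is not a terminal:

stty -a < /dev/null    # fails with "Inappropriate ioctl for device"; the null driver implements no terminal ioctls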

One possible cause of an incorrect driver is a kernel customization using the wrong configuration. Considering you are using a third-party carrier board, it is also possible that your device tree points the driver at the wrong lanes and the device intended for the driver has been lost; this also causes “inappropriate ioctl for device” (not because the driver is wrong, but because there is no device behind the driver to respond).

Do note that rc.local is the very last thing to run during boot, after Linux has already loaded. Once the Linux kernel loads, the boot content is basically at end of life and Linux replaces it. The failed boot in your log shows you got through boot, and there is a PCI bus error. rc.local was never run; it did not get that far. Apparently the PCI bus has issues that lock it up.

One possible reason for PCI errors is using the wrong device tree for the carrier board. Hardware which cannot self-report (meaning it is not plug-n-play; this includes a lot of parts on the carrier board, among them the PCIe controller) must have its location and setup parameters passed to the driver as it loads. This is normally done via the device tree (part might be passed via a module argument if the driver is a module). Are you absolutely certain you are using Seeed Studio’s device tree? Has the kernel ever been modified?
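If you have a unit that still boots, a few generic checks show which kernel and device tree are actually in use (standard Linux/L4T locations, nothing Seeed-specific):

cat /proc/device-tree/model    # model string compiled into the loaded device tree
uname -r                       # running kernel release; compare against stock L4T
dmesg | grep -i dtb            # early boot lines often name the dtb that was loaded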

Incidentally, this is the module load which initially succeeds, but then the serial log ends in the middle of it:

insmod /lib/modules/5.15.122-tegra/kernel/drivers/pci/controller/dwc/pcie-tegra194.ko

Incidentally, since rc.local was never reached, your jetson_clocks was never run either; nor was the sleep.

Interesting, and thanks for the detailed answer!

It was dumb of me not to merge the two logs, as it obviously led to a misunderstanding: the second “missing” log continues where the first one left off. So the log from insmod .. pcie-tegra194.ko continues there. At the end of the log it fails to mount the rootfs (it first tries to mount /dev/nvme0n1p1 on /mnt, and then /dev/mmcblk3 on /mnt).

Secondly, this log is not from when the error first happened; I managed to reproduce the error and have a log from rebooting multiple times until it fails (see the attached file). You are right that this first happened on a Jetson where we used Seeed Studio’s patch for L4T 35.3.1, slightly modified for L4T 36.2 (as well as a custom basic rootfs). So it is not the exact same device tree as unmodified devices from Seeed Studio, since they do not plan to update their patch for JetPack 6 until the production release.

However, I managed to reproduce the error on an unmodified Jetson straight from Seeed Studio just by adding the sleep 60 to /etc/rc.local and rebooting 2-3 times (see the attached file). Before adding the sleep I could reboot more than 10 times without anything going wrong.

When it starts to fail we see L4TLauncher: Attempting Recovery Boot instead of L4TLauncher: Attempting Direct Boot.

spam-reboot-out-of-box-serial.log (280.9 KB)

Edit:
Removed the /etc/rc.local file and added a service that runs jetson_clocks (no sleep) with After=nv-l4t-bootloader-config.service. We cannot reproduce the boot error now, even on our custom rootfs and device tree for L4T 36.2.
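For anyone hitting the same thing, the unit looked roughly like this; the unit name and the jetson_clocks path are illustrative (check yours with which jetson_clocks):

sudo tee /etc/systemd/system/jetson-clocks.service >/dev/null <<'EOF'
[Unit]
Description=Run jetson_clocks after the bootloader-config service
After=nv-l4t-bootloader-config.service

[Service]
Type=oneshot
ExecStart=/usr/bin/jetson_clocks

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable jetson-clocks.service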

Seeed Studio would have to provide the boot chain software since it is their carrier board. I don’t know what their customizations are, but rc.local was not part of the boot issue. It might help to explain something about the initrd before saying more.

Whenever the boot stages hand off to a kernel, those stages have to be able to retrieve whatever software starts Linux running. At minimum this includes the kernel, and probably the device tree and arguments to pass to the kernel. The kernel itself might have more requirements: for example, if the system runs on RAID, and the RAID drivers are in the form of modules, then those modules cannot live on the RAID volume…a bit of the classic “which came first, the chicken or the egg?” problem. The boot chain understands ext4. It also understands a RAM disk, which is just a very simple filesystem existing as a tree structure in RAM. The content which fills the RAM disk is a “cpio archive” (basically a simple serialize/deserialize backup-and-restore mechanism).

During a normal boot one might load the kernel directly. This works great if everything is on the initial media, and if that media is all ext4. However, if you get the kernel from the eMMC “/boot” and then tell it the rest of the o/s is on external media, e.g., an NVMe, then the kernel is suddenly missing all of its modules if it lacks the mechanism to drive the NVMe.

During an initrd boot the initrd is used first, in place of the real filesystem. A cpio archive is unpacked into RAM, and this contains everything the kernel needs for a very minimal boot, including the device tree and kernel modules. For example, if you had an audio module that wasn’t needed for boot, it wouldn’t be in the initrd; if you had a driver for accessing an NVMe which is not built into the main kernel Image, that module would be part of the initrd. Instead of ending in a login shell, the initrd performs a pivot_root (or equivalent) which transplants the final rootfs in place of the cpio archive; the cpio archive then no longer exists, and the Linux kernel neither knows nor cares because it has another rootfs.
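Conceptually, the tail end of an initrd’s init script does something like this (an illustration only, not NVIDIA’s actual script):

mount /dev/nvme0n1p1 /mnt            # locate and mount the real rootfs
exec switch_root /mnt /sbin/init     # transplant it in place of the cpio archive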

Your initrd boot is the stage where something is going wrong. The initrd is failing to find these devices:

Finding OTA work dir on external storage devices

Checking whether device /dev/mmcblk?p1 exist

Device /dev/mmcblk?p1 does not exist

Checking whether device /dev/sd?1 exist

Device /dev/sd?1 does not exist

Checking whether device /dev/nvme?n1p1 exist

Looking for OTA work directory on the device(s): /dev/nvme0n1p1

[    6.907874] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
OTA work directory /mnt/ota_work is not found on /dev/nvme0n1p1

Finding OTA work dir on internal storage device

mount: /mnt: special device /dev/mmcblk0p1 does not exist.

Failed to mount /dev/mmcblk0p1 to /mnt

OTA work directory is not found on internal and external storage devices

There are no devices so far as the initrd is concerned. This is why bash cannot set a terminal process group…bash is what runs all of those initrd commands for setting things up within the cpio archive. It is trying to set up the real rootfs and cannot see it. It fails to pivot_root because there is nothing to pivot to. That is the inappropriate ioctl for device: the driver call that would pivot to a new root is receiving an impossible command, pointed at missing hardware.

I don’t know if you have a way to analyze your final rootfs; for example, cloning it to another computer, or mounting it read-only on another computer (a raw clone can be loopback mounted read-only). You would have to figure out whether the filesystem itself is causing the failure. If not, then you’d have to examine the cpio archive of the initrd and figure out whether it (which may not run on every boot, but does run in this case) is missing a required driver (or several), causing it to fail to find an otherwise valid filesystem.
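For example, if you already have a raw clone on the host (the image name here is illustrative), it can be loopback mounted read-only:

sudo mount -o loop,ro ./system.img.raw /mnt    # read-only, so the host cannot change or "fix" it
ls /mnt/boot                                   # inspect contents; run fsck only on a copy
sudo umount /mnt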

One reason an initrd might fail is if the kernel needs modules for boot and those modules are not present. If you’ve ever updated the kernel such that it has new boot requirements in the form of a module, and failed to put that module in the initrd, then so long as the initrd is not triggered you won’t see the issue; as soon as something causes an initrd boot, the missing module causes the failure to find the media. This is an interesting possibility in your case: maybe the normal boot isn’t via the initrd until the third failure. Or maybe it always runs the initrd, but the two circumstances take different branches of init (the bash script of the cpio archive), so one branch boots fine while the other fails.
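On an Ubuntu-based rootfs you can list the initrd’s content without unpacking it; lsinitramfs comes from initramfs-tools and might not exist on a minimal custom rootfs:

lsinitramfs /boot/initrd | grep -i nvme    # check whether the NVMe driver made it into the initrd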

Summary:

  • The filesystem itself may be bad. A clone and examination could say more, but always examine it in read-only mode so the host cannot change or “fix” it.
  • Or detection of the filesystem fails. You’d have to closely examine the logs and unpack the cpio archive of the initrd (basically command-line use of gunzip and cpio; a sketch follows).
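A sketch of the unpacking, assuming the typical L4T initrd location and gzip compression:

mkdir /tmp/initrd-unpacked && cd /tmp/initrd-unpacked
zcat /boot/initrd | cpio -idmv    # gunzip + cpio; yields init, lib/modules/..., etc.
# If zcat complains, the archive may be uncompressed; try: cpio -idmv < /boot/initrd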

This isn’t really a big issue now with the fix mentioned above, though we will continue to monitor it closely.

I understand that rc.local was not the direct cause of the boot issue, but the sleep in rc.local did in fact delay the nv-l4t-bootloader-config service mentioned by @DaveYYY, such that we rebooted before the service could finish successfully. That in turn seemed highly correlated with the boot issue, as it was very reproducible. Edit: Or do you mean that it should not fail to boot even when “Recovery Boot” triggers?

We do not have the time or resources to examine the logs and filesystems further ourselves, but I will make sure to pass this good and detailed information (again, thank you for that) along to Seeed Studio and see if they have anything to add.

Just a reference note in case you come back to this at a later date: when booting from a different partition, it is possible that content in the alternate boot partition differs from content in the primary partition. This in turn implies that things like kernel arguments, device tree, and udev triggers might behave differently depending on which rootfs was booted, or on which kernel or device tree was used (it might be the same one, but it might be something different).

