Orin NX unable to boot after days of fully functional operation

Hello

We are currently working on a project where we are using Jetson Orin NX 16GB modules on a custom carrier board.

The Jetson uses a 1TB NVMe SSD as storage. After implementing our kernel customizations in the Linux for Tegra sources, flashing is successful, the Jetson boots normally, and it is fully operational.

After setting up our software (basically only ROS2 in the ISAAC ROS docker), the Jetson was in operation for several days, surviving multiple system restarts and even power cut-offs.
We have now encountered a problem multiple times where, after some time of normal operation, the Jetson all of a sudden does not boot up to Linux anymore. It tries several times and then falls back to recovery mode, and once this happens we cannot get it to work again without reflashing and setting everything up completely fresh.

Note that we are using an Intel E810 25Gbit network card on a PCIe x4 interface. The NVMe is connected via PCIe x2.

We are clueless about what could be the cause, since we do not make any changes to the Jetson during operation…

Here is the kernel log extracted via the debug UART:
Jetson_stuck_2025-03-05.txt (76.6 KB)

Here is what it looks like on the monitor during a failed boot-up:

Thanks in advance!

Just to clarify the situation.

  1. Your system got stuck in recovery boot. That is an image that gets entered if your system hits boot failure multiple times.

  2. To recover from it, please refer to
    Jetson AGX Orin FAQ

  3. Recovering from it does not mean the issue is fixed. As I said in (1), we need to check why your board got multiple boot failures. So you should still monitor your UART log and see what is going on.

  4. Many users didn’t get what I said above, so if you also cannot follow, please tell me. Don’t waste time trying something when you don’t get it.

Hello @WayneWWW

Thank you for your quick reply.

I understand what you are saying.
In the BIOS, I changed Device Manager → NVIDIA Configuration → L4T Configuration → OS chain A status back to Normal to force “Direct Boot”.

As already mentioned in my initial post, it still does not boot up: it tries (and fails) three times and then falls back into recovery mode.

The log of the recovery mode boot is the one I already attached in my initial post.

Here is the log from the direct boot attempts:
log_direct_boot.log (93.1 KB)

Thanks for your help

Hi,

Yes, the direct boot attempt is the one I want to check.

Based on this, I can see the file system cannot be mounted.

[ 3.400738] usb 1-2.5: new high-speed USB device number 5 using tegra-xusb
[ 12.996716] ERROR: mounting PARTUUID=12659b25-41dd-45c8-b011-327ab214ef6f as /mnt fail…
[ 12.999177] ERROR: PARTUUID=12659b25-41dd-45c8-b011-327ab214ef6f mount fail…
[ 13.000707] ttyTCU0: Press [ENTER] to start bash in 30 seconds…
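
(For reference: on a working system you can confirm which partition carries this PARTUUID with standard tools; the device name below is an assumption, adjust to your layout.)

# show the PARTUUID of the assumed rootfs partition
sudo blkid /dev/nvme0n1p1
# or list all block devices with their PARTUUIDs and mountpoints
lsblk -o NAME,PARTUUID,MOUNTPOINT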

Did you modify anything like the kernel or device tree on your side? From what I saw, the PCIe driver does not even appear in your log.

Upon first setup (directly after flashing), we added a device tree overlay to /boot/extlinux/extlinux.conf (see the snippet below).
This was working fine: the hardware added with the overlay was fully operational, and the Jetson was in normal use for about two weeks until this boot-up problem first occurred.
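
For context, the entry looks roughly like this; the overlay file name and the root PARTUUID here are placeholders, not our real values. The OVERLAYS line is the only thing we added:

TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=PARTUUID=<rootfs-partuuid> rw rootwait
      OVERLAYS /boot/custom-hardware.dtbo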

What we have additionally done is just add the CUDA paths to .bashrc so that nvcc is found as a compiler, since this is not done by default when installing nvidia-jetpack.
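
The lines in question are just the standard CUDA path exports (the /usr/local/cuda symlink target depends on the installed JetPack version):

# appended to ~/.bashrc so nvcc and the CUDA libraries are found
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH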

From my point of view, none of these changes should result in a complete bricking of the system…

Regarding the PCIe driver: the Intel ice driver is built on the target, and it generally takes up to two minutes until it is loaded and the interfaces become visible. So it makes sense that it is not visible yet at this boot-up stage.
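
This is how we normally confirm it after boot (standard commands, nothing Jetson-specific):

# check whether the ice module is loaded and the E810 ports are visible
lsmod | grep ice
ip link show
dmesg | grep -i ice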

I am not quite sure about which Intel ice driver you mentioned here. That is NOT where the problem comes from.

I am talking about the Tegra PCIe drivers. If no such driver is up, then none of the PCIe devices would come up.

Ah okay.

Misunderstanding here…

What I don’t understand is why mounting fails if at line 1574 it says:

[ 1.959367] Root device found: PARTUUID=12659b25-41dd-45c8-b011-327ab214ef6f

That UUID is an ID for your partition on the NVMe, and that is where the rootfs gets mounted.

But the NVMe is not up because the PCIe driver is not present at all. That is what this error is telling you.

As for why the PCIe driver is not able to come up, there could be various reasons. From what I have seen, most of the time it is because the kernel was updated and the kernel version no longer matches the kernel module versions. The PCIe drivers are part of the kernel modules (.ko), so they fail to get probed.
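
A quick way to check for such a mismatch from a shell; the module name pcie-tegra194 is an assumption for Orin and may differ per release:

# running kernel version
uname -r
# module trees installed on disk; one of them must match the running version
ls /lib/modules/
# kernel version a given module was built against
modinfo -F vermagic pcie-tegra194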

If you don’t know what I am talking about, try to reflash your board to the initial state, and that shall get the driver loaded back again.

I understand what you are saying.

We did NOT make any changes to the kernel, and we did not do any updates or changes on the Jetson either.

Reflashing is not a solution. The Jetson is supposed to be up and running in the field, in an enclosure that is not easily accessible.
Your proposal of just reflashing is not the way to go; in our case it means reflashing the Jetson every two weeks.

This is not a solution, but only a sketchy workaround.

Actually, what I am describing is more of a debug process than a solution.

You could go back, do the flash again, redo the steps you’ve done one by one, and see which one is causing this problem.

This topic won’t make progress if you just keep saying “we didn’t do something” or “I don’t think that is the cause”. You need to at least try something so that we can see which step is making the PCIe driver disappear. Please be aware that you are the one who can operate that board. I am nobody but a stranger here providing you some suggestions.

If you don’t want to flash, then maybe try to boot into initrd first and use that console to check whether kernel modules are missing. Entering initrd is similar to what you did in Device Manager → NVIDIA Configuration; I remember there is a boot option for initrd.
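
Once you are in the initrd shell, something like the following would show whether the module tree on the rootfs matches the running kernel. This assumes the NVMe is reachable from the initrd; the device name is also an assumption:

# inside the initrd console
uname -r
mount /dev/nvme0n1p1 /mnt          # mount the rootfs partition (name assumed)
ls /mnt/lib/modules/               # module trees installed on the rootfs
ls /mnt/lib/modules/$(uname -r)    # must exist for the running kernel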

I completely understand your point, but please understand my frustration.

We have encountered the problem not only once, but already three times.
I have reflashed the Jetson three times and set up everything from scratch. It never stops working right after one of my setup steps; it only stops working after some time in normal operation (once it was working for 1.5 months, once only a couple of days, and the third time (now) it was working for about two weeks).

Do you see my point? The problem is not easily reproducible, since it does not follow the scheme
Cause → Effect. It is happening randomly…

I am leaving this Jetson as it is, to have some sort of “Patient 0”.

I have already set up a new Jetson in the same way (which is working fine at the moment).
Could you instruct me on how to check whether the PCIe modules have a mismatch?
My basic setup consists of the following steps:

  1. Set the nvpmodel to MAXN and install a service which executes “sudo jetson_clocks” after boot-up (see the sketch after this list).
  2. Install nvidia-jetpack
  3. Install the “ice” driver ver. 1.13.7 for my PCIe Intel E810 25Gbit network card from https://www.intel.com/content/www/us/en/download/19630/812404/intel-network-adapter-driver-for-e810-series-devices-under-linux.html
  4. Install docker + prerequisites and pull my Nvidia Isaac ROS dockers.
  5. Run these dockers (ROS2 streaming pipelines for Intel RealSense and ZED SDK cameras)
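
For completeness, the jetson_clocks service from step 1 is just a oneshot unit along these lines; the unit name and ordering are our own choices, not anything official:

# /etc/systemd/system/jetson-clocks.service
[Unit]
Description=Run jetson_clocks once after boot
After=nvpmodel.service

[Service]
Type=oneshot
ExecStart=/usr/bin/jetson_clocks

[Install]
WantedBy=multi-user.target

It is enabled with “sudo systemctl enable jetson-clocks.service”.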

Hi,

Let me align with you on some information first.
What could you try on your device at this moment? Do you want to check the board which is already stuck in the error situation, or do you want to start checking from a freshly flashed device first?

I am leaving the freshly set up Jetson as it is, to see if it stops working after some time.

On the “bricked” one, I was able to boot into Linux after changing the L4T Boot Mode from “Extlinux” to “Kernel Partition”.

I can send you a log in an hour or so.

Hi,

If changing from “extlinux” to “kernel partition” makes things work, then it actually indicates that you previously accidentally changed the kernel.

extlinux will read the kernel from the file system, i.e. /boot/Image.
Kernel partition will read the kernel from a separate partition, which is not the same as /boot/Image. If you didn’t change anything in the kernel partition before, then its content should be the default image provided by us.

If I have done some kernel customization (i.e. adding drivers):
Is the kernel partition image then still “NVIDIA stock”, or is it (directly after flashing) the same as /boot/Image, i.e. the one with the customization?

If your so-called “adding drivers” only happened on the Jetson device itself, and you did not prepare it on the host PC and reflash the whole board, then it only affects the default kernel, which comes from /boot/Image, and it won’t touch the kernel partition.

So the kernel partition is most likely still the NVIDIA stock kernel.

I actually do the kernel customization on my host by cross-compiling Linux for Tegra.
In this case both images would have the adaptations?

Yes, if that is the case, then they should be the same.

Could you clarify what you did in the kernel customization, and also what you did in that “Install the ‘ice’ driver ver. 1.13.7 for my PCIe Intel E810 25Gbit card” step?

In the kernel customization on the host, I add a custom driver for the LT6911.
(This would be reflected in both the extlinux and the kernel partition image.)
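
Roughly, the host-side flow follows the L4T kernel customization docs. This is only a sketch of what we run; the source path matches our JetPack 5 layout, and the toolchain path is a placeholder, so adjust both to your release:

# cross-compile the L4T kernel with the LT6911 driver enabled
export CROSS_COMPILE=<path_to_toolchain>/bin/aarch64-buildroot-linux-gnu-
export TEGRA_KERNEL_OUT=$HOME/l4t_kernel_out
cd Linux_for_Tegra/source/public/kernel/kernel-5.10
make ARCH=arm64 O=$TEGRA_KERNEL_OUT tegra_defconfig
make ARCH=arm64 O=$TEGRA_KERNEL_OUT -j$(nproc) Image dtbs modules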

However, on the target I am installing the ice driver (this could then be the difference between the extlinux and the kernel partition image).
What I do here is just compile the driver from the Intel website as documented in its readme:
cd <path_to_driver>/src
sudo make install

I think this indeed does some kernel customization by installing a .ko module, calling modprobe, touching the initramfs and so on.
If this is the cause of the “corrupted” kernel, fine. But nonetheless I still can’t explain to myself why it keeps working after installation and only stops working randomly some time later…
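
If it helps narrow things down, this is what I plan to compare on the working unit over time; the module names are just the obvious suspects, nothing here is confirmed as the cause:

# does the running kernel still match the installed module tree?
uname -r
ls /lib/modules/
modinfo -F vermagic ice
# was the initramfs touched when the ice driver was installed?
ls -l /boot/initrd*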