Jetson AGX Xavier Intermittent Booting Issue

7/28/2023

NVIDIA Team,

We are having an intermittent booting issue with a Jetson AGX Xavier attached to a Connect Tech Rogue Carrier Board.

The issue is as follows:

Generally speaking, for days or weeks at a time we use our Jetson AGX Xavier + Connect Tech Rogue Carrier Board systems without fail. During this time, we install typical items (python packages, other benign software, etc.) but we do not modify the kernel or boot process in any way. We typically go through hundreds of power cycles and power the boards with multiple different power systems (different types of batteries, power supplies, etc.) none of which produce any problems.

However, after some time in use (~1-2 weeks), a system fails to boot when powered on. At this point we can cycle power repeatedly, but the system never boots again. We are able to connect to the serial console via Minicom, which produces the following log (minicom_fail.cap):
minicom_fail.cap (10.5 KB)

After the point in the log where several blank lines appear in a row, the system hangs and never gets past this point. We cannot access the boot options (i.e. Esc or F11 during boot), as the hang occurs before that point in the boot process.

We are, however, able to place the system into Force Recovery Mode and flash it, which works as intended and allows us to continue our work. After doing this we clone all of our code, install everything we need, and operate the system without issue for 1-2 weeks before we eventually run into the same issue described above.

Some things of note:

  • We have many different Jetson AGX Xaviers and many different Connect Tech Rogue Carrier boards; this same issue has happened on all of them at some point
  • After obtaining a system in the broken state described above, we separated the Jetson AGX Xavier from its Rogue Carrier board and swapped components with a known working combination, at which point we see the issue “following” the Jetson
    • Ex.
    • Setup 1: Jetson AGX Xavier 1 + Rogue 1 = Broken
    • Setup 2: Jetson AGX Xavier 2 + Rogue 2 = Working as intended
    • Setup 3: Jetson AGX Xavier 1 + Rogue 2 = Broken
    • Setup 4: Jetson AGX Xavier 2 + Rogue 1 = Working as intended
      • Here is a serial log from the (working) boot process with this specific combo of items used (minicom_combo.cap):
        minicom_combo.cap (71.8 KB)
  • We have also seen a similar, but different, issue where everything mentioned above still applies except that, rather than hanging, the system automatically attempts to reboot itself in a repeated “reboot loop”. The log for the “reboot loop” can be seen here (minicom_old_fail.cap):
    minicom_old_fail.cap (877.7 KB)
    • We do not have this specific issue with any systems right now (we instead have the “hanging” issue), but I just wanted to mention this as it feels related.
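
The reasoning behind the swap test above can be written down as a small check: a suspect component is one that appears in every broken setup and in no working setup. This is only a sketch of that logic; the function name `suspects` and the dictionary layout are made up for illustration, and the four outcomes are copied from Setups 1-4.

```python
# Outcomes of the four swap tests listed above: (module, carrier) -> booted?
results = {
    ("Xavier 1", "Rogue 1"): False,  # Setup 1: broken
    ("Xavier 2", "Rogue 2"): True,   # Setup 2: working
    ("Xavier 1", "Rogue 2"): False,  # Setup 3: broken
    ("Xavier 2", "Rogue 1"): True,   # Setup 4: working
}


def suspects(results):
    """Return components present in every failing setup and in no working setup."""
    failing = [set(pair) for pair, ok in results.items() if not ok]
    working = [set(pair) for pair, ok in results.items() if ok]
    if not failing:
        return set()
    in_all_failures = set.intersection(*failing)
    in_any_success = set.union(*working) if working else set()
    return in_all_failures - in_any_success


print(suspects(results))  # {'Xavier 1'}
```

With these four outcomes the only component left is the Jetson module from the broken system, which matches the observation that the issue follows the Jetson and not the carrier board.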

Thank you for your help and please let me know if you have any questions!

Hi Ross,

May I know whether any tests were running, or any modifications were made to the board, during this period that might have caused the boot issue?
Or does it simply fail to boot occasionally?

Which Jetpack release are you using?
From some of your failure logs, it appears to be R35.2.1.

Jetson UEFI firmware (version 2.1-32413640 built on 2023-01-24T23:12:27+00:00)

From your combo log with the successful boot, it appears to be R35.3.1.

Jetson UEFI firmware (version 3.1-32827747 built on 2023-03-19T14:56:32+00:00)

Could you clarify whether you are flashing different Jetpack releases?
Or did you flash R35.3.1 at first, and then update the UEFI binary (back to R35.2.1) at some point during these 2 weeks, causing the boot issue?
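
One way to compare captures on this point is to pull the UEFI banner line out of each serial log and list the distinct versions. The sketch below assumes the banner always has the form quoted above; `uefi_versions` is a hypothetical helper name, and the sample text is just the two banner lines from this thread.

```python
import re

# Matches banners like:
# Jetson UEFI firmware (version 2.1-32413640 built on 2023-01-24T23:12:27+00:00)
UEFI_RE = re.compile(r"Jetson UEFI firmware \(version (\S+) built on ([^)]+)\)")


def uefi_versions(log_text):
    """Return distinct (version, build_time) pairs in order of first appearance."""
    seen = []
    for match in UEFI_RE.finditer(log_text):
        pair = (match.group(1), match.group(2))
        if pair not in seen:
            seen.append(pair)
    return seen


sample = """\
Jetson UEFI firmware (version 2.1-32413640 built on 2023-01-24T23:12:27+00:00)
Jetson UEFI firmware (version 3.1-32827747 built on 2023-03-19T14:56:32+00:00)
"""

for version, built in uefi_versions(sample):
    print(version, built)
```

For a real capture, the text could be read with something like `open("minicom_fail.cap", errors="replace").read()`, since serial logs often contain bytes that are not valid UTF-8.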

F11ASSERT [TerminalDxe] /out/nvidia/bootloader/uefi/Jetson_RELEASE/edk2/MdeModulePkg/Universal/Console/TerminalDxe/TerminalConIn.c(2078): ((BOOLEAN)(0==1))

Resetting the system in 5 seconds.
ÿäÿâShutdown state requested 1
Rebooting system ...

In your last log, I found an assertion failure in UEFI that causes the reboot.
It appears to be a known issue in R35.2.1.

I would suggest verifying with the latest R35.3.1.
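
For large captures such as the reboot-loop log, scanning for assertion lines by script may be quicker than reading by eye. The following is a minimal sketch that assumes every assertion follows the `ASSERT [Module] path(line): expression` shape of the line quoted above; `find_asserts` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re

# Shape taken from the quoted line: ASSERT [Module] /path/File.c(1234): expression
ASSERT_RE = re.compile(r"ASSERT \[(\w+)\] (\S+)\((\d+)\): (.*)")


def find_asserts(log_text):
    """Return one dict per assertion line found in the log text."""
    found = []
    for line in log_text.splitlines():
        # search() rather than match(), because stray characters (such as
        # the "F11" above) can precede the word ASSERT on the same line.
        m = ASSERT_RE.search(line)
        if m:
            found.append({
                "module": m.group(1),
                "file": m.group(2),
                "line": int(m.group(3)),
                "expr": m.group(4),
            })
    return found


sample = (
    "F11ASSERT [TerminalDxe] /out/nvidia/bootloader/uefi/Jetson_RELEASE/edk2/"
    "MdeModulePkg/Universal/Console/TerminalDxe/TerminalConIn.c(2078): "
    "((BOOLEAN)(0==1))"
)

for hit in find_asserts(sample):
    print(hit["module"], hit["line"], hit["expr"])
```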

Hi Kevin, thanks for your reply. I’ve filled in answers to your questions below:

During this time, there are various tests running, but nothing that we think could cause a booting issue. These tests are things like using Python to collect images from cameras, storing data to a mounted SSD, etc. Nothing that modifies the boot process.

It appears that, after some period of time, the system becomes corrupted in some way that prevents the boot process from completing. But there isn’t any specific thing we did recently that would cause the problem. We had just been operating the system normally when it became corrupted.

We have had different modules fail with R35.2.1 and R35.1.0. At this point all of our modules now have R35.3.1 flashed onto them. We have not yet seen the issue on a module with R35.3.1 on it, but it requires extensive testing and it’s not easy to say when/if we will see the problem.
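
When tracking which release each module is running, the version can be read from the running system instead of from memory. The sketch below assumes the first line of `/etc/nv_tegra_release` begins with `# R35 (release), REVISION: 3.1`, as it commonly does on L4T images; the helper name `l4t_release` and the shortened sample line are illustrative only and do not come from the systems in this thread.

```python
import re

# Assumed layout of the first line of /etc/nv_tegra_release:
#   # R35 (release), REVISION: 3.1, ...   (remaining fields omitted here)
RELEASE_RE = re.compile(r"#\s*R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)")


def l4t_release(first_line):
    """Return a string such as 'R35.3.1', or None if the line does not match."""
    m = RELEASE_RE.search(first_line)
    if m is None:
        return None
    return f"R{m.group(1)}.{m.group(2)}"


sample = "# R35 (release), REVISION: 3.1"  # illustrative, shortened line
print(l4t_release(sample))  # R35.3.1
```

On a device, the input line could be obtained with `open("/etc/nv_tegra_release").readline()`, and logging the result alongside each failure would make it easier to see whether the issue ever appears on R35.3.1.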

As far as the final “reboot cycle” issue – we haven’t seen this issue recently (we don’t currently have this issue), and we’ve updated to R35.3.1 so hopefully we will not run into the issue any further.

Thank you for your help. Let me know if you have any further ideas, otherwise I will update the thread with any future news as it happens.

From your failure log, boot appears to be stuck in UEFI, so I think the issue is related to UEFI.

Please keep monitoring for any issues with R35.3.1.
You could also use the r35.3.1-updates branch, which includes the latest fixes.
