Orin NX reboots repeatedly but switches to slot B unexpectedly

hi

JP 5.1.3

HW: Orin NX 16GB

I ran a test that reboots the Orin NX repeatedly, every 20 seconds after nv-l4t-bootloader-config.service has started successfully on slot A. However, it unexpectedly switched to slot B and reported that the bootloader of slot A was unbootable.
Then I powered the Orin NX module off and on again, and the bootloader of slot A automatically returned to bootable. How did that happen?

rest_reboot.log (41.3 KB)

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, it appears that the Jetson Orin NX module is experiencing an issue with the bootloader, where it repeatedly reboots every 20 seconds after successfully booting up with nv-l4t-bootloader-config.service at slot A. However, it unexpectedly switches to slot B and reports that the bootloader of slot A is unbootable.

This behavior suggests that there might be an issue with the bootloader configuration or the slot switching mechanism. Here are a few possible explanations:

  1. Bootloader configuration issue: nv-l4t-bootloader-config.service might be configuring the bootloader incorrectly, leading to the repeated reboots and slot switching. It’s possible that the configuration is not being persisted correctly, causing the bootloader to switch to slot B.
  2. Slot switching mechanism issue: The slot switching mechanism might be malfunctioning, causing the bootloader to switch to slot B unexpectedly. This could be due to a hardware or firmware issue.
  3. Power cycle issue: The fact that the bootloader of slot A returns to bootable automatically after powering off and on the Orin NX module suggests that there might be a power cycle-related issue. It’s possible that the power cycle is causing the bootloader to reset or reconfigure itself.

To troubleshoot this issue, you can try the following steps:

  1. Check the bootloader configuration: Verify that nv-l4t-bootloader-config.service is configuring the bootloader correctly. You can check the configuration files and logs to ensure that the configuration is being applied correctly.
  2. Check the slot switching mechanism: Investigate the slot switching mechanism to ensure that it is functioning correctly. You can check the logs and configuration files to see if there are any errors or issues related to slot switching.
  3. Check the power cycle: Verify that the power cycle is not causing any issues with the bootloader. You can try powering off and on the Orin NX module multiple times to see if the issue persists.
  4. Check the bootloader logs: Check the bootloader logs to see if there are any errors or issues related to the repeated reboots and slot switching.
  5. Try updating the bootloader: Try updating the bootloader to the latest version to see if it resolves the issue.

In comparison to other Jetson modules, the Orin NX module has a similar bootloader configuration and slot switching mechanism. However, the specific issue you are experiencing might be unique to the Orin NX module or the specific hardware configuration.

To provide more detailed assistance, I would need more information about the specific hardware configuration, the bootloader configuration, and the logs related to the issue.


nv-l4t-bootloader-config.service had started successfully.

hello smileandcry2023,

May I also know the detailed steps you use to reboot the Orin NX repeatedly?

BTW, you may try moving to the latest JP-5 release, or apply the fixes from Assertion issue in UEFI during boot for testing.

1. I added a test service that starts after nv-l4t-bootloader-config.service and calls the shell script below.

2. On every boot, it records the current slot information to rest_reboot.log.

#!/bin/bash

res1=$(nvbootctrl dump-slots-info)
res2=$(nvbootctrl dump-slots-info -t rootfs)

echo "************ ${res1}" >> /app/data/rest_reboot.log

echo "************ ${res2}" >> /app/data/rest_reboot.log

sleep 20

if [ ! -e /dev/sda1 ]; then
    echo ""
    reboot
fi
exit 0
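
To locate the point in rest_reboot.log where the slot changed, the appended entries can be scanned for changes in the reported bootloader slot. This is only a sketch: the function name slot_transitions is hypothetical, and it assumes that the dump-slots-info output contains a line of the form "Current bootloader slot: A" (or B). If the output on your release differs, the pattern needs adjusting.

```shell
#!/bin/bash
# Hypothetical helper: print every change of bootloader slot found in a log.
# ASSUMPTION: each logged dump contains a line "Current bootloader slot: <A|B>".
# "entry N" is the index of the matching line, not necessarily the boot count.
slot_transitions() {
    grep -o 'Current bootloader slot: [AB]' "$1" |
        awk '{
            slot = $NF
            if (NR > 1 && slot != prev) print "entry " NR ": " prev " -> " slot
            prev = slot
        }'
}
```

Usage would be `slot_transitions /app/data/rest_reboot.log`; an empty result means no slot change was recorded.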

hello smileandcry2023,

Actually, systemd starts many services to initialize the system after it has booted.
You should run $ sudo systemctl status nv-l4t-bootloader-config to confirm that the background service has executed successfully (status=0/SUCCESS) before rebooting the system.
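
The status check can also be done from inside the test script, so that it reboots only when the background service has finished cleanly. A minimal sketch, assuming the `Result` and `ExecMainStatus` properties reported by `systemctl show` are an acceptable success criterion; the function name bootloader_config_ok and the SYSTEMCTL override are hypothetical and not part of L4T.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the unit reports Result=success and
# exit status 0 (shown by `systemctl status` as status=0/SUCCESS).
# SYSTEMCTL may be overridden for testing; the real systemctl is the default.
bootloader_config_ok() {
    local unit="${1:-nv-l4t-bootloader-config.service}"
    local result status
    result=$("${SYSTEMCTL:-systemctl}" show --property=Result --value "$unit") || return 1
    status=$("${SYSTEMCTL:-systemctl}" show --property=ExecMainStatus --value "$unit") || return 1
    [ "$result" = "success" ] && [ "$status" = "0" ]
}

# Intended use in the test script (not executed here):
#   if bootloader_config_ok; then reboot; fi
```

This assumes the unit has already run by the time the check is made, for example because the calling service is ordered after it. A unit that has not started yet may report the same default values, so the check does not replace the ordering.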

I added After=nv-l4t-bootloader-config to my service. If that service failed to run, my service would be called.

hello smileandcry2023,

Please add a status check for the background service.
If possible, please move to the latest JP-5 release, since it includes several bug fixes.

test_reboot_20250108_v1.log (26.6 KB)

"sudo systemctl status nv-l4t-bootloader-config" shows that this service started successfully, but the system still switched from A to B unexpectedly.

Current version: 35.5.0
As mentioned, you may try moving to the latest JP-5 release (jetson-linux-r3562), or apply the fixes from Assertion issue in UEFI during boot for testing.

We face a very similar issue. We are on 35.6.2, and the assertions were addressed. Initially we thought this could come from our custom board, but now we are able to reproduce it on an official devkit.

From our current investigation, 25% of the modules seem to be affected by this boot swap issue; the others do not show this phenomenon.

Can the internal team try to reproduce this? Surely you must have some stress test environment available.

We are doing some tests to locate where the issue is. Have you solved it?

Sadly no, and this impacts the production line, so many people from different departments in my company have been working for months to identify the culprit.

We have set up a bench of stress tests and tried countless experiments, but no luck so far. Lately we were able to reproduce it on an official devkit, and since you face a similar problem…

Current knowledge points toward an issue in the PCIe/NVMe driver within UEFI, but who knows.

Do you use an SSD?

Yep, that’s also what we observe. The ‘unbootable’ flag for the rootfs seems to persist across a power cycle, unlike the ‘unbootable’ flag for the bootloader, which is cleared at power cycles.

Also, when the error happens twice during stress tests (A→B→A) without a power cycle, bootloader A is first marked unbootable and the system boots on B; when the error occurs again, it reboots on A (which is marked as normal again) and B gets marked unbootable.

As Séb said, it affects somewhere between 15% and 25% of our Orin NX 16GB modules. We have reproduced it on an official NVIDIA devkit as well. The average number of reboots between occurrences is somewhere in the 700 range; it varies between 60 and 6000.

hello martin.herren,

According to the above, may I know the SKUs of those specific Orin modules?
Please also share the test steps for reference; we would like to reproduce this issue locally.

Yes, we use an SSD.

Have you reproduced this problem?

Currently out of 7 impacted modules:

  • 1x 699-13767-0000-301 H.1
  • 2x 699-13767-0000-301 G.1
  • 2x 699-13767-0000-303 B.1
  • 2x 699-13767-0000-303 C.1

We have other 699-13767-0000-301 G.1 modules that we stress tested for over 15’000 reboot loops without any issue. So there is no specific board revision that shows the issue systematically. I am trying to get more info from production about the quarantined modules to have a more complete list of board revisions for impacted modules.

The test setup is simple: the modules are flashed with ROOTFS_AB, and a script installed on both the A and B slots checks the current slot, logs if it is not the same as the previous slot (including the number of reboots since the last slot switch), and reboots. The script always waits for an uptime of 1 minute before running, so as not to trigger the built-in rollback feature.
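
For anyone wanting to reproduce this, a loop of that kind could look roughly like the following. It is a sketch under assumptions, not the script actually used here: the file paths and function names are hypothetical, the state file must live on storage shared by both rootfs slots, and it assumes `nvbootctrl get-current-slot` prints the current slot number.

```shell
#!/bin/bash
# Hypothetical paths; both must be on a partition visible from slot A and slot B.
STATE_FILE="${STATE_FILE:-/app/data/slot-watch.state}"
LOG_FILE="${LOG_FILE:-/app/data/slot-watch.log}"

# Compare the current slot with the one saved on the previous boot.
# The state file holds: "<last_slot> <boots_since_last_switch>".
record_slot() {
    local current="$1" last="" count=0
    if [ -r "$STATE_FILE" ]; then
        read -r last count < "$STATE_FILE" || true
    fi
    count=${count:-0}
    if [ -n "$last" ] && [ "$last" != "$current" ]; then
        echo "slot switch: $last -> $current after $count reboots" >> "$LOG_FILE"
        count=0
    fi
    count=$((count + 1))
    echo "$current $count" > "$STATE_FILE"
}

# Block until the system has been up for at least $1 seconds (reads /proc/uptime).
wait_for_uptime() {
    local min="$1" up
    while :; do
        up=$(cut -d. -f1 /proc/uptime)
        [ "$up" -ge "$min" ] && break
        sleep 1
    done
}

# Intended use on the device (not executed here):
#   wait_for_uptime 60
#   record_slot "$(nvbootctrl get-current-slot)"
#   reboot
```

Keeping the counter in a single state file makes the "reboots since last switch" figure survive the switch itself, provided the file is not stored on either rootfs.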

Impacted modules have an average of a few hundred reboots between occurrences, sometimes fewer. Unaffected modules have done over 10k reboots without an issue. We tried both soft reboots with the script and more complex setups with power off and external power cycling. On impacted modules the issue happens with both soft and hard reboots.

Updated with an additional board revision

According to some boot logs from when the issue arises, it might well come from an NVMe initialization error during UEFI.