Xavier AGX with redundant A/B rootfs reboots spontaneously on slot 1 instead of 0 on reboot

martin.herren · December 22, 2021, 12:46pm

Hello,

Today, while trying to reproduce a network issue that happens about once every 10 boots on average, I rebooted a Xavier on JetPack 4.6 with redundant A/B rootfs quite a few times. I was only working in ‘slot 0’, i.e. the first rootfs partition.
At one point it spontaneously rebooted into slot 1 without having been told to do so.

Attached is the console log covering the last 2 boots:

last successful boot on slot 0 (log truncated due to limited log buffer, but complete from ‘boot internal storage’)
boot on wrong slot
boot.log (49.5 KB)

The only commands issued were ip a, to check whether the bug had happened, followed by sudo reboot, after which it booted on slot 1.

After booting on slot 1:

ansible@xavier-0:~$ sudo nvbootctrl dump-slots-info
Current bootloader slot: B
Active bootloader slot: B
magic:0x43424e00,             version: 5             features: 53             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 7,             boot_successful: 1

So nothing indicates that it failed to boot slot 0 and fell back to slot 1.
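For checking this after every boot without reading the dump by eye, the output above can be parsed. This is a minimal sketch, not an NVIDIA tool: the sample text is copied from the dump above, `parse_slots` and `preferred_slot` are hypothetical helper names, and it assumes that the slot with the highest priority is the one the bootloader tries first (which matches the dump, where slot 1 has priority 15 and is the active slot).

```python
import re

# Sample copied from the `nvbootctrl dump-slots-info` output above.
SAMPLE = """\
Current bootloader slot: B
Active bootloader slot: B
magic:0x43424e00,             version: 5             features: 53             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 7,             boot_successful: 1
"""

# One match per "slot: N, priority: ..., boot_successful: ..." line.
SLOT_RE = re.compile(
    r"slot:\s*(\d+),\s*priority:\s*(\d+),\s*suffix:\s*(\w+),"
    r"\s*retry_count:\s*(\d+),\s*boot_successful:\s*(\d+)"
)


def parse_slots(text):
    """Return {slot_number: fields} for every slot line found in the dump."""
    slots = {}
    for match in SLOT_RE.finditer(text):
        num, priority, suffix, retry_count, ok = match.groups()
        slots[int(num)] = {
            "priority": int(priority),
            "suffix": suffix,
            "retry_count": int(retry_count),
            "boot_successful": ok == "1",
        }
    return slots


def preferred_slot(slots):
    """Slot with the highest priority (assumed to be the one tried first)."""
    return max(slots, key=lambda num: slots[num]["priority"])


slots = parse_slots(SAMPLE)
print("preferred slot:", preferred_slot(slots))
print("all marked successful:", all(s["boot_successful"] for s in slots.values()))
```

On the device, the text could come from running `sudo nvbootctrl dump-slots-info` through `subprocess.run(..., capture_output=True, text=True)` instead of the hard-coded sample.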

Over several more reboots it kept booting on slot 1, until I issued sudo nvbootctrl set-active-boot-slot 0 and rebooted, after which it booted fine into slot 0 as before. Nothing has been reflashed. It was using the same slot 0 rootfs as during the last few days before the bug occurred.

What could have caused such a spontaneous boot slot change?

Thanks and best regards,

Martin

JerryChang · December 23, 2021, 2:21am

hello martin.herren,

please refer to this thread, Topic 197124.
you should use l4t-rootfs-validation-config.service to validate the rootfs,
thanks

martin.herren · January 8, 2022, 12:42pm

There still seems to be another issue. We have hit it 3 more times in the meantime, which makes us really wary of using the Xavier boards in production, as we are about to do.

The default behavior of l4t-rootfs-validation-config.service when not customized is to return success, which is the case for us.

On the last successful boot of slot 0, nv_update_verifier seems happy:

-- Reboot --
Jan 08 12:17:53 xavier systemd[1]: Started nv_update_verifier service.
Jan 08 12:17:53 xavier nv_update_engine[6195]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:17:53 xavier nv_update_engine[6195]: verifying update with unified a/b enabled
Jan 08 12:17:53 xavier nv_update_engine[6195]: Verify bootloader update begins.
Jan 08 12:17:53 xavier nv_update_engine[6195]: The rotate count has been restored.
Jan 08 12:17:53 xavier nv_update_engine[6195]: The current slot 0 is marked as boot successful
Jan 08 12:17:53 xavier nv_update_engine[6195]: SM: S1
Jan 08 12:17:53 xavier nv_update_engine[6195]: The priority of current slot 0 has been restored.

Nevertheless, after a reboot it boots straight into slot 1, with no failed boot in between:

Jan 08 12:22:41 partb systemd[1]: Started nv_update_verifier service.
Jan 08 12:22:41 partb nv_update_engine[7590]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:22:41 partb nv_update_engine[7590]: verifying update with unified a/b enabled
Jan 08 12:22:41 partb nv_update_engine[7590]: Verify bootloader update begins.
Jan 08 12:22:41 partb nv_update_engine[7590]: The rotate count has been restored.
Jan 08 12:22:41 partb nv_update_engine[7590]: The current slot 1 is marked as boot successful
Jan 08 12:22:41 partb nv_update_engine[7590]: SM: S1
Jan 08 12:22:41 partb nv_update_engine[7590]: The priority of current slot 1 has been restored.

After setting the active slot back to 0 with nvbootctrl (which always reports both slots as successfully booted) and rebooting, it boots again on slot 0 but nv_update_verifier reports differently:

-- Reboot --
Jan 08 12:32:34 xavier systemd[1]: Started nv_update_verifier service.
Jan 08 12:32:34 xavier nv_update_engine[6384]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:32:34 xavier nv_update_engine[6384]: verifying update with unified a/b enabled
Jan 08 12:32:34 xavier nv_update_engine[6384]: Verify bootloader update begins.
Jan 08 12:32:34 xavier nv_update_engine[6384]: The rotate count has been restored.
Jan 08 12:32:34 xavier nv_update_engine[6384]: SM: S21
Jan 08 12:32:34 xavier nv_update_engine[6384]: Checking whether Slot-A/B Redundancy and autosync are enabled.
Jan 08 12:32:34 xavier nv_update_engine[6384]: The retry count of current slot 0 has been restored.
Jan 08 12:32:34 xavier nv_update_engine[6384]: SM: S32
Jan 08 12:32:34 xavier nv_update_engine[6384]: Either Slot-A/B Redundancy or AutoSync is disabled.
Jan 08 12:32:34 xavier nv_update_engine[6384]: Marked current slot 0 boot successful

Subsequent reboots again report as in the first log, until the same sequence repeats and the Xavier decides to boot on slot 1.
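To compare such journal excerpts mechanically, a small parser can pull out the `SM:` states and the slot that was marked successful. This is only a sketch based on the message wording in the logs above. `summarize_verifier_log` is a hypothetical helper, the meaning of the `SM:` state names is not documented here so they are reported as-is, and the two "boot successful" phrasings are assumed to be the only ones that occur.

```python
import re

# The two phrasings seen in the journal excerpts above.
MARKED_RE = re.compile(
    r"The current slot (\d+) is marked as boot successful"
    r"|Marked current slot (\d+) boot successful"
)
STATE_RE = re.compile(r"SM: (S\d+)")

# Shortened copies of the first and third excerpts above.
NORMAL_BOOT = """\
nv_update_engine[6195]: Verify bootloader update begins.
nv_update_engine[6195]: The rotate count has been restored.
nv_update_engine[6195]: The current slot 0 is marked as boot successful
nv_update_engine[6195]: SM: S1
nv_update_engine[6195]: The priority of current slot 0 has been restored.
"""

AFTER_SWITCH_BACK = """\
nv_update_engine[6384]: The rotate count has been restored.
nv_update_engine[6384]: SM: S21
nv_update_engine[6384]: The retry count of current slot 0 has been restored.
nv_update_engine[6384]: SM: S32
nv_update_engine[6384]: Marked current slot 0 boot successful
"""


def summarize_verifier_log(text):
    """Return the SM states in order and the slot marked successful (or None)."""
    marked = None
    for match in MARKED_RE.finditer(text):
        # Exactly one of the two groups is set, depending on the phrasing.
        marked = int(match.group(1) or match.group(2))
    return {"states": STATE_RE.findall(text), "marked_slot": marked}


print(summarize_verifier_log(NORMAL_BOOT))
print(summarize_verifier_log(AFTER_SWITCH_BACK))
```

A `marked_slot` of `None` means no "boot successful" line was found in the text that was passed in, for example because the excerpt ends before the service got that far.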

JerryChang · January 9, 2022, 8:05pm

hello martin.herren,

do you have failure logs for reference?

to clarify,
nvbootctrl is what switches the rootfs slot, l4t-rootfs-validation-config.service validates the rootfs, and update_engine should always mark the slot as boot successful if the system is not in the OTA process.

martin.herren · January 10, 2022, 3:04pm

Thanks, it looks like I had a timing issue.

As I was debugging a race condition in some network initialization, I rebooted the Xavier a lot: logged in, did some checks and collected logs, then rebooted. It seems that sometimes I was too quick and the nv_update_verifier service didn’t have time to validate the system, so it tried to ‘rollback’ to the other slot on the next boot.

Once I switched to an automated ‘stress test’, the uptime between 2 tests became even shorter, so the issue occurred almost systematically.

It seems that once the system is up and running and allows login, the nv_update_verifier service still needs another 1-2 minutes to do its validation; rebooting before that leaves the boot flags (active_boot_slot and boot_successful) in the wrong state. It starts roughly 1 minute after l4t-rootfs-validation-config.service. Since I added a 2-minute sleep after my tests and before rebooting, the issue has not reappeared.
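For an automated reboot test, the fixed sleep could be replaced by a guard that only waits for whatever is missing to reach a minimum uptime. This is a rough sketch under assumptions. The 120-second threshold is simply the 2-minute figure above reused as an uptime limit, which is a looser condition than sleeping 2 minutes after the tests, so it may need to be raised. The helper names are hypothetical, and `/proc/uptime` is the standard Linux uptime file.

```python
import time

# Assumption: 120 s of uptime is enough for nv_update_verifier to finish.
# Taken from the 2-minute sleep mentioned above; tune for your system.
MIN_UPTIME_S = 120.0


def read_uptime(path="/proc/uptime"):
    """Seconds since boot, read from the first field of /proc/uptime."""
    with open(path) as f:
        return float(f.read().split()[0])


def seconds_to_wait(uptime_s, min_uptime_s=MIN_UPTIME_S):
    """How long to keep waiting before a reboot is considered safe."""
    return max(0.0, min_uptime_s - uptime_s)


def wait_for_min_uptime(min_uptime_s=MIN_UPTIME_S):
    """Block until the system has been up for at least min_uptime_s seconds."""
    remaining = seconds_to_wait(read_uptime(), min_uptime_s)
    if remaining > 0:
        time.sleep(remaining)


# Example: 30 s after boot, 90 s are still missing.
print(seconds_to_wait(30.0))
```

A test script would call `wait_for_min_uptime()` as its last step before issuing `sudo reboot`.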

A few more remarks:

If the current slot is marked as not successfully booted, running sudo nvbootctrl mark-boot-successful doesn’t set it as successful; it only resets retry_count to 7. Maybe this is just a consequence of the timing issue above.
sudo systemctl status l4t-rootfs-validation-config.service reports the service as bad-config instead of enabled, because its unit file is located under /opt/nvidia/l4t-rootfs-validation-config/ and not /etc/systemd/system. This is not a big issue since it still runs successfully on boot, but it cannot be enabled or disabled with systemctl, which can be confusing while debugging.

Thanks for your support.

system · January 24, 2022, 3:05pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
