Xavier AGX with redundant A/B rootfs spontaneously boots slot 1 instead of slot 0 on reboot

Hello,

Today, while trying to reproduce a network issue that happens about once every 10 boots on average, I rebooted a Xavier on JetPack 4.6 with redundant A/B rootfs quite a few times. I was only working in ‘slot 0’, i.e. the first rootfs partition.
At one point it spontaneously booted into slot 1 on reboot, without having been told to do so.

Attached the console log covering last 2 boots:

  • last successful boot on slot 0 (log truncated due to limited log buffer, but complete from ‘boot internal storage’)
  • boot on wrong slot
    boot.log (49.5 KB)

The only commands issued were ip a, to check whether the bug had happened, followed by a sudo reboot, after which it booted on slot 1.

After booting on slot 1:

ansible@xavier-0:~$ sudo nvbootctrl dump-slots-info
Current bootloader slot: B
Active bootloader slot: B
magic:0x43424e00,             version: 5             features: 53             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 7,             boot_successful: 1

So nothing seems to show that it failed to boot slot 0 and fell back to slot 1.
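For anyone who wants to check these flags automatically between reboots, here is a minimal Python sketch that parses the dump-slots-info text shown above. The function name parse_slots and the SAMPLE string are only illustrative, and the field layout is assumed to match the output of this particular nvbootctrl version. Treating the higher priority as the slot that is tried first is also an assumption, although it agrees with the dump above, where slot 1 (priority 15) is the active slot B.

```python
import re

# One "slot: N, priority: P, suffix: _x, retry_count: R, boot_successful: S" line.
SLOT_RE = re.compile(
    r"slot:\s*(\d+),\s*priority:\s*(\d+),\s*suffix:\s*(\w+),"
    r"\s*retry_count:\s*(\d+),\s*boot_successful:\s*(\d+)"
)


def parse_slots(dump):
    """Parse the text printed by `nvbootctrl dump-slots-info` into a dict."""
    info = {"current": None, "active": None, "slots": {}}
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("Current bootloader slot:"):
            info["current"] = line.split(":", 1)[1].strip()
        elif line.startswith("Active bootloader slot:"):
            info["active"] = line.split(":", 1)[1].strip()
        else:
            m = SLOT_RE.match(line)
            if m:
                num, prio, suffix, retry, ok = m.groups()
                info["slots"][int(num)] = {
                    "priority": int(prio),
                    "suffix": suffix,
                    "retry_count": int(retry),
                    "boot_successful": ok == "1",
                }
    return info


# Sample input copied from the output posted above.
SAMPLE = """\
Current bootloader slot: B
Active bootloader slot: B
magic:0x43424e00,             version: 5             features: 53             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 7,             boot_successful: 1
"""

info = parse_slots(SAMPLE)
# Assumption: the slot with the highest priority is the one tried first.
preferred = max(info["slots"], key=lambda n: info["slots"][n]["priority"])
print("current:", info["current"], "preferred slot:", preferred)
```

On the real board the input would come from running sudo nvbootctrl dump-slots-info and capturing its output; the sketch deliberately works on plain text so it can be tried anywhere.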

Over several more reboots it kept booting on slot 1, until I issued a sudo nvbootctrl set-active-boot-slot 0 and rebooted, after which it booted fine into slot 0 as before. Nothing has been reflashed. It was using the same slot 0 rootfs as during the last few days before the bug occurred.

What could have caused such a spontaneous boot slot change?

Thanks and best regards,

Martin

hello martin.herren,

please refer to this thread, Topic 197124.
you should use l4t-rootfs-validation-config.service to validate the rootfs,
thanks

There still seems to be another issue. We have hit the problem 3 more times in the meantime, which makes us really scared to use the Xavier boards in production, as we are soon supposed to do.

The default behavior of l4t-rootfs-validation-config.service, when not customized, is to return success, which is the case for us.

On the last successful boot of slot 0, nv_update_verifier seems happy:

-- Reboot --
Jan 08 12:17:53 xavier systemd[1]: Started nv_update_verifier service.
Jan 08 12:17:53 xavier nv_update_engine[6195]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:17:53 xavier nv_update_engine[6195]: verifying update with unified a/b enabled
Jan 08 12:17:53 xavier nv_update_engine[6195]: Verify bootloader update begins.
Jan 08 12:17:53 xavier nv_update_engine[6195]: The rotate count has been restored.
Jan 08 12:17:53 xavier nv_update_engine[6195]: The current slot 0 is marked as boot successful
Jan 08 12:17:53 xavier nv_update_engine[6195]: SM: S1
Jan 08 12:17:53 xavier nv_update_engine[6195]: The priority of current slot 0 has been restored.

Nevertheless after a reboot it boots straight into slot 1, no failed boot in between:

Jan 08 12:22:41 partb systemd[1]: Started nv_update_verifier service.
Jan 08 12:22:41 partb nv_update_engine[7590]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:22:41 partb nv_update_engine[7590]: verifying update with unified a/b enabled
Jan 08 12:22:41 partb nv_update_engine[7590]: Verify bootloader update begins.
Jan 08 12:22:41 partb nv_update_engine[7590]: The rotate count has been restored.
Jan 08 12:22:41 partb nv_update_engine[7590]: The current slot 1 is marked as boot successful
Jan 08 12:22:41 partb nv_update_engine[7590]: SM: S1
Jan 08 12:22:41 partb nv_update_engine[7590]: The priority of current slot 1 has been restored.

After setting the active slot back to 0 with nvbootctrl (which always reports both slots as successfully booted) and rebooting, it boots again on slot 0 but nv_update_verifier reports differently:

-- Reboot --
Jan 08 12:32:34 xavier systemd[1]: Started nv_update_verifier service.
Jan 08 12:32:34 xavier nv_update_engine[6384]: Nvidia A/B-Redundancy Update tool Version 2.1
Jan 08 12:32:34 xavier nv_update_engine[6384]: verifying update with unified a/b enabled
Jan 08 12:32:34 xavier nv_update_engine[6384]: Verify bootloader update begins.
Jan 08 12:32:34 xavier nv_update_engine[6384]: The rotate count has been restored.
Jan 08 12:32:34 xavier nv_update_engine[6384]: SM: S21
Jan 08 12:32:34 xavier nv_update_engine[6384]: Checking whether Slot-A/B Redundancy and autosync are enabled.
Jan 08 12:32:34 xavier nv_update_engine[6384]: The retry count of current slot 0 has been restored.
Jan 08 12:32:34 xavier nv_update_engine[6384]: SM: S32
Jan 08 12:32:34 xavier nv_update_engine[6384]: Either Slot-A/B Redundancy or AutoSync is disabled.
Jan 08 12:32:34 xavier nv_update_engine[6384]: Marked current slot 0 boot successful

Subsequent reboots report the same messages as in the first log again, until the same sequence repeats and the Xavier decides to boot on slot 1.

hello martin.herren,

do you have failure logs for reference?

to clarify,
nvbootctrl is used to switch the rootfs slot, and l4t-rootfs-validation-config.service is used to validate the rootfs. update_engine should always mark the slot as boot successful if the system is not in the OTA process.

Thanks, it looks like I had a timing issue.

As I was debugging a race condition in some network initialization, I rebooted the Xavier a lot: logged in, did some checks/logs, and rebooted it. Sometimes, it seems, I was too quick and the nv_update_verifier service didn’t have time to validate the system, so it tried to ‘roll back’ to the other slot on the next boot.

Once I switched to an automatic ‘stress test’, the uptime between 2 tests was even shorter, and so the issue appeared almost systematically.

It seems that once the system is up and running and allows login, the nv_update_verifier service still needs 1-2 more minutes to do its validations. If the system is rebooted before that, the boot flags (active_boot_slot and boot_successful) end up wrong. The service starts about 1 minute after l4t-rootfs-validation-config.service. Since I added a 2-minute sleep after my tests and before rebooting, the issue has not reappeared.
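As an alternative to a fixed 2-minute sleep, a test script could poll the journal until the verifier has logged that the slot is marked successful, and only then reboot. The Python sketch below is one possible way to do that, under several assumptions: the unit name nv_update_verifier.service is guessed from the "Started nv_update_verifier service." line, the two message patterns are taken from the logs quoted in this thread (tool version 2.1) and may differ in other releases, and the thread does not confirm that the flags are fully written at the moment the message appears. The function names verified_slot and wait_for_verifier are hypothetical.

```python
import re
import subprocess
import time

# Both phrasings appear in the nv_update_engine logs quoted in this thread.
SUCCESS_RE = re.compile(
    r"(?:current slot (\d+) is marked as boot successful"
    r"|Marked current slot (\d+) boot successful)"
)


def verified_slot(journal_lines):
    """Return the slot number the verifier marked successful, or None."""
    for line in journal_lines:
        m = SUCCESS_RE.search(line)
        if m:
            return int(m.group(1) or m.group(2))
    return None


def wait_for_verifier(unit="nv_update_verifier.service", timeout=300, poll=5):
    """Poll the current boot's journal for `unit` until success is logged.

    The unit name is an assumption. Returns the slot number, or None if
    nothing was logged before the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["journalctl", "-b", "-u", unit, "--no-pager", "-o", "cat"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,  # works on Python 3.6 as well
        ).stdout
        slot = verified_slot(out.splitlines())
        if slot is not None:
            return slot
        time.sleep(poll)
    return None


# Offline check against a line copied from the logs above.
sample = [
    "Jan 08 12:17:53 xavier nv_update_engine[6195]: Verify bootloader update begins.",
    "Jan 08 12:17:53 xavier nv_update_engine[6195]: The current slot 0 is marked as boot successful",
]
print(verified_slot(sample))
```

In a stress-test loop, the idea would be to call wait_for_verifier() after the checks and issue sudo reboot only when it returns a slot number. Adding a small extra delay afterwards may still be prudent, since the log line is only a proxy for the flags having been stored.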

A few more remarks:

  • If the current slot is marked as not successfully booted, doing a sudo nvbootctrl mark-boot-successful doesn’t set it as successful; it only sets the retry_count back to 7. Maybe this is just a consequence of the previous timing issue.
  • sudo systemctl status l4t-rootfs-validation-config.service reports the service as bad-config instead of enabled, because its unit file is located under /opt/nvidia/l4t-rootfs-validation-config/ and not under /etc/systemd/system. This is not a big issue, as the service still runs successfully on boot, but it cannot be enabled/disabled with systemctl, which can be confusing while debugging.

Thanks for your support.
