Bootloader does not fall back to slot A when slot B can't boot (rootfs A/B)

We are experimenting with rootfs A/B to implement our own image-update process.

We plan to use the APT approach (NOT the image-based approach), so our plan is to:

  • flash two partitions with the OS (rootfs A/B)
  • mount the rootfs of the non-current slot X and make changes to it via chroot
  • select the changed slot for booting (nvbootctrl set-active-boot-slot X) and reboot
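For reference, the flow above can be sketched as a shell script. This is only a sketch under assumptions: the partition labels APP (slot A rootfs) and APP_b (slot B rootfs), the mount point /mnt, and the apt commands are illustrative, not taken from our actual setup.

```shell
#!/bin/sh
# Sketch of the planned APT-based update flow (assumptions: standard
# L4T partition labels APP/APP_b, slot ids 0 = A and 1 = B).

other_slot() {
    # Given the current slot id (0 or 1), return the slot to update.
    if [ "$1" = "0" ]; then echo 1; else echo 0; fi
}

# Real flow (run as root; commented out because it touches the device):
#   CUR=$(nvbootctrl get-current-slot)            # 0 or 1
#   TARGET=$(other_slot "$CUR")
#   PART=$([ "$TARGET" = "0" ] && echo APP || echo APP_b)
#   mount "/dev/disk/by-partlabel/$PART" /mnt
#   chroot /mnt apt-get update
#   chroot /mnt apt-get -y upgrade
#   umount /mnt
#   nvbootctrl set-active-boot-slot "$TARGET"     # boot the updated slot next
#   reboot
```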

All of the above has worked so far, but when we test the mechanism to fall back to slot Y if slot X is not bootable, the Xavier hangs and no fallback happens, no matter how long we wait.

To make slot X unbootable we replaced the file /boot/Image once with an empty file and once with a link to a valid Image file. With the empty file in place the board hangs at the NVIDIA splash screen; with the link we got a black screen with no splash whatsoever. So it seems the bootloader does not detect a problem and/or does not fall back to slot Y.

We are using the latest L4T 32.6 tarball and flashed the board via ROOTFS_AB=1 ./flash.sh jetson-xavier mmcblk0p1

Here is our SMD file from bootloader/smd_info.rootfs_AB.cfg. Note that we adjusted MAX_BL_RETRY_COUNT and MAX_ROOTFS_AB_RETRY_COUNT to allow only one failure. That was to reduce wait times in case we had to wait for the bootloader to detect an unsuccessful boot.

# SMD metadata information
< VERSION 5 >
# Set the maximum boot slot retry count
# Please make sure this field is set before slot info config
# The valid setting is 1 to 7
< MAX_BL_RETRY_COUNT 1 >

# Set the maximum rootfs slot retry count
# Please make sure this field is set before slot info config
# The valid setting is 1 to 3
< MAX_ROOTFS_AB_RETRY_COUNT 1 >

#
# Config 1: Disable A/B support (by removing comments ##)
#

# slot info order is important!
# <priority>    <suffix>  <boot_successful>
##15                  _a        1

#
# Config 2: Enable rootfs A/B support (default)
#
< REDUNDANCY_ENABLE 1 >
< ROOTFS_AB 1 >

# To enable rootfs autosync, use < RF_AUTOSYNC_ENABLE 1 >
# This option must be defined after "< ROOTFS_AB 1 >"
##< RF_AUTOSYNC_ENABLE 1 >

# Select rootfs A as the active rootfs
< ROOTFS_ACTIVE_A 1 >
##< ROOTFS_ACTIVE_B 1 >

# Enable/disable unified bootloader AB and rootfs AB
# Set 1 to enable, set 0 to disable. Default is enabled.
# This option must be defined after "< ROOTFS_AB 1 >"
# When < ROOTFS_BL_UNIFIED_AB 1 > is set,
# auto sync for both BL and RF are disabled.
< ROOTFS_BL_UNIFIED_AB 1 >

# To disable bootloader autosync, use < BL_AUTOSYNC_DISABLE 1 >, default is disabled.
# REDUNDANCY_ENABLE or REDUNDANCY_USER must be defined before BL_AUTOSYNC_DISABLE !
< BL_AUTOSYNC_DISABLE 1 >

# slot info order is important!
# <priority>    <suffix>  <boot_successful>
15                  _a        1
14                  _b        1

We applied those adjustments by running ./nv_smd_generator smd_info.rootfs_AB.cfg slot_metadata.bin.rootfsAB

Are there other settings needed for our simple test to succeed?

Hi,
Please refer to this page and get the uart log for reference:
Jetson/General debug - eLinux.org

Do you observe the issue on Xavier developer kit or custom board?

@DaneLLL okay, we will try to get the log via UART, but this could take some days since it seems we need special hardware for this.

We were observing this on a “custom” board; it's the “stevie-xavier” from Diamond Systems (STEVIE™ Carrier and Dev Kit for NVIDIA Jetson AGX Xavier). But we will also check this on the Xavier devkit again.

Just to have this confirmed: are we missing any step in our description above? In other words, should this work on a Xavier devkit the way we are doing it?

hello brootux,

the logic is that once the retry_count is exhausted, CBoot will select the other rootfs slot: it only checks and boots from the next slot when the retry count drops below 0. The default value of Rootfs_Retry_Count is 3.

there’s a background service, nv_update_verifier.service. It first triggers l4t-rootfs-validation-config.service, which provides an interface for users to customize when a boot counts as successful. If the validation script doesn’t exist or returns true, the rootfs is considered to have booted up successfully.
If the rootfs validation is true, nv_update_verifier.service will run /usr/sbin/nv_update_engine --verify; nv_update_engine will increase the retry_count and update the slot status.
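As a hypothetical illustration of such a validation hook (the marker path and the check itself are assumptions for illustration; check l4t-rootfs-validation-config.service on your system for where the actual script must live):

```shell
#!/bin/sh
# Hypothetical rootfs-validation script: exit status 0 means "boot
# successful", so nv_update_verifier.service will go on to run
# /usr/sbin/nv_update_engine --verify.

validate_boot() {
    # Consider the boot good only once our application has written
    # its health-marker file (the path is an assumption).
    [ -f "$1" ]
}

# In the real hook, the function's exit status is the verdict:
#   validate_boot /var/run/my-app-healthy
```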

here’s a see-also topic for your reference: Topic 197124.
thanks

Hello JerryChang,

thanks for this on-point summary. We had already read a lot about nv_update_verifier.service, but this whole mechanism only runs after a kernel has booted successfully. We plan to use it to verify that our deployed application(s) are fine, but it's not useful for guaranteeing that changes to e.g. the bootloader, kernel, device tree, or rootfs are fine.

So what is described in Topic 197124 means that the current bootloader in 32.6 is not capable of what we are currently testing, right?

And only in 32.7 will this be fixed so that the bootloader counts down the retries on a failed boot, is that right?

hello brootux,

you’re talking about bootloader redundancy and also rootfs redundancy.
May I know your test procedure: did you crash slot B intentionally and force the board to boot into slot B for verification?

Hello Jerry Chang,

yes, there is a detailed description at the top of this thread. Here is a short summary:

  • We have a unified bootloader (our understanding of this is that we have two slots, each with its own BL and rootfs)
  • We flashed both slots with the same image via ROOTFS_AB=1 ./flash.sh ...
  • We crashed slot-b by replacing /boot/Image in slot-b-rootfs with an empty file
  • We force it to boot from slot-b via nvbootctrl set-active-boot-slot 1

From what we can observe (only a monitor attached), the bootloader does not retry booting slot B and also does not fall back to slot A, which is untouched and should boot.

hello brootux,

these test steps are incorrect: this is CBoot loading the kernel image via the file system.
Please refer to the CBoot section: with [Kernel Boot Sequence Using extlinux.conf], the kernel binary file is loaded from the LINUX entry; otherwise, the kernel binary is loaded from the kernel partition.

So the correct test steps are to remove the LINUX entry and load the kernel from the partition. You may examine all the partitions as follows, i.e. $ ls -al /dev/disk/by-partlabel, and use the dd command to crash the partition. Reboot the system and check the bootloader logs; it will retry 7 times (the bootloader-side default retry count) and finally boot into the other slot.
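A minimal sketch of that test, assuming the slot-B kernel partition is labeled kernel_b (confirm the name via ls -al /dev/disk/by-partlabel first; the dd is destructive):

```shell
#!/bin/sh
# Sketch of the partition-crash test: make slot B's kernel unloadable.

crash_kernel_partition() {
    # Zero out the first MiB of the given block device so the
    # bootloader can no longer load a valid kernel from it.
    dd if=/dev/zero of="$1" bs=1M count=1 conv=notrunc 2>/dev/null
}

# Real usage (destructive! double-check the target label, run as root):
#   crash_kernel_partition /dev/disk/by-partlabel/kernel_b
#   nvbootctrl set-active-boot-slot 1   # force slot B
#   reboot                              # expect retries, then fallback to A
```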

Hello JerryChang,

thanks for the insight. We flashed again, removed the LINUX entry, and tried crashing the partition by writing all zeroes. With this test we realized that after resetting the board by hand 7 times, we ran into the fallback. So it seems the problem was that we assumed the bootloader would reboot automatically if the kernel couldn't be loaded.

Is there any timeout that reboots the board when the kernel takes too long to load or does not work at all?

hello brootux,

you may dig into the bootloader logs; please set up the serial console via port J501.
The retry count should work by itself (reducing the retry count and reloading the binaries automatically); please also share the logs for reference,
thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.