We are experimenting with rootfs A/B to implement our own image update process.
We plan to use the APT approach (NOT the image approach); for that reason our plan is to:
flash two partitions with the OS (rootfs A/B)
mount the rootfs of the non-current slot X and make changes on it via chroot
select the changed slot for booting (nvbootctrl set-active-boot-slot X) and reboot
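The steps above can be sketched as a script. This is only an illustration of our intended flow, not tested production code; the partition device (/dev/mmcblk0p2 for the slot B APP partition) and the function name are assumptions you would need to adapt to your own layout:

```shell
#!/usr/bin/env bash
# Sketch of the planned APT-based A/B update flow.
# ROOTFS_DEV and TARGET_SLOT are assumptions -- check your partition
# layout (e.g. ls -al /dev/disk/by-partlabel) before running anything.
set -euo pipefail

update_inactive_slot() {
    local rootfs_dev="$1"   # e.g. /dev/mmcblk0p2 for slot B (assumption)
    local target_slot="$2"  # 0 = slot A, 1 = slot B
    local mnt
    mnt="$(mktemp -d)"

    mount "$rootfs_dev" "$mnt"
    # Apply package updates inside the offline rootfs via chroot.
    chroot "$mnt" apt-get update
    chroot "$mnt" apt-get -y upgrade
    umount "$mnt"

    # Mark the updated slot active for the next boot, then reboot.
    nvbootctrl set-active-boot-slot "$target_slot"
    reboot
}

# Example (do NOT run on a live system without checking the device name):
# update_inactive_slot /dev/mmcblk0p2 1
```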
All of the above worked so far. But when we test the mechanism that falls back to slot Y if slot X is not bootable, the Xavier hangs and no fallback happens, no matter how long we wait.
To make slot X unbootable we replaced the file /boot/Image, once with an empty file and once with a link to a valid Image file. With the empty file in place we hang at the NVIDIA splash screen; with the link we get a black screen with no splash whatsoever. So it seems the bootloader does not detect a problem and/or does not fall back to slot Y.
We are using the latest L4T 32.6 tarball and flashed the board via ROOTFS_AB=1 ./flash.sh jetson-xavier mmcblk0p1
Here is our smd file from bootloader/smd_info.rootfs_AB.cfg. Note that we adjusted MAX_BL_RETRY_COUNT and MAX_ROOTFS_AB_RETRY_COUNT to allow only one failure, to reduce wait times in case we had to wait for the bootloader to detect an unsuccessful boot.
# SMD metadata information
< VERSION 5 >
# Set the maximum boot slot retry count
# Please make sure this field is set before slot info config
# The valid setting is 1 to 7
< MAX_BL_RETRY_COUNT 1 >
# Set the maximum rootfs slot retry count
# Please make sure this field is set before slot info config
# The valid setting is 1 to 3
< MAX_ROOTFS_AB_RETRY_COUNT 1 >
#
# Config 1: Disable A/B support (by removing comments ##)
#
# slot info order is important!
# <priority> <suffix> <boot_successful>
##15 _a 1
#
# Config 2: Enable rootfs A/B support (default)
#
< REDUNDANCY_ENABLE 1 >
< ROOTFS_AB 1 >
# To enable rootfs autosync, use < RF_AUTOSYNC_ENABLE 1 >
# This option must be defined after "< ROOTFS_AB 1 >"
##< RF_AUTOSYNC_ENABLE 1 >
# Select rootfs A as the active rootfs
< ROOTFS_ACTIVE_A 1 >
##< ROOTFS_ACTIVE_B 1 >
# Enable/disable unified bootloader AB and rootfs AB
# Set 1 to enable, set 0 to disable. Default is enabled.
# This option must be defined after "< ROOTFS_AB 1 >"
# When < ROOTFS_BL_UNIFIED_AB 1 > is set,
# auto sync for both BL and RF are disabled.
< ROOTFS_BL_UNIFIED_AB 1 >
# To disable bootloader autosync, use < BL_AUTOSYNC_DISABLE 1 >, default is disabled.
# REDUNDANCY_ENABLE or REDUNDANCY_USER must be defined before BL_AUTOSYNC_DISABLE !
< BL_AUTOSYNC_DISABLE 1 >
# slot info order is important!
# <priority> <suffix> <boot_successful>
15 _a 1
14 _b 1
Those adjustments were applied by running ./nv_smd_generator smd_info.rootfs_AB.cfg slot_metadata.bin.rootfsAB
Are there any other settings needed for our simple test to succeed?
The logic is that CBoot only checks and boots from the next rootfs slot once the retry count drops below 0; the default value of Rootfs_Retry_Count is 3.
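That selection rule can be written out as a small simulation. This is not CBoot's actual code; the function name and slot labels are illustrative:

```shell
#!/usr/bin/env bash
# Simulation of the rule "boot the other slot once retry count < 0".
# Each failed boot decrements the counter; default Rootfs_Retry_Count is 3.

select_slot() {
    local retry_count="$1"  # remaining retries for the current slot
    local current="$2"      # current rootfs slot: A or B
    if [ "$retry_count" -lt 0 ]; then
        # Retries exhausted: fall back to the other slot.
        [ "$current" = A ] && echo B || echo A
    else
        echo "$current"
    fi
}

select_slot 3 A    # -> A (retries remain)
select_slot 0 A    # -> A (count must drop below 0 first)
select_slot -1 A   # -> B (fallback)
```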
There is also a background service, nv_update_verifier.service. It first triggers l4t-rootfs-validation-config.service, which provides an interface for users to customize when a boot counts as successful. If the validation script doesn't exist or returns success, the rootfs is considered to have booted successfully.
If the rootfs validation passes, nv_update_verifier.service runs /usr/sbin/nv_update_engine --verify; nv_update_engine then increases the retry_count and updates the slot status.
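The "missing or returns success" rule can be sketched as follows. The script path used here is purely an assumption for illustration, not the path the service actually reads:

```shell
#!/usr/bin/env bash
# If the user validation script does not exist, or exits 0, the boot counts
# as successful and the verify step (nv_update_engine --verify) would run.

rootfs_boot_ok() {
    local script="$1"   # user-supplied validation script (path is an assumption)
    if [ ! -x "$script" ]; then
        return 0        # no script installed: treat the boot as successful
    fi
    "$script"           # otherwise the script's exit code decides
}

if rootfs_boot_ok /etc/nv-rootfs-validation.sh; then
    echo "boot successful -> verify step would run"
fi
```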
Here's a see-also topic for your reference: Topic 197124.
Thanks
Thanks for this on-point summary. We had already read a lot about nv_update_verifier.service, but this whole mechanism only runs after a kernel has booted successfully. We plan to use it to verify that our deployed application(s) are fine, but it's not useful for guaranteeing that changes to e.g. the bootloader, kernel, device tree, or rootfs are fine.
So does what is described in Topic 197124 mean that the current bootloader in 32.6 is not capable of what we are currently testing?
And only in 32.7 will this be fixed so that the bootloader counts down the retries on a failed boot, is that right?
You're talking about both bootloader redundancy and rootfs redundancy.
May I know your test procedure: did you crash slot B intentionally and force the board to boot into slot B for verification?
Yes, there is a detailed description at the top of this thread. Here is a short summary:
We have unified bootloader A/B (our understanding is that we have two slots, each with its own BL and rootfs)
We flashed both slots with the same image via ROOTFS_AB=1 ./flash.sh ...
We crashed slot-b by replacing /boot/Image in slot-b-rootfs with an empty file
We force it to boot from slot-b via nvbootctrl set-active-boot-slot 1
From what we can observe (only a monitor attached), the bootloader does not retry booting slot B and also does not fall back to slot A, which is untouched and should boot.
These are incorrect test steps: this is CBoot loading the kernel image via the file system.
Please refer to the CBoot section, [Kernel Boot Sequence Using extlinux.conf]: the kernel binary is loaded from the LINUX entry; otherwise, the kernel binary is loaded from the kernel partition.
So the correct test steps are to remove the LINUX entry and load the kernel from the partition. You may examine all the partitions as follows, i.e. $ ls -al /dev/disk/by-partlabel, and you should use dd commands to crash the partition. Reboot the system and check the bootloader logs; it will retry 7 times (the bootloader-side default retry count) and finally boot into the other slot.
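To make the corruption step concrete, here is the same dd technique run against a scratch file standing in for the kernel partition. On the board you would target the real device under /dev/disk/by-partlabel instead (the exact partition label depends on your layout; nothing here should be run against a live device without checking it first):

```shell
#!/usr/bin/env bash
set -euo pipefail

# On the board, first list partitions by label:
#   ls -al /dev/disk/by-partlabel

# Here a scratch file stands in for the kernel partition device.
part=scratch-kernel-partition.bin
dd if=/dev/urandom of="$part" bs=1K count=4 2>/dev/null   # fake "valid" contents

# "Crash" the partition by overwriting it with zeroes.
dd if=/dev/zero of="$part" bs=1K count=4 conv=notrunc 2>/dev/null

# Verify it is now all zeroes (cmp exits 0 when the inputs are identical).
cmp -s "$part" <(dd if=/dev/zero bs=1K count=4 2>/dev/null) && echo corrupted
```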
Thanks for the insight. We flashed again and tried crashing the partition by writing all zeroes and removing the LINUX entry. With this test we realized that when we reset the board by hand 7 times, we hit the fallback. So it seems the problem was that we assumed the bootloader would reboot automatically if the kernel couldn't be loaded.
Is there any timeout that reboots the board when a kernel takes too long to load or does not work at all?
You may dig into the bootloader logs; please set up the serial console via port J501.
The retry count should work by itself (reducing the retry count and reloading the binaries automatically); please also share the logs for reference.
thanks