Rootfs A/B redundancy fail-over mechanism in JetPack 5.1

Several topics on this forum ask about fail-over rootfs slot switching in JetPack 5.1.
Because many users have asked for this, this post shares some tips and the verified flow.

Verification steps on the Xavier NX devkit with eMMC

Step 1: Flash the board with rootfs A/B enabled 
$sudo ROOTFS_AB=1 ./flash.sh jetson-xavier-nx-devkit-emmc mmcblk0p1

Step 2: After boot up, check current slot status
$sudo nvbootctrl -t rootfs dump-slots-info
Current rootfs slot: A
Active rootfs slot: A
num_slots: 2
slot: 0,             retry_count: 3,             status: normal
slot: 1,             retry_count: 3,             status: normal

Step 3: Try to corrupt current file system (current slot: A)
$sudo rm -rf /lib

Step 4: Reset the board
Power-cycle the board to reset it.

Step 5: Observe the rootfs A/B fail-over mechanism
5-1. The boot hits a kernel panic because the filesystem is corrupted.
5-2. The watchdog triggers a reset after 120 s.
5-3. The board retries booting rootfs A (slot 0) 3 times in total.
5-4. UEFI finds rootfs A unbootable (it was tried 3 times and failed) and triggers a reboot to switch the rootfs slot.
5-5. The board switches to rootfs B.
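
The sequence in Step 5 can be pictured with a small toy model. This is only a sketch, not the UEFI implementation: it assumes that every failed boot of slot A uses up one retry and ends with one watchdog timeout, and the function name `simulate_failover` is made up for illustration.

```shell
# simulate_failover RETRIES TIMEOUT_S: toy model of Step 5, not the UEFI code.
# Assumption: each failed boot of slot A consumes one retry and ends with a
# watchdog reset after TIMEOUT_S seconds; at 0 retries the slot is unbootable.
simulate_failover() {
    retries=$1
    timeout_s=$2
    waited=0
    attempt=0
    while [ "$retries" -gt 0 ]; do
        attempt=$((attempt + 1))
        retries=$((retries - 1))
        waited=$((waited + timeout_s))
        echo "attempt $attempt on slot A failed, retry_count now $retries"
    done
    echo "slot A unbootable after ${waited}s of watchdog timeouts, switching to slot B"
}

simulate_failover 3 120
```

Under these assumptions, the switch to rootfs B takes at least 3 x 120 s = 360 s of watchdog timeouts, plus the time of each boot attempt.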

Step 6: After boot up, check current slot status again
$sudo nvbootctrl -t rootfs dump-slots-info
Current rootfs slot: B
Active rootfs slot: B
num_slots: 2
slot: 0,             retry_count: 0,             status: unbootable
slot: 1,             retry_count: 3,             status: normal
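
To run this check from a script, the output can be parsed with awk. This is a minimal sketch that assumes the output format matches the sample shown above; `slot_status` is a hypothetical helper, not part of nvbootctrl.

```shell
# slot_status N: read `nvbootctrl -t rootfs dump-slots-info` output on stdin
# and print the status field of slot N (0 = A, 1 = B).
# Assumes lines look like: "slot: 0,   retry_count: 3,   status: normal"
slot_status() {
    awk -v want="$1" '
        $1 == "slot:" {
            n = $2; sub(/,/, "", n)             # "0," -> "0"
            if (n == want) { print $NF; exit }  # last field is the status
        }'
}

# On the target:
#   sudo nvbootctrl -t rootfs dump-slots-info | slot_status 0
```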

The flash log and the full serial console log are attached below for reference.
flash.log (70.4 KB)
serial.log (273.2 KB)

Methods to restore a corrupted rootfs slot

1. UEFI menu

a. Press `ESC` to enter the UEFI menu.
b. Choose Device Manager -> NVIDIA Configuration -> L4T Configuration.
c. OS chain A status: the value is Unbootable if UEFI attempted to boot the recovery kernel; change it to Normal.
d. Save and exit, then reboot. UEFI will try Direct Boot.

2. User space command

From user space, you can restore the UEFI variable RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9 or RootfsStatusSlotB-781e084c-a330-417c-b678-38e696380cb9 by writing the value 0 to it.

2-1. For AGX Xavier and the devices without QSPI flash:

a. Mount the esp partition at /opt/nvidia/esp.
b. Write the variable to the esp partition:
    $cd /opt/nvidia/esp/EFI/NVDA/Variables/
    $printf "\x07\x00\x00\x00\x00\x00\x00\x00" > /tmp/var_tmp.bin
    $sudo dd if=/tmp/var_tmp.bin of=RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9
c. Reboot. When the system boots to UEFI, UEFI writes the RootfsStatusSlotA value to the uefi_variable partition.
d. After the system boots to a rootfs successfully (for example, restoring the rootfs A status and then booting to rootfs B), you can check that RootfsStatusSlotA is restored.

2-2. For other devices with QSPI flash:

a. Write the variable to efivars:
    $cd /sys/firmware/efi/efivars/
    $printf "\x07\x00\x00\x00\x00\x00\x00\x00" > /tmp/var_tmp.bin
    $sudo chattr -i RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9
    $sudo dd if=/tmp/var_tmp.bin of=RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9
b. The RootfsStatusSlotA variable is restored immediately.
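
To confirm what was written, the variable file can be dumped byte by byte. This is a minimal sketch that assumes the 8-byte layout produced by the printf above (4 attribute bytes followed by 4 value bytes); `read_rootfs_status` is a hypothetical helper name.

```shell
# read_rootfs_status FILE: print the bytes of a UEFI variable file in file order.
# Assumes the 8-byte layout written by the printf above:
# 4 bytes of attributes followed by 4 bytes of value.
read_rootfs_status() {
    attrs=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    value=$(od -An -tx1 -j4 -N4 "$1" | tr -d ' \n')
    echo "attributes=$attrs value=$value"
}

# On the target:
#   read_rootfs_status /sys/firmware/efi/efivars/RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9
```

After a successful restore, the value bytes should all be 00, matching the value 0 written above.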

Known Issues

1. Xavier NX with an SD module may not work.

The watchdog is disabled by default on this module. We are still looking for the cause. As a quick workaround, you can refer to this thread to enable it manually.

2. The “endless reboot” in this use case.

This is caused by a bug in UEFI, and we have found the root cause. The fix is under verification and may be included in a later JetPack release.

Thank you @KevinFFF. When can we expect the JetPack release that fixes the “endless reboot” issue?

Thank you @KevinFFF for the support. This solution seems to be working.

If you want to make both slots bootable again, I suggest the following code snippet:

For AGX Xavier and devices without QSPI

    mount PARTLABEL=esp /opt/nvidia/esp
    for ROOTFS_STATUS in RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9 RootfsStatusSlotB-781e084c-a330-417c-b678-38e696380cb9; do
        if [ -e "/opt/nvidia/esp/EFI/NVDA/Variables/$ROOTFS_STATUS" ]; then
            chattr -i "/opt/nvidia/esp/EFI/NVDA/Variables/$ROOTFS_STATUS"
            printf "\x07\x00\x00\x00\x00\x00\x00\x00" | dd of="/opt/nvidia/esp/EFI/NVDA/Variables/$ROOTFS_STATUS" oflag=sync
        fi
    done

With QSPI

    for ROOTFS_STATUS in RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9 RootfsStatusSlotB-781e084c-a330-417c-b678-38e696380cb9; do
        if [ -e "/sys/firmware/efi/efivars/$ROOTFS_STATUS" ]; then
            chattr -i "/sys/firmware/efi/efivars/$ROOTFS_STATUS"
            printf "\x07\x00\x00\x00\x00\x00\x00\x00" | dd of="/sys/firmware/efi/efivars/$ROOTFS_STATUS" oflag=sync
        fi
    done

Explanation

  • Only use the PARTLABEL option if you have a single disk with this partition. If two partitions have the same label, one of them is selected at random.
  • Some systems have a read-only filesystem, so /tmp might not be writable; this is why the snippet pipes printf directly into dd instead of using a temporary file.
  • Since the boot slot is usually changed right before rebooting, oflag=sync makes sure that the data has actually been written rather than cached and lost.

May I suggest that this step either be added to the nvbootctrl command “set-active-boot-slot” or be added as a separate command of the tool? This is already the second time we have had to use the “printf” workaround.

Thanks a lot, KevinFFF. I will try this solution (I think I tried something similar and it did not work, but I might have skipped something).

Regards,
Alvaro.

@KevinFFF The issues are not listed in the Known Issues section of 5.1.1. Does that mean they are all fixed there?

I’ve tested the new version. The bugs are not fixed.
So they are targeted for 5.2?

@KevinFFF Any update on the state of your fix?

As it is not implemented in 5.1.1, will it be in 5.2, or will you add a version 5.1.2 due to its importance?

@WayneWWW @JerryChang

Can anyone comment on this issue?

Let me update the current status:

Issue 1 affects only the devkit with the SD module, not the production module. Only the workaround is available for this issue.

Issue 2 will be fixed in the next release (JP5.1.2).

Hi,

I am working to get fail-over to work on both JetPack 4.6 and JetPack 5.1 on NVMe.
We are working on a product, and OS fail-over is important for our application.
I followed the steps for setting up a redundant rootfs.

On JetPack 5.1.1, when I check the status of the watchdog using this command, it is disabled:
cat /proc/device-tree/watchdog@30c0000/status
I repeated the test on JetPack 5.1.2 on NVMe, and the watchdog is still disabled.
I also tested JetPack 5.1.3; on NVMe it does not even boot, so the problem appears before any redundant rootfs steps.
On JetPack 4.6.3 on NVMe, the OS does not boot after running the redundant rootfs command (note that this happens before corrupting the rootfs for testing).

Redundant rootfs did not work on any of these JetPack versions.
I am mainly interested in JetPack 5.1.1, since a lot of our products are flashed with this version. I don’t know how to enable the watchdog flag. I checked the thread under known issue 1, but I still don’t know how to fix this problem. Any instructions are appreciated.

Regards,
Farough

The watchdog can be enabled through the device tree.
Please share the full dmesg output and the device tree for further checking.
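
For reference, the watchdog status property mentioned earlier in this thread can be read from a script as follows. This is a minimal sketch: `dt_status` is a hypothetical helper, and the node path is the one quoted in the post above, which may differ on other modules.

```shell
# dt_status FILE: print a device-tree "status" property as a normal text line.
# Device-tree string properties end with a NUL byte, which is stripped here.
dt_status() {
    printf '%s\n' "$(tr -d '\0' < "$1")"
}

# On the target:
#   dt_status /proc/device-tree/watchdog@30c0000/status
# "okay" means the node is enabled; "disabled" means it is not.
```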

Where can I find documentation on how to modify the device tree? Do I need to re-flash the Jetson?
What change do I need to make to enable the watchdog?

Thanks

Please use the dtc tool to decompile the dtb in /boot/dtb/kernel_XXX.dtb to dts on your board.
After modifying the device tree, run dtc again to compile it back to dtb, then reboot the board to apply the change.
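
The steps above could look roughly like the sketch below. It assumes that the watchdog node is named watchdog@30c0000 (taken from the path quoted earlier in this thread) and that its status is currently "disabled"; `enable_dt_node` and the /tmp file names are made up for illustration, and kernel_XXX.dtb stands for the actual file name on your board. Back up the original dtb before overwriting it.

```shell
# enable_dt_node NODE: read a decompiled device tree (dts) on stdin and write
# it to stdout with status = "okay" inside the given node.
# Only lines from "NODE {" up to the next "};" are touched.
enable_dt_node() {
    sed "/$1 {/,/};/ s/status = \"disabled\"/status = \"okay\"/"
}

# On the target (commented out so nothing runs by accident):
#   dtc -I dtb -O dts -o /tmp/kernel.dts /boot/dtb/kernel_XXX.dtb
#   enable_dt_node 'watchdog@30c0000' < /tmp/kernel.dts > /tmp/kernel_wdt.dts
#   sudo dtc -I dts -O dtb -o /boot/dtb/kernel_XXX.dtb /tmp/kernel_wdt.dts
#   sudo reboot
```

Editing the dts by hand in a text editor works just as well; the sed filter only makes the change repeatable across several boards.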