Xavier NX A/B Failover

Hi,

We are deploying an IoT-style device built around an NVIDIA Jetson Xavier NX, and are now finalising the OTA upgrade process for it.

To produce an update, we build a rootfs, kernel, and bootloader (R35.3.1), and then use the NVIDIA tools to generate a BUP package. We use an A/B partition layout, with NVMe as the primary storage device.
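
For reference, the BUP generation step on the host looks roughly like the following. The script name and the ROOTFS_AB variable are as I remember them from the R35 tools, so please treat the exact invocation as an assumption and check it against your Linux_for_Tegra tree:

cd Linux_for_Tegra
# t19x covers the Xavier NX; ROOTFS_AB=1 should match our A/B rootfs layout
sudo ROOTFS_AB=1 ./l4t_generate_soc_bup.sh t19x
# payloads should land under bootloader/payloads_t19x/ (directory name from memory)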

Generally, we have a working update system, using the NVIDIA tools to build and run the update. However, we have been doing some testing of the update process to ensure reliable in-the-field behaviour.

This testing has thrown up a couple of issues that I’m hoping you can help with.

Failover not working as expected

To test that the A/B failover mechanism works as expected I ran a simple test. I programmed a device with our system image. Using nvbootctrl I then switched between the A/B partitions and rebooted, to ensure that both partitions are fully bootable. This worked fine. I then wanted to test that the A/B failover works correctly by destroying one of the rootfs partitions.
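
For reference, the slot switching was done roughly as follows (please double-check the exact verbs against nvbootctrl -h on your release):

# check which rootfs slot we are currently running from
user@instrument:~$ sudo nvbootctrl -t rootfs get-current-slot
# make slot B (slot 1) the active rootfs slot, then reboot into it
user@instrument:~$ sudo nvbootctrl -t rootfs set-active-boot-slot 1
user@instrument:~$ sudo reboot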

The block devices that are available on the running system are listed below:

user@instrument:~$ sudo blkid  
/dev/nvme0n1p1: UUID="0b4b8c2d-ddda-4c36-83bd-3d9444d79f5b" TYPE="ext4" PARTLABEL="APP" PARTUUID="26dde615-02f0-4dc3-aeb1-11a790298e62"  
/dev/nvme0n1p2: UUID="14ae0626-3b6d-4edf-8ae5-b9e06c38b4c9" TYPE="ext4" PARTLABEL="APP_b" PARTUUID="ed669db7-bee9-4cad-8b3f-ede9dc754a35"  
/dev/nvme0n1p3: PARTLABEL="kernel" PARTUUID="50c893a6-ee2e-42c7-8e49-650743d0af50"  
/dev/nvme0n1p4: PARTLABEL="kernel-dtb" PARTUUID="28eb2b0f-458b-4309-bc80-c046d91f8e58"  
/dev/nvme0n1p5: PARTLABEL="reserved_for_chain_A_user" PARTUUID="603c7b37-42b9-465f-a7e9-e64de5ee7352"  
/dev/nvme0n1p6: PARTLABEL="kernel_b" PARTUUID="14f793db-63dc-4d27-bf71-57165395bb43"  
/dev/nvme0n1p7: PARTLABEL="kernel-dtb_b" PARTUUID="63f603ae-42ce-46b1-a222-bf181928a640"  
/dev/nvme0n1p8: PARTLABEL="reserved_for_chain_B_user" PARTUUID="1a79c00c-d3fc-4c97-9ca6-e87c7b587145"  
/dev/nvme0n1p9: PARTLABEL="recovery" PARTUUID="696b0312-f4f4-4a72-9216-ac24f90d7670"  
/dev/nvme0n1p10: PARTLABEL="recovery-dtb" PARTUUID="118e8e0f-49ec-4f8e-b41e-cd50b5215762"  
/dev/nvme0n1p11: PARTLABEL="RECROOTFS" PARTUUID="5256381a-6742-4832-b8f1-06332a63417b"  
/dev/nvme0n1p12: UUID="5CE5-D962" TYPE="vfat" PARTLABEL="esp" PARTUUID="4b3bacce-7275-49c7-8383-cf530528782b"  
/dev/nvme0n1p13: PARTLABEL="recovery_alt" PARTUUID="5026b52e-6cea-41b6-aa16-ec7d09491e65"  
/dev/nvme0n1p14: PARTLABEL="recovery-dtb_alt" PARTUUID="1eddd0c6-8869-4443-9cde-d92874d4d302"  
/dev/nvme0n1p15: PARTLABEL="esp_alt" PARTUUID="1af4cb38-00ff-4199-8dfc-7943448b6e35"  
/dev/nvme0n1p16: UUID="23f3456a-0a05-411e-8971-d7011cd1607a" TYPE="ext4" PARTLABEL="UDA" PARTUUID="3e30d4fb-a32a-4062-bfe3-df7a0dd89b27"  
/dev/mmcblk0p1: UUID="010bb079-6311-451f-b0dc-982085810677" TYPE="ext4" PARTLABEL="APP" PARTUUID="6419e22e-32a4-4e17-928a-0a45769f8b72"  
/dev/mmcblk0p2: UUID="ad7ebab2-ccce-45ac-8144-20409cb1b18b" TYPE="ext4" PARTLABEL="APP_b" PARTUUID="167d1e89-ccf2-4270-9417-8b3a7687a940"  
/dev/mmcblk0p3: PARTLABEL="kernel" PARTUUID="6f9a86e7-b24a-43ed-9035-1c677c65bd25"  
/dev/mmcblk0p4: PARTLABEL="kernel-dtb" PARTUUID="28f8d7f2-7fbb-4e4f-978b-2c19be5de02d"  
/dev/mmcblk0p5: PARTLABEL="reserved_for_chain_A_user" PARTUUID="3033a786-e8c9-4e62-b4ee-8b5058673205"  
/dev/mmcblk0p6: PARTLABEL="secure-os_b" PARTUUID="0a5e9367-041a-4447-89bb-8c19c2efd777"  
/dev/mmcblk0p7: PARTLABEL="eks_b" PARTUUID="334b5db4-97d2-4607-8384-3657bdb17d03"  
/dev/mmcblk0p8: PARTLABEL="adsp-fw_b" PARTUUID="1b8cafa5-37b6-4c48-9f9d-0a37d391a67f"  
/dev/mmcblk0p9: PARTLABEL="rce-fw_b" PARTUUID="7a5f6a5a-2871-4c15-8931-3272e388dc10"  
/dev/mmcblk0p10: PARTLABEL="sce-fw_b" PARTUUID="5e85f563-489d-4cbd-9a10-86514b7c204e"  
/dev/mmcblk0p11: PARTLABEL="bpmp-fw_b" PARTUUID="30aafae8-462a-48a2-87e1-dd73dad2a359"  
/dev/mmcblk0p12: PARTLABEL="bpmp-fw-dtb_b" PARTUUID="16f1c5e5-6d5f-4d0a-9930-84076c6d2547"  
/dev/mmcblk0p13: PARTLABEL="kernel_b" PARTUUID="4b6d5628-1ecd-4810-84d4-574c8fe9cb55"  
/dev/mmcblk0p14: PARTLABEL="kernel-dtb_b" PARTUUID="0c5722e7-8fce-45e4-91d9-a34d9c80a23f"  
/dev/mmcblk0p15: PARTLABEL="reserved_for_chain_B_user" PARTUUID="7bec27a0-5d95-44da-9932-204345d77817"  
/dev/mmcblk0p16: PARTLABEL="recovery" PARTUUID="2122954b-cff8-4a2a-9869-1f17a5ff811b"  
/dev/mmcblk0p17: PARTLABEL="recovery-dtb" PARTUUID="763ff869-9a62-4951-8888-5e2ccdedc554"  
/dev/mmcblk0p18: PARTLABEL="RECROOTFS" PARTUUID="360ee2ff-98e2-4de4-986a-e622e7ddb966"  
/dev/mmcblk0p19: UUID="5755-DAD7" TYPE="vfat" PARTLABEL="esp" PARTUUID="3686df0d-4bdf-46c4-82b0-5d40f2a4784d"  
/dev/mmcblk0p20: PARTLABEL="recovery_alt" PARTUUID="23ceb93e-e15b-47e1-9e12-9e14670f3c6f"  
/dev/mmcblk0p21: PARTLABEL="recovery-dtb_alt" PARTUUID="1ff20028-e723-40f5-b6f8-07450f23492c"  
/dev/mmcblk0p22: PARTLABEL="esp_alt" PARTUUID="46da76f1-d248-42ab-aba3-eb6b919ec642"  
/dev/mmcblk0p23: PARTLABEL="UDA" PARTUUID="37862fdd-d604-4f0b-9675-3f5a28c5a858"  
/dev/loop0: SEC_TYPE="msdos" LABEL_FATBOOT="L4T-README" LABEL="L4T-README" UUID="1234-ABCD" TYPE="vfat"

With the device booted into the “A” partition:

user@instrument:~$ sudo nvbootctrl dump-slots-info  
Current version: 35.3.1  
Capsule update status: 1  
Current bootloader slot: A  
Active bootloader slot: A  
num_slots: 2  
slot: 0,             status: normal  
slot: 1,             status: normal  
user@instrument:~$ sudo nvbootctrl -t rootfs dump-slots-info
Current rootfs slot: A  
Active rootfs slot: A  
num_slots: 2  
slot: 0,             retry_count: 3,            status: normal  
slot: 1,             retry_count: 3,            status: normal

I then destroyed the “APP_b” partition by writing zeros to it:

user@instrument:~$ sudo dd if=/dev/zero of=/dev/nvme0n1p2 bs=1M
dd: error writing '/dev/nvme0n1p2': No space left on device  
20481+0 records in  
20480+0 records out  
21474836480 bytes (21 GB, 20 GiB) copied, 61.7725 s, 348 MB/s  
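
For completeness, the wipe can be confirmed with something like:

# blkid should no longer report TYPE="ext4" for the wiped partition
user@instrument:~$ sudo blkid /dev/nvme0n1p2
# and the start of the device should read back as zeros
user@instrument:~$ sudo head -c 4096 /dev/nvme0n1p2 | xxd | head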

I then switched to the B slot with nvbootctrl, as above, so that the machine would attempt to boot the broken partition, and rebooted.

My expectation would be that the bootloader attempts to boot the destroyed B partition, fails to do so, and then switches the boot chain to the A partition.

Instead, L4TLauncher attempts to read extlinux.conf from partition B, fails to find it, and then falls back to booting the recovery partition. This is not a useful mechanism for us, because while booted into the recovery partition our OTA and remote-access tools cannot correct the fault.

A copy of the serial console log during this process is attached:
fail_log2.txt (89.7 KB)

I modified the L4TLauncher component of the UEFI bootloader to disable the fallback boot to the recovery partition. To do so, I commented out the following section at the end of Silicon/NVIDIA/Application/L4TLauncher/L4TLauncher.c:

   // Not in else to allow fallback
   if (BootParams.BootMode == NVIDIA_L4T_BOOTMODE_BOOTIMG) {
     ErrorPrint (L"%a: Attempting Kernel Boot\r\n", __FUNCTION__);
     Status = BootAndroidStylePartition (LoadedImage->DeviceHandle, BOOTIMG_BASE_NAME, BOOTIMG_DTB_BASE_NAME, &BootParams);
     if (EFI_ERROR (Status)) {
       ErrorPrint (L"Failed to boot %s:%d partition\r\n", BOOTIMG_BASE_NAME, BootParams.BootChain);
     }
   } else if (BootParams.BootMode == NVIDIA_L4T_BOOTMODE_RECOVERY) {
     ErrorPrint (L"%a: Attempting Recovery Boot\r\n", __FUNCTION__);
     Status = BootAndroidStylePartition (LoadedImage->DeviceHandle, RECOVERY_BASE_NAME, RECOVERY_DTB_BASE_NAME, &BootParams);
     if (EFI_ERROR (Status)) {
       ErrorPrint (L"Failed to boot %s:%d partition\r\n", RECOVERY_BASE_NAME, BootParams.BootChain);
     }
   }

While this change prevented the UEFI firmware from booting into the recovery partition, the firmware instead fell through to attempting a network boot, and hangs there forever. So we are still no closer to a working A/B failover.

What mechanism do I need to use to make the UEFI bootloader attempt to boot only from the NVMe disk, and perform an A/B failover if one of the partitions fails to boot?
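
For example, I would be happy with something along the lines of the standard efibootmgr workflow, assuming the Jetson UEFI honours BootOrder like a stock EDK2 build (the entry numbers below are illustrative only):

# list the current boot entries and the boot order
user@instrument:~$ sudo efibootmgr -v
# move the NVMe entry to the front of BootOrder (0001/0000 are examples)
user@instrument:~$ sudo efibootmgr -o 0001,0000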

To me it sounds like the mechanism described in Rootfs A/B redundancy fail-over mechanism in Jetpack5.1 relies on a kernel panic/watchdog timer reset to mark the boot as a failure. But since we either end up in a running recovery OS or stuck forever in a network boot, we will never actually fail over properly.

This also looks to be different behaviour from L4T 5.1 reboot loop after enabling watchdog with RootFS A/B - #11 by sanaurrehman, because we do not loop forever; we just reboot once into the recovery OS.

Marking boot as successful

My second question is about the mechanism for marking a boot chain as “bootable” or “damaged”. The documentation in Root File System — Jetson Linux Developer Guide documentation has the following statement:

If the current rootfs fails to boot a specified number of times, cpu-bootloader marks its Status attribute and switches the roles of the current and unused rootfs slots. If both root file systems are unbootable, the device tries to boot from the recovery kernel image.

What is the mechanism that cpu-bootloader uses to determine whether a boot chain is successful? Is it as simple as assuming that if the kernel has started, the boot chain is okay? From our point of view, the only “safe” point at which we can mark a boot chain as having booted properly is once Linux userspace is fully started and the third-party OTA management service we use is running. Is it possible to have userspace tools perform this “marking successful” rather than cpu-bootloader?
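
To illustrate, what I would like is to run something like the following from a systemd unit ordered after our OTA service. The mark-boot-successful verb is an assumption on my part; I have not confirmed that nvbootctrl exposes it, so please check nvbootctrl -h:

# run only once the OTA management service is confirmed healthy
# ASSUMPTION: a "mark-boot-successful" verb exists; verify with nvbootctrl -h
user@instrument:~$ sudo nvbootctrl -t rootfs mark-boot-successful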

Please let me know if there is any additional information that you need to help answer this question.

Hi bgillatt,

Are you using the devkit or a custom board for the Xavier NX?
Could you also verify with the latest R35.4.1?

It seems you could boot from slot B, but mounting the rootfs failed.

[0001.357] I> Active Boot chain : 1
..
[   33.353439] ERROR: PARTUUID=ed669db7-bee9-4cad-8b3f-ede9dc754a35 mount fail...

Are you referring to the steps in that link that destroy /lib?

It is determined by a scratch register.
Please refer to the “The methods to restore corrupted rootfs slot” section in Rootfs A/B redundancy fail-over mechanism in Jetpack5.1

KevinFFF, thank you for your response.

We are using a custom board with our Xavier NX.

I can try the R35.4.1 release soon, but is there a substantial change in the way the bootloader works? I cannot see anything in https://docs.nvidia.com/jetson/archives/r35.4.1/ReleaseNotes/Jetson_Linux_Release_Notes_r35.4.1.pdf that suggests a resolution to this.

It seems you could boot from slot B, but mounting the rootfs failed.

No – the slot B rootfs has been completely erased in my test. This simulates a fault in the filesystem that prevents the rootfs from being mounted at all.

Are you referring to the steps in that link that destroy /lib?

Yes, I saw those steps, but I believe the test described on that page is not sufficient to show that the A/B failover is reliable. In the described case, where ONLY the /lib part of the rootfs is damaged, the bootloader can still mount the partition, read extlinux.conf, and start the kernel. The kernel then fails to initialise properly, the watchdog timer is triggered, and the slot is marked as a failure.

However, in the real world, there is absolutely the possibility that some other part of the rootfs partition is damaged in a way that makes the filesystem unreadable at all. That is the situation my test simulates, and it causes the problem I described in my original message.

So again, is there a way to configure the Jetson bootloader to mark a slot as a failure if it cannot read the rootfs, rather than attempting a recovery boot or network boot? If this is not possible, this is a serious bug in the way the bootloader has been designed that has a real chance of causing bricked devices in the field.

It is determined by a scratch register.
Please refer to the “The methods to restore corrupted rootfs slot” section in Rootfs A/B redundancy fail-over mechanism in Jetpack5.1

Yes, that makes sense. The documentation in Update and Redundancy — Jetson Linux Developer Guide documentation suggests that bits 25:22 of scratch register 15 indicate to the bootloader whether the current boot slot is bootable. But what I would like to know is: which parts of the NVIDIA bootloader and userspace are responsible for setting these bits? Does the bootloader set them if the kernel hits a watchdog timeout? Are there additional NVIDIA userspace tools that reset these bits on a successful boot?
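
In case it helps, this is how I have been inspecting that register from Linux. The address below is a placeholder, as I have not confirmed the SCRATCH_15 offset for the Xavier SoC, so treat it as an assumption and look it up in the TRM:

# ASSUMPTION: placeholder address; look up the real SCRATCH_15 offset in the TRM
user@instrument:~$ SCRATCH_ADDR=0x0c2e0000
# read the 32-bit register (busybox devmem applet, or the devmem2 tool)
user@instrument:~$ sudo busybox devmem ${SCRATCH_ADDR}
# bits 25:22 are documented as the rootfs slot status fields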

Maybe you could refer to the following thread about the mechanism of our retry count.
Once the retry count reaches zero, it marks the partition as unbootable and switches to the other slot.
Reset Timing of Boot Retry Count - #7 by KevinFFF

Verifying with the latest R35.4.1 is to align with our current status.
Your board is booting from slot B (the bootloader part), but mounting the rootfs from slot B failed.
It may be caused by erasing the whole rootfs.
The fail-over mechanism is normally triggered by a kernel panic that causes the boot to fail, rather than by the rootfs not existing at all.
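
If you want to exercise the documented fail-over path, you could trigger a kernel panic on the current slot instead of erasing the rootfs, for example:

# enable sysrq, then force a kernel crash; the panic/watchdog path should
# then decrement the retry count for the current slot on subsequent boots
user@instrument:~$ echo 1 | sudo tee /proc/sys/kernel/sysrq
user@instrument:~$ echo c | sudo tee /proc/sysrq-trigger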

Are you using the Xavier NX with an eMMC or SD module?

KevinFFF,

Thanks again for your support. That thread explaining the retry mechanism was very useful.

I had been struggling to build the bootloader for the latest R35.4.1 release. Our normal CI build of the bootloader runs the following commands as part of the setup of the EDK-II environment:

edkrepo manifest-repos add nvidia https://github.com/NVIDIA/edk2-edkrepo-manifest.git main nvidia
edkrepo clone nvidia-uefi NVIDIA-Jetson r35.3.1

However, updating the version to R35.4.1 as follows does not work:

edkrepo manifest-repos add nvidia https://github.com/NVIDIA/edk2-edkrepo-manifest.git main nvidia
edkrepo clone nvidia-uefi NVIDIA-Jetson r35.4.1

Unfortunately, this fails to obtain the sources from the manifest repository:

Cloning global manifest repository to: /home/user/.edkrepo/edk2-edkrepo-manifest-main from: https://github.com/tianocore/edk2-edkrepo-manifest.git  
Cloning global manifest repository to: /home/user/.edkrepo/nvidia from: https://github.com/NVIDIA/edk2-edkrepo-manifest.git  
Verifying the global manifest repository entry for project: NVIDIA-Jetson

Error: The selected COMBINATION/SHA is not present in the project manifest file or does not exist.
Exiting without performing clone operation.

It appears that the manifest repository at edk2-edkrepo-manifest/edk2-nvidia/Jetson/NVIDIAJetsonManifest.xml at main · NVIDIA/edk2-edkrepo-manifest · GitHub is missing the relevant entries for the R35.4.1 release. It looks like the necessary changes have been languishing in a third-party pull request here: Add missing Jetson manifest versions for r35.4.1 by eh-steve · Pull Request #20 · NVIDIA/edk2-edkrepo-manifest · GitHub since August last year.

If I replace the NVIDIA manifest repo with the repo from “eh-steve”, the source of the pull request:

edkrepo manifest-repos remove nvidia
edkrepo manifest-repos add nvidia https://github.com/eh-steve/edk2-edkrepo-manifest ffs-nvidia nvidia
edkrepo clone nvidia-uefi NVIDIA-Jetson r35.4.1

the checkout works, and the UEFI bootloader can be built. May I suggest that this pull request be accepted into the NVIDIA repository?

I will repeat the same test with the R35.4.1 bootloader, kernel, and rootfs on Monday and report back on how it goes.

But back to the A/B issue.

It may be caused by erasing the whole rootfs.

Yes, it is absolutely because I erased the whole rootfs. I did so to emulate in-the-field filesystem corruption, to test the robustness of the A/B failover. A filesystem becoming corrupted to the point of being unmountable is entirely possible in the field, and is exactly the situation an A/B failover mechanism should be robust enough to recover from. If this is not considered a bug, then the A/B mechanism is fundamentally broken by design.

Can you tell me whether this scenario (corrupted filesystem) is something that NVIDIA will be fixing in the future? If not, our product (and any product with a Xavier NX) may fail in the field with no way to recover it. This is a serious risk for us because our product may not be serviceable once deployed.

We are using a Xavier NX with eMMC and NVMe as external storage.

Please use NVIDIA-Platforms instead of NVIDIA-Jetson in this command to clone the UEFI source:

$ edkrepo clone nvidia-uefi NVIDIA-Platforms r35.4.1

I’ve checked this issue internally; there is a bug where the fail-over mechanism does not work when the rootfs mount fails during boot.
Please wait for a later release to get it fixed.

Hello,

I was checking the release notes (https://docs.nvidia.com/jetson/archives/r36.2/ReleaseNotes/Jetson_Linux_Release_Notes_r36.2.pdf) for the JetPack 6.0.0 Developer Preview and couldn’t see anything about the fail-over mechanism not working when the rootfs mount fails during boot (the bug you confirmed in your message above).
I can see you are planning to do a final production release by March 2024 for JetPack 6.0.0.
As this is a production-level feature that is quite important for reliable OTA upgrades in the field, would it be possible to know whether the fail-over mechanism works when the rootfs mount fails during boot in JetPack 6.0.0?

Thank you.
Regards.

What we discussed above is for JP5, and it seems it has also been fixed in JP6.
Please wait for the JP6 GA release.