Hi,
We are in the process of deploying an IoT-style device based on an NVIDIA Jetson Xavier NX, and we are currently finalising the OTA upgrade process for this device.
To produce our update, we build a rootfs, kernel and bootloader (R35.3.1) and then use the NVIDIA tools to build a .BUP package. We use an A/B partition layout, with NVMe as the primary storage device.
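For context, the BUP is generated with the stock L4T tooling; our build wrapper runs roughly the following from the Linux_for_Tegra directory (board-specific environment variables and our own scripting are omitted here):
cd Linux_for_Tegra
# Xavier NX is a t19x SoC; this produces the bootloader update payload we feed into our OTA
sudo ./l4t_generate_soc_bup.sh t19x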
Generally we have a working update system, using the NVIDIA tools to build and apply the update. However, while testing the update process to make sure it behaves reliably in the field, we have run into a couple of issues that I’m hoping you can help with.
Failover not working as expected
To test that the A/B failover mechanism works as expected, I ran a simple test. I programmed a device with our system image, then used nvbootctrl to switch between the A and B partitions and rebooted, to confirm that both partitions are fully bootable. This worked fine. I then wanted to verify that the A/B failover works correctly by destroying one of the rootfs partitions.
The block devices that are available on the running system are listed below:
user@instrument:~$ sudo blkid
/dev/nvme0n1p1: UUID="0b4b8c2d-ddda-4c36-83bd-3d9444d79f5b" TYPE="ext4" PARTLABEL="APP" PARTUUID="26dde615-02f0-4dc3-aeb1-11a790298e62"
/dev/nvme0n1p2: UUID="14ae0626-3b6d-4edf-8ae5-b9e06c38b4c9" TYPE="ext4" PARTLABEL="APP_b" PARTUUID="ed669db7-bee9-4cad-8b3f-ede9dc754a35"
/dev/nvme0n1p3: PARTLABEL="kernel" PARTUUID="50c893a6-ee2e-42c7-8e49-650743d0af50"
/dev/nvme0n1p4: PARTLABEL="kernel-dtb" PARTUUID="28eb2b0f-458b-4309-bc80-c046d91f8e58"
/dev/nvme0n1p5: PARTLABEL="reserved_for_chain_A_user" PARTUUID="603c7b37-42b9-465f-a7e9-e64de5ee7352"
/dev/nvme0n1p6: PARTLABEL="kernel_b" PARTUUID="14f793db-63dc-4d27-bf71-57165395bb43"
/dev/nvme0n1p7: PARTLABEL="kernel-dtb_b" PARTUUID="63f603ae-42ce-46b1-a222-bf181928a640"
/dev/nvme0n1p8: PARTLABEL="reserved_for_chain_B_user" PARTUUID="1a79c00c-d3fc-4c97-9ca6-e87c7b587145"
/dev/nvme0n1p9: PARTLABEL="recovery" PARTUUID="696b0312-f4f4-4a72-9216-ac24f90d7670"
/dev/nvme0n1p10: PARTLABEL="recovery-dtb" PARTUUID="118e8e0f-49ec-4f8e-b41e-cd50b5215762"
/dev/nvme0n1p11: PARTLABEL="RECROOTFS" PARTUUID="5256381a-6742-4832-b8f1-06332a63417b"
/dev/nvme0n1p12: UUID="5CE5-D962" TYPE="vfat" PARTLABEL="esp" PARTUUID="4b3bacce-7275-49c7-8383-cf530528782b"
/dev/nvme0n1p13: PARTLABEL="recovery_alt" PARTUUID="5026b52e-6cea-41b6-aa16-ec7d09491e65"
/dev/nvme0n1p14: PARTLABEL="recovery-dtb_alt" PARTUUID="1eddd0c6-8869-4443-9cde-d92874d4d302"
/dev/nvme0n1p15: PARTLABEL="esp_alt" PARTUUID="1af4cb38-00ff-4199-8dfc-7943448b6e35"
/dev/nvme0n1p16: UUID="23f3456a-0a05-411e-8971-d7011cd1607a" TYPE="ext4" PARTLABEL="UDA" PARTUUID="3e30d4fb-a32a-4062-bfe3-df7a0dd89b27"
/dev/mmcblk0p1: UUID="010bb079-6311-451f-b0dc-982085810677" TYPE="ext4" PARTLABEL="APP" PARTUUID="6419e22e-32a4-4e17-928a-0a45769f8b72"
/dev/mmcblk0p2: UUID="ad7ebab2-ccce-45ac-8144-20409cb1b18b" TYPE="ext4" PARTLABEL="APP_b" PARTUUID="167d1e89-ccf2-4270-9417-8b3a7687a940"
/dev/mmcblk0p3: PARTLABEL="kernel" PARTUUID="6f9a86e7-b24a-43ed-9035-1c677c65bd25"
/dev/mmcblk0p4: PARTLABEL="kernel-dtb" PARTUUID="28f8d7f2-7fbb-4e4f-978b-2c19be5de02d"
/dev/mmcblk0p5: PARTLABEL="reserved_for_chain_A_user" PARTUUID="3033a786-e8c9-4e62-b4ee-8b5058673205"
/dev/mmcblk0p6: PARTLABEL="secure-os_b" PARTUUID="0a5e9367-041a-4447-89bb-8c19c2efd777"
/dev/mmcblk0p7: PARTLABEL="eks_b" PARTUUID="334b5db4-97d2-4607-8384-3657bdb17d03"
/dev/mmcblk0p8: PARTLABEL="adsp-fw_b" PARTUUID="1b8cafa5-37b6-4c48-9f9d-0a37d391a67f"
/dev/mmcblk0p9: PARTLABEL="rce-fw_b" PARTUUID="7a5f6a5a-2871-4c15-8931-3272e388dc10"
/dev/mmcblk0p10: PARTLABEL="sce-fw_b" PARTUUID="5e85f563-489d-4cbd-9a10-86514b7c204e"
/dev/mmcblk0p11: PARTLABEL="bpmp-fw_b" PARTUUID="30aafae8-462a-48a2-87e1-dd73dad2a359"
/dev/mmcblk0p12: PARTLABEL="bpmp-fw-dtb_b" PARTUUID="16f1c5e5-6d5f-4d0a-9930-84076c6d2547"
/dev/mmcblk0p13: PARTLABEL="kernel_b" PARTUUID="4b6d5628-1ecd-4810-84d4-574c8fe9cb55"
/dev/mmcblk0p14: PARTLABEL="kernel-dtb_b" PARTUUID="0c5722e7-8fce-45e4-91d9-a34d9c80a23f"
/dev/mmcblk0p15: PARTLABEL="reserved_for_chain_B_user" PARTUUID="7bec27a0-5d95-44da-9932-204345d77817"
/dev/mmcblk0p16: PARTLABEL="recovery" PARTUUID="2122954b-cff8-4a2a-9869-1f17a5ff811b"
/dev/mmcblk0p17: PARTLABEL="recovery-dtb" PARTUUID="763ff869-9a62-4951-8888-5e2ccdedc554"
/dev/mmcblk0p18: PARTLABEL="RECROOTFS" PARTUUID="360ee2ff-98e2-4de4-986a-e622e7ddb966"
/dev/mmcblk0p19: UUID="5755-DAD7" TYPE="vfat" PARTLABEL="esp" PARTUUID="3686df0d-4bdf-46c4-82b0-5d40f2a4784d"
/dev/mmcblk0p20: PARTLABEL="recovery_alt" PARTUUID="23ceb93e-e15b-47e1-9e12-9e14670f3c6f"
/dev/mmcblk0p21: PARTLABEL="recovery-dtb_alt" PARTUUID="1ff20028-e723-40f5-b6f8-07450f23492c"
/dev/mmcblk0p22: PARTLABEL="esp_alt" PARTUUID="46da76f1-d248-42ab-aba3-eb6b919ec642"
/dev/mmcblk0p23: PARTLABEL="UDA" PARTUUID="37862fdd-d604-4f0b-9675-3f5a28c5a858"
/dev/loop0: SEC_TYPE="msdos" LABEL_FATBOOT="L4T-README" LABEL="L4T-README" UUID="1234-ABCD" TYPE="vfat"
With the device booted into the “A” partition:
user@instrument:~$ sudo nvbootctrl dump-slots-info
Current version: 35.3.1
Capsule update status: 1
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0, status: normal
slot: 1, status: normal
user@instrument:~$ sudo nvbootctrl -t rootfs dump-slots-info
Current rootfs slot: A
Active rootfs slot: A
num_slots: 2
slot: 0, retry_count: 3, status: normal
slot: 1, retry_count: 3, status: normal
I then destroyed the “APP_b” partition by writing zeros to it:
user@instrument:~$ sudo dd if=/dev/zero of=/dev/nvme0n1p2 bs=1M
dd: error writing '/dev/nvme0n1p2': No space left on device
20481+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 61.7725 s, 348 MB/s
I then used nvbootctrl to switch the active slot to the B partition, so that the machine would attempt to boot into the broken partition, and rebooted.
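For reference, the switch was done roughly like this (quoting from memory, so the exact invocation may differ slightly):
# make slot 1 (the B / zeroed partition) the active rootfs slot, then reboot
sudo nvbootctrl -t rootfs set-active-boot-slot 1
sudo reboot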
My expectation would be that the bootloader attempts to boot the destroyed B partition, fails to do so, and then switches the boot chain to the A partition.
Instead, L4TLauncher attempts to read extlinux.conf from partition B, fails to find it, and then falls back to booting into the recovery partition. This is not a very useful mechanism for us, because while booted into the recovery partition our OTA and remote access tools cannot correct the fault.
A copy of the serial console log during this process is attached:
fail_log2.txt (89.7 KB)
I modified the L4TLauncher component of the UEFI bootloader to disable the fallback boot to the recovery partition. To do so, I commented out the following section at the end of Silicon/NVIDIA/Application/L4TLauncher/L4TLauncher.c:
// Not in else to allow fallback
if (BootParams.BootMode == NVIDIA_L4T_BOOTMODE_BOOTIMG) {
  ErrorPrint (L"%a: Attempting Kernel Boot\r\n", __FUNCTION__);
  Status = BootAndroidStylePartition (LoadedImage->DeviceHandle, BOOTIMG_BASE_NAME, BOOTIMG_DTB_BASE_NAME, &BootParams);
  if (EFI_ERROR (Status)) {
    ErrorPrint (L"Failed to boot %s:%d partition\r\n", BOOTIMG_BASE_NAME, BootParams.BootChain);
  }
} else if (BootParams.BootMode == NVIDIA_L4T_BOOTMODE_RECOVERY) {
  ErrorPrint (L"%a: Attempting Recovery Boot\r\n", __FUNCTION__);
  Status = BootAndroidStylePartition (LoadedImage->DeviceHandle, RECOVERY_BASE_NAME, RECOVERY_DTB_BASE_NAME, &BootParams);
  if (EFI_ERROR (Status)) {
    ErrorPrint (L"Failed to boot %s:%d partition\r\n", RECOVERY_BASE_NAME, BootParams.BootChain);
  }
}
While this change prevented the UEFI firmware from booting into the recovery partition, the firmware instead fell through to attempting a network boot and hung forever waiting for it. So we are still no closer to a working A/B failover.
What mechanism do I need to use to make the UEFI bootloader attempt to boot only from the NVMe disk, and perform an A/B failover when one of the partitions fails to boot?
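For example, is pruning or reordering the UEFI boot entries from Linux the intended approach? Something along these lines (the entry numbers here are placeholders, not our real ones):
# list the current UEFI boot entries to identify the NVMe/L4TLauncher entry and any PXE entries
sudo efibootmgr -v
# placeholder: keep only the NVMe entry (Boot0001 here) in BootOrder, dropping the network entries
sudo efibootmgr -o 0001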
To me it sounds like the mechanism described in Rootfs A/B redundancy fail-over mechanism in Jetpack5.1 relies on a kernel panic or watchdog timer reset to mark the boot as a failure, but since we either end up in a running recovery OS or in a permanently stuck network boot, we will never actually fail over properly.
This also looks like different behaviour from L4T 5.1 reboot loop after enabling watchdog with RootFS A/B - #11 by sanaurrehman, because we do not reboot forever; we just reboot once into the recovery OS.
Marking boot as successful
My second question is about the mechanism for marking a boot chain as “bootable” or “damaged”. The Root File System — Jetson Linux Developer Guide documentation has the following statement:
If the current rootfs fails to boot a specified number of times, cpu-bootloader marks its Status attribute and switches the roles of the current and unused rootfs slots. If both root file systems are unbootable, the device tries to boot from the recovery kernel image.
What is the mechanism that cpu-bootloader uses to determine whether a boot chain is successful? Is it as simple as assuming that if the kernel has started, the boot chain is okay? From our point of view, the only “safe” point at which we can mark a boot chain as having booted properly is once Linux userspace is fully started and the third-party OTA management service we use is running. Is it possible to have userspace tools perform this “marking successful” rather than cpu-bootloader?
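To make the question concrete, what we would ideally run is something like the sketch below: a oneshot systemd unit that only marks the current slot as good once our OTA agent is up. The unit and service names are ours, and mark-boot-successful is a placeholder verb, not a command I know to exist in nvbootctrl:
# /etc/systemd/system/mark-boot-successful.service (sketch only)
[Unit]
Description=Mark the current rootfs slot as successfully booted
# our-ota-agent.service is our third-party OTA management service
After=our-ota-agent.service
Requires=our-ota-agent.service

[Service]
Type=oneshot
# placeholder command; this userspace hook is exactly what we are asking about
ExecStart=/usr/sbin/nvbootctrl mark-boot-successful

[Install]
WantedBy=multi-user.target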
Please let me know if there is any additional information you need to help answer these questions.