We have a fleet of more than 1000 AGX Xaviers, currently on JP4 with A/B rootfs enabled, that we need to upgrade to JP5.
We have already remotely upgraded many of them from a 32.6.1 to a 32.7.1 BSP through an OTA A/B upgrade. Now we need to upgrade to BSP 35.3.1 or 35.4.1 (the carrier board manufacturer only supports 35.3.1; we might add our own 35.4.1 support if time permits). The devices are all at remote locations and we have no physical access to them, only remote access, which is why a robust A/B mechanism with rollback is necessary.
So far our attempts at an OTA upgrade on an AGX Xavier DevKit, meant to validate the concept, have been unsuccessful.
There are 2 problems:
The Xavier accepts the OTA payload and tries to reboot into recovery mode to install it. It fails to enter recovery mode (init not found) and is then stuck and unusable until it is put into forced recovery mode and reflashed from scratch.
The reboot into recovery is not acceptable: it is incompatible with our update software and with our robustness and rollback requirements. We need to flash the inactive partition from a running system and then reboot directly into the new system, without going through recovery mode. This works fine for rootfs upgrades within JP4 (to a different or identical version) and likewise within JP5. We need the same behavior for a JP4 to JP5 upgrade. The problem is that going from JP4 to JP5 there is a layout change. Would it be possible to change the JP5 layout to match the JP4 layout, so that the upgrade needs no layout change? Most partitions are the same, just in a different order. I must say the JP5 layout looks saner and more future-proof for later upgrades.
Are you using AGX Xavier with rootfs in internal eMMC or external NVMe?
There has to be a layout change from JP4 to JP5 since the SW architecture and stack are different.
I would suggest you verify the overall process on the devkit first.
Could you share the logs from generating the OTA package and from performing the OTA update?
That’s very unfortunate, as it defeats the whole purpose of A/B: a robust, fail-safe upgrade path including rollback.
Currently for JP4→JP4 as well as JP5→JP5 upgrades we have a very robust process. If anything during the flashing fails the running partition is not altered and the system continues to work.
If the flash succeeds but for some reason doesn’t boot, the rollback mechanism brings back a sane state with everything running in the previous version.
Currently it fails during the recovery step, leaving the whole system bricked until reflashed from scratch. Rollback doesn’t work.
That’s exactly what we are doing now as the carrier board manufacturer doesn’t support OTA upgrades at all and there might be additional surprises (we solved them all for JP4→JP4 upgrades already).
Yes, I’ll regenerate a clean OTA upgrade at the beginning of next week and post the generation logs as well as the logs from applying the upgrade.
The initial failure to boot into recovery, an initramfs problem, was caused by an error in the build_base_recovery_image.sh arguments: one path that should have been base_bsp was target_bsp.
As we only did JP4 to JP4 and JP5 to JP5 upgrades with A/B, we never used the recovery image, so the error went unnoticed until now (the rootfs and OTA package generation is automated). With that fixed, recovery boots.
Now we get stuck at
/init: line 68: modprobe: command not found
[ 7.605782] Root device found: initrd
[ 7.615564] hpd: switching from state 1 (Check Plug) to state 3 (Disabled)
[ 7.617736] Mount initrd as rootfs and enter recovery mode
Finding OTA work dir on external storage devices
Checking whether device /dev/mmcblk?p1 exist
Looking for OTA work directory on the device(s): /dev/mmcblk0p1
Checking whether device /dev/sd?1 exist
Device /dev/sd?1 does not exist
Checking whether device /dev/nvme?n1p1 exist
Looking for OTA work directory on the device(s): /dev/nvme0n1p1
mount /dev/nvme0n1p1 /mnt
[ 7.679236] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
is_boot_only_partition /mnt
OTA work directory /mnt/ota_work is not found on /dev/nvme0n1p1
Finding OTA work dir on internal storage device
mount /dev/mmcblk0p1 /mnt
[ 7.745623] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
is_boot_only_partition /mnt
OTA work directory /mnt/ota_work is not found on /dev/mmcblk0p1
OTA work directory is not found on internal and external storage devices
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash-4.4#
bash-4.4#
The error seems to come from the fact that in our case /ota_work on mmcblk0p1/2 is a symlink to a subdirectory of nvme0n1p1.
As we are running A/B rootfs, mmcblk0p1/2 is only half the size and too small to hold the upgrade, so we used a symlink.
From the logs it seems that the upgrade script looks only at the root of mmcblk0p1 and nvme0n1p1, so we’ll check whether there is a way to specify a subfolder to look into on nvme0n1p1 rather than the root.
Otherwise I’ll patch the update code to use an ota_work folder at the root of nvme0n1p1.
Then I’ll see how far we get.
Now that I have a shell in the recovery partition, is there a way to reboot from there into the currently installed system (boot on mmcblk0p1)? I tried to set slot 0 as bootable through nvbootctrl, but it is not available in the recovery system. As the upgrade has not yet been applied, it would be nice to be able to roll back to the working system instead of reflashing it.
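Regarding the ota_work lookup: the recovery log suggests that only the root of each candidate partition is searched for an ota_work directory. A minimal sketch of a pre-flight check that could run on the live system before rebooting, assuming that reading of the log is correct. The function name and the mount point /mnt/nvme are hypothetical and not part of the NVIDIA tools.

```python
from pathlib import PurePosixPath


def is_ota_work_at_partition_root(resolved_target: str, mount_point: str) -> bool:
    """Return True if resolved_target is exactly <mount_point>/ota_work."""
    return PurePosixPath(resolved_target) == PurePosixPath(mount_point) / "ota_work"


# On a live system the first argument could come from os.path.realpath("/ota_work")
# and the second from the mount table. The paths below are hypothetical examples.
print(is_ota_work_at_partition_root("/mnt/nvme/ota_work", "/mnt/nvme"))          # True
print(is_ota_work_at_partition_root("/mnt/nvme/updates/ota_work", "/mnt/nvme"))  # False
```

The check is purely a path comparison, so it can be tested off-device; whether the recovery script accepts any other location is not something this sketch can tell.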
Image-based OTA currently updates the unused slot.
For example, if you are booting from slot A, it will update slot B after reboot and then boot from slot B after the update. It seems you have an external NVMe connected, so you could just boot from NVMe and put your OTA update payload on it before the update.
You are in the recovery kernel. Please remove the OTA payload, reboot to the UEFI menu and select booting from the NVMe drive.
Yes, we have an NVMe but only for shared and persistent data between A/B. We don’t boot on NVMe.
We’d like to reboot on slot 0 (or slot 1) of the internal storage but it always goes into recovery:
[0008.654] I> ########## Fixed storage boot ##########
[0008.659] I> Loading kernel-bootctrl from partition
[0008.664] I> Loading partition kernel-bootctrl at 0xa0700000 from device(0x1)
[0008.677] I> A/B: bin_type (50) slot 0
[0008.677] I> Loading recovery from partition
[0008.679] I> Loading partition recovery at 0xa0700000 from device(0x1)
[0009.071] I> Validate recovery ...
[0009.071] I> T19x: Authenticate recovery (bin_type: 50), max size 0x5000000
[0009.503] I> Encryption fuse is not ON
[0009.519] I> Checking boot.img header magic ... [0009.520] I> [OK]
[0009.520] I> A/B: bin_type (51) slot 0
[0009.520] I> Loading recovery-dtb from partition
[0009.520] I> Loading partition recovery-dtb at 0x91000000 from device(0x1)
[0009.528] I> Validate recovery-dtb ...
[0009.528] I> T19x: Authenticate recovery-dtb (bin_type: 51), max size 0x400000
[0009.532] I> Encryption fuse is not ON
[0009.533] I> Kernel hdr @0xa0700000
[0009.533] I> Kernel dtb @0x91000000
[0009.536] I> decompressor handler not found
[0009.540] I> Copying kernel image (34484232 bytes) from 0xa0700800 to 0x80080000 ... [0009.556] I> Done
[0009.556] I> Move ramdisk (len: 12902618) from 0xa27e4000 to 0x92000000
[0009.561] I> Updated bpmp info to DTB
[0009.562] I> Ramdisk: Base: 0x92000000; Size: 0xc4e0da
[0009.564] I> Updated initrd info to DTB
[0009.568] W> WARN: Fail to override "console=none" in commandline
[0009.574] I> Active rootfs suffix:
[0009.577] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0009.585] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0009.592] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0009.599] I> Active slot suffix:
[0009.602] I> add_boot_slot_suffix: slot_suffix =
[0009.607] I> Linux Cmdline: console=ttyTCU0,115200 root=/dev/initrd rw rootwait console=ttyTCU0,115200n8 fbcon=map:0 net.ifnames=0 video=tegrafb no_console_suspend=1 earlycon=tegra_comb_uart,mmio32,0x0c168000 base_version=R32-6 target_board=jetson-agx-xavier-devkit video=tegrafb earlycon=tegra_comb_uart,mmio32,0x0c168000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=19.1.2.0.0 maxcpus=8 boot.slot_suffix= boot.ratchetvalues=0.4.2 vpr_resize sdhci_tegra.en_boot_part_access=1
What do you mean by removing the OTA payload, and how do I do it?
In the menu I select booting from eMMC, but it still boots into the eMMC’s recovery partition and not into the eMMC’s slot 0.
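The kernel command line in the bootloader log above has root=/dev/initrd and empty slot suffixes. A small sketch for checking this from a script, for example against the contents of /proc/cmdline. Treating root=/dev/initrd as the marker of a recovery boot is an assumption drawn from this log only, and the helper names are made up for illustration.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into key -> value.

    Bare flags map to an empty string; a repeated key keeps its last value.
    """
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def booted_into_recovery(params: dict) -> bool:
    # Assumption: root=/dev/initrd marks the recovery ramdisk, as seen in the log.
    return params.get("root") == "/dev/initrd"


# Shortened copy of the command line from the log above.
sample = ("console=ttyTCU0,115200 root=/dev/initrd rw rootwait "
          "rootfs.slot_suffix= boot.slot_suffix=")
params = parse_cmdline(sample)
print(booted_into_recovery(params), repr(params["rootfs.slot_suffix"]))  # True ''
```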
Current boot into recovery, after having changed the /ota_work symlink to the root of /dev/nvme1n1p1 instead of a subfolder:
[ 7.674774] Mount initrd as rootfs and enter recovery mode
Finding OTA work dir on external storage devices
Checking whether device /dev/mmcblk?p1 exist
Looking for OTA work directory on the device(s): /dev/mmcblk0p1
Checking whether device /dev/sd?1 exist
Device /dev/sd?1 does not exist
Checking whether device /dev/nvme?n1p1 exist
Looking for OTA work directory on the device(s): /dev/nvme0n1p1
mount /dev/nvme0n1p1 /mnt
[ 7.736020] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
is_boot_only_partition /mnt
Set rootfs=/dev/nvme0n1p1
Set dm_crypt=
OTA task runner nv_ota_run_tasks.sh is not found
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash-4.4#
It seems to look for nv_ota_run_tasks.sh under /mnt/ota_work. There must be an issue with my payload, as my /mnt/ota_work folder contains:
bash-4.4# ls /mnt/ota_work/
Linux_for_Tegra ota_work
Below that there are in fact 2 nv_ota_run_tasks.sh:
Could you perform OTA update from R32.6.1 to R35.4.1 on the devkit (for eMMC) to verify the overall workflow for image-based OTA?
Currently, image-based OTA for NVMe is not yet supported on AGX Xavier; support is planned for the next release (might be R35.5.0).
Yes, that’s exactly what we did. All our AGX Xaviers run only on eMMC and all current OTA tests are done on a DevKit. The SSD is only for persistent storage, not for the OS.
The only thing related to the NVMe on this issue is that the /ota_work folder is on the NVMe and not on the eMMC due to not enough eMMC flash size.
Is there any estimate of when 35.5.x will be released? It was originally announced for December 2023.
The line "Set rootfs=/dev/nvme0n1p1" comes from the log output of your tools.
To me it just means that your tool successfully found the ota_work directory on nvme0n1p1, which is correct and expected. Apart from that, nothing here relates to the system on the NVMe.
Yes we create the symlink from /ota_work on the eMMC to /ota_work on the NVMe. This is required as with A/B enabled we only have half of the eMMC size which is too small to hold the rootfs + the payload of the new rootfs.
No, eMMC is not an option due to the size. Everything works fine with JP4 to JP4 upgrades as well as JP5 to JP5 upgrades. There must just be a little detail to figure out why JP4 to JP5 doesn’t work.
It seems I finally found the issue: due to an error in our update scripts, the OTA payload ended up unpacked under /ota_work/ota_work on the NVMe instead of under /ota_work. That didn’t cause any trouble for JetPack 4.x to JetPack 4.x or JetPack 5.x to JetPack 5.x updates (both without layout change), but it fails for JetPack 4 to JetPack 5 because of the layout change and the intermediate reboot into recovery.
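A nested payload like this could be caught before the reboot. The following is a minimal sketch, assuming (as the recovery log suggests) that nv_ota_run_tasks.sh has to sit directly inside the ota_work directory. The function names are hypothetical and the demo runs on a temporary directory, not on a real payload.

```python
import tempfile
from pathlib import Path

RUNNER = "nv_ota_run_tasks.sh"  # script name taken from the recovery log


def find_task_runners(ota_work: Path) -> list:
    """Return all task runner scripts below ota_work, as paths relative to it."""
    return sorted(p.relative_to(ota_work).as_posix() for p in ota_work.rglob(RUNNER))


def payload_is_at_root(ota_work: Path) -> bool:
    """True only if the task runner sits directly in ota_work, not in a nested copy."""
    return (ota_work / RUNNER).is_file()


# Demo: recreate the broken layout ota_work/ota_work/nv_ota_run_tasks.sh.
with tempfile.TemporaryDirectory() as tmp:
    ota_work = Path(tmp) / "ota_work"
    nested = ota_work / "ota_work"
    nested.mkdir(parents=True)
    (nested / RUNNER).write_text("#!/bin/bash\n")
    print(payload_is_at_root(ota_work))  # False
    print(find_task_runners(ota_work))   # ['ota_work/nv_ota_run_tasks.sh']
```

An update script could refuse to trigger the reboot into recovery when payload_is_at_root returns False, which keeps the running slot untouched.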
If your OTA payload is larger than the rootfs partition on your board, there may also not be enough space to flash it. In your case, I would suggest flashing the NVMe and using it as rootfs for more storage.
The upgrade from JetPack 4 to JetPack 5 includes a partition layout change.
Thanks for your reply. The OTA payload is not too big to be flashed onto the partition; the flash is just not big enough to hold the currently installed rootfs plus the OTA payload of the new rootfs.
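The space constraint can be checked up front on whichever filesystem is meant to hold the payload. A rough sketch using only the Python standard library; the 10% headroom and the sizes in the demo are illustrative assumptions, not measurements from a device.

```python
import shutil

GIB = 1024 ** 3


def free_space(path: str) -> int:
    """Free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def has_room_for_payload(free_bytes: int, payload_bytes: int,
                         headroom: float = 0.10) -> bool:
    """True if the payload plus a safety headroom fits into the free space."""
    return free_bytes >= int(payload_bytes * (1.0 + headroom))


# Illustrative numbers only: a half-size rootfs slot versus a large data disk.
print(has_room_for_payload(free_bytes=3 * GIB, payload_bytes=6 * GIB))    # False
print(has_room_for_payload(free_bytes=100 * GIB, payload_bytes=6 * GIB))  # True
```

On a device, free_space would be called on the resolved target of /ota_work, so the answer reflects the NVMe rather than the rootfs slot.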
Thanks for guiding us in the right direction; the problem did indeed come from the /ota_work symlink and the underlying directory structure.
The first error was that the /ota_work symlink pointed to a subfolder rather than to the root of the NVMe. The second error was that, even with the target at the root of the NVMe, the extracted payload was in a subdirectory. This works fine for upgrades without a layout change but won’t work for an upgrade with a layout change.
Once the /ota_work symlink properly points to an ota_work directory at the root of the NVMe and the payload sits directly at the root of that folder, applying the update works fine and the device reboots into JetPack 5.
Now we face a new issue: partition B’s extlinux.conf contains the PARTUUID of partition A as root. I’ll open a new thread for this specific issue here.
Thanks for your guidance to identify the problems so far.
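For the extlinux.conf issue mentioned above, a small sketch of how the mismatch could be detected from a script: it collects the root= value of every APPEND line and compares it with the PARTUUID expected for that slot. The sample file content and both PARTUUID values are invented placeholders, and the function names are hypothetical.

```python
def roots_in_extlinux(conf_text: str) -> list:
    """Return the root= values found on APPEND lines, in file order."""
    roots = []
    for line in conf_text.splitlines():
        fields = line.strip().split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if fields[0].upper() == "APPEND":
            roots += [f[len("root="):] for f in fields[1:] if f.startswith("root=")]
    return roots


def wrong_roots(conf_text: str, expected_root: str) -> list:
    """Return every root= value that differs from the expected one."""
    return [r for r in roots_in_extlinux(conf_text) if r != expected_root]


# Placeholder content: slot B's file still pointing at slot A's (made-up) PARTUUID.
SAMPLE = """\
DEFAULT primary
LABEL primary
      LINUX /boot/Image
      APPEND ${cbootargs} root=PARTUUID=11111111-aaaa-bbbb-cccc-000000000001 rw rootwait
"""
EXPECTED_B = "PARTUUID=11111111-aaaa-bbbb-cccc-000000000002"  # made-up value

print(wrong_roots(SAMPLE, EXPECTED_B))
# ['PARTUUID=11111111-aaaa-bbbb-cccc-000000000001']
```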