Issue that OTA works on some Jetsons but not others

Hello,

I am testing OTA updates that we want to deploy to a few hundred Jetsons, but some of the Jetsons fail to update for no obvious reason. All of them are identical Jetson AGX Orin units; I have confirmed this by reading out the EEPROM on devices that work and on those that don’t.

I initially thought it might be related to the UEFI firmware version, as the OTA is for a JetPack 5.1.2 → 5.1.5 migration, meaning UEFI 4.x → 6.x, but this doesn’t seem to be the issue as it works for some of the Jetsons.

I have some logs of the Jetson on which it worked (serial output so userspace command line + post-reboot install) and from one where it didn’t (ssh userspace log + separate serial log post reboot).

I have not been able to find any meaningful difference between the logs. The Jetsons that don’t work seem to be unable to reboot to the part of the OTA process that looks like

They then seem to attempt to boot three times and land in recovery once that fails. As far as I can tell the systems are identical and I cannot explain why some fail and some do not.

This is the log of the working Jetson
999_worked_serial3.log (246.0 KB)

And these are the logs of a non-working Jetson
357_userspace_not_working.log (10.6 KB)
357_didn_work_serial.log (75.7 KB)

The rootfs is slightly modified but nothing major, just some Debian packages for userspace utilities. The command used to generate the OTA is

cd "/workspace/target_bsp/5.1.5/Linux_for_Tegra" && BASE_BSP=/workspace/base_bsp/5.1.2/Linux_for_Tegra TARGET_BSP=/workspace/target_bsp/5.1.5/Linux_for_Tegra ./tools/ota_tools/version_upgrade/l4t_generate_ota_package.sh jetson-agx-orin-devkit R35-4

The only difference I have been able to find is that the Jetsons that don’t work report

sudo nvbootctrl dump-slots-info
[sudo] password for cartken: 
Current version: 0.0.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

as opposed to

sudo nvbootctrl dump-slots-info
[sudo] password for cartken: 
Current version: 35.4.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

on the working ones. I tried a capsule update as per the documentation, which seems to fail, I believe because 4.x UEFI doesn’t support it. I also attempted to update the UEFI, which worked but left the Jetson stuck in a boot loop, I believe due to an incompatibility between the rootfs and the UEFI. Ideally I am after a solution that does not involve flashing via USB: we have a large fleet of Jetsons that need updating, and it would be very costly if each one required human intervention.
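
Since the version line is the only visible difference, a check like the one below could be run on each unit before attempting the OTA. This is only a sketch: the function names and the SUSPECT_VERSION value are my own, and treating 0.0.1 as a marker for affected units is an assumption based on the two outputs above, not something documented by NVIDIA. It parses the text of nvbootctrl dump-slots-info from stdin.

```shell
#!/usr/bin/env bash
# Sketch: read `nvbootctrl dump-slots-info` output on stdin and flag the
# bootloader version that was observed on units where the OTA fails.

# Assumption: 0.0.1 is the value seen on failing units in this thread.
SUSPECT_VERSION="0.0.1"

# Print the value of the "Current version:" line, or nothing if it is absent.
bootloader_version() {
    awk -F': *' '/^Current version:/ { sub(/[ \t\r]+$/, "", $2); print $2; exit }'
}

# Print a one-line verdict and return:
#   0 = version looks fine, 1 = suspect version, 2 = version not found.
ota_precheck() {
    local version
    version="$(bootloader_version)"
    if [ -z "$version" ]; then
        echo "unknown"
        return 2
    fi
    if [ "$version" = "$SUSPECT_VERSION" ]; then
        echo "suspect $version"
        return 1
    fi
    echo "ok $version"
    return 0
}
```

On a device this would be used as sudo nvbootctrl dump-slots-info | ota_precheck, which prints "suspect 0.0.1" for the first output above and "ok 35.4.1" for the second.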

Scripts for the bootloader and UEFI updates are here, in case you’re after the process used
update_bootloader.txt (6.4 KB)
update_uefi.txt (3.6 KB)

I am not really sure what else could help so please let me know what information could help.

Cheers!

I have come across this post which seems to be somewhat appropriate but I am not sure I understand the process that is required. If this is indeed the issue I am facing could someone please make it a little clearer what needs to be done. We don’t encrypt anything and haven’t generated keys for anything in the past afaik.

hello alxhoff,

let me double check.. are you going to update UEFI firmware only?
could you please try Updating Jetson Linux with Image-Based Over-the-Air Update?

No, the entire system, with a custom rootfs based on the sample rootfs. The link you posted is what I followed, but it only works with some of my Jetsons. I can’t explain why, as they are all the same devices. On the Jetsons where it works, it works great. As I mentioned, some of the Jetsons don’t seem to be able to complete the update process after being rebooted; they end up in a boot loop and eventually land in recovery mode.

I mentioned this post because the output in that issue looks similar, and I am not able to explain why it doesn’t work on some devices.

I have good news in that I have found a robot at my office where the update fails, until now the failing robots were all in different countries. I have attached a full serial log of it failing.
380_serial.log (180.8 KB)

hello alxhoff,

FYI, Image-based OTA update includes updating the rootfs and updating the bootloader. the rootfs is updated before the bootloader.
according to your logs, it looks like you’ve completed updating the rootfs; the device then reboots and the UEFI updates the bootloader through UEFI capsule update.

hence..
the error is that updating the bootloader has failed.
you’ll need to create a UEFI binary with debug prints and apply it to your target to gather more details.

Where can I find information on how to do this?

Or is there anything else I could try, e.g. updating a specific partition?

hello alxhoff,

please see-also Home · NVIDIA/edk2-nvidia Wiki · GitHub to rebuild UEFI binary.
you may refer to developer guide, Flashing a Specific Partition, you’ll need to flash UEFI bootloader (A_cpu-bootloader) on Jetson AGX Orin series.

Thanks :)

I have also just managed to reproduce the behavior, and maybe this has been seen before. We have an L4T directory with the partition images prebuilt, and we flash via flash.sh with -r. This morning I was able to reproduce the failure by flashing from a “flashing station” in our server room, which is a NUC mini-PC running Ubuntu 22.04.4; this seems to put the Jetson into the state where the OTA fails. Flashing the identical set of images from my ThinkPad running 20.04 leaves the Jetson able to install OTA updates.

Has this ever come up before, that for some reason a certain machine seems to corrupt the bootloader during flashing?

This is not directly relevant, but a month ago I was trying to run l4t_initrd_flash.sh --no-flash --massflash 5 …

It failed from both 22.04 and 24.04 hosts. I thought a clean environment might help, so I ran

sudo mv JetPack_6.2_Linux_JETSON_AGX_ORIN_TARGETS original_JetPack_6.2_Linux_JETSON_AGX_ORIN_TARGETS

Then I put the AGX Orin into recovery mode, started sdkmanager and let it download and create JetPack_6.2_Linux_JETSON_AGX_ORIN_TARGETS/Linux_for_Tegra

After that I was able to successfully run and complete

sudo BOARDID=3701 FAB=501 BOARDSKU=0005 BOARDREV=G.0 CHIP_SKU=00:00:00:D0 ./tools/kernel_flash/l4t_initrd_flash.sh --no-flash --network usb0 --massflash 5 jetson-agx-orin-devkit mmcblk0p1

hello alxhoff,

we’ll need you to apply a debug version of the UEFI binary to root-cause the failure.

Hi, an update here.

It would seem that the v0.0.1 version I am seeing is the factory bootloader. For some reason our Intel NUCs (and occasionally a colleague’s laptop) cannot properly flash the bootloader. I have never had issues flashing from my own laptop. We have confirmed that it is caused by the machine itself: we have tried every combination of our system images, generating directly from the rootfs, and different Ubuntu LTS versions, and the common denominator is simply that certain machines cannot flash the bootloader. People on these forums have had issues with NUCs specifically before, but in our experience other common laptops can also have this issue.

Is there any internal knowledge of this issue and, ideally, a fix? This is very awkward for us, as we have technicians around the world whose laptops cannot flash the bootloader, which means the robot cannot be updated OTA. That is a pretty huge issue for us.

hello alxhoff,

it may be due to your host not having the zlib1g-dev python library installed.
please try installing the related python library before image flashing.

Would this explain why the exact same OS doesn’t work on one machine but does on another?

I have checked; the host that cannot flash has this installed.

ii zlib1g-dev:amd64 1:1.2.11.dfsg-2ubuntu1.5 amd64 compression library - development

hello alxhoff,

may I have more details about the host that cannot flash?
for instance, is it running native Ubuntu, or is it using a virtual machine?

I will put together a comprehensive list of the machines we know work and the ones that don’t and get back to you. I believe the split between working and non-working machines in our office is about 50-50, and I don’t think any two machines are the same.

They are all running native Ubuntu.
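
To make that list comparable across machines, something like the following could be run on each flashing host. It is a rough sketch: host_report is a made-up name, and the facts it collects (distribution, kernel, virtualization, zlib1g-dev state) are simply the ones that have come up in this thread so far, not a known list of causes.

```shell
#!/usr/bin/env bash
# Sketch: print one "key=value" line per host fact so that reports from
# working and non-working flashing hosts can be diffed against each other.

host_report() {
    local os="unknown" kernel virt="unknown" zlib="unknown"

    # Distribution name from os-release, if the file exists.
    if [ -r /etc/os-release ]; then
        os="$(. /etc/os-release && printf '%s' "${PRETTY_NAME:-unknown}")"
    fi

    kernel="$(uname -r)"

    # systemd-detect-virt prints "none" on bare metal (with a non-zero exit).
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        virt="$(systemd-detect-virt 2>/dev/null || true)"
        [ -n "$virt" ] || virt="unknown"
    fi

    # Package state of zlib1g-dev as reported by dpkg, if dpkg is present.
    if command -v dpkg-query >/dev/null 2>&1; then
        zlib="$(dpkg-query -W -f='${Status}' zlib1g-dev 2>/dev/null || echo 'not-installed')"
    fi

    printf 'os=%s\nkernel=%s\nvirt=%s\nzlib1g-dev=%s\n' "$os" "$kernel" "$virt" "$zlib"
}
```

Running host_report > "$(hostname).txt" on each machine and diffing a working host against a non-working one would at least rule these factors in or out.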

I wonder if it would be reasonable to update ota_tools and try
nv_ota_start.sh to see if that improves the percentage of units successfully updated?

ls -R nvidia/nvidia_sdk/JetPack_6.2_Linux_JETSON_AGX_ORIN_TARGETS/Linux_for_Tegra/tools/ota_tools

./old_version_upgrade:
init nv_ota_disk_enc.func nv_ota_internals.sh nv_ota_utils.func ota_make_recovery_img_dtb.sh
nv_ota_common.func nv_ota_exception_handler.sh nv_ota_log.sh nv_recovery.sh recovery_copy_binlist.txt

./version_upgrade:
demo_host_ota_uefi_sb.sh l4t_ota_sign_enc_uefi_base.sh nv_ota_log.sh nv_ota_update_alt_part.func nv_update_alt_part.sh
demo_target_ota_uefi_sb.sh nv_ota_common.func nv_ota_preserve_data.sh nv_ota_update_implement.sh ota_backup_files_list.txt
Image_based_OTA_Examples.txt nv_ota_common_utils.func nv_ota_rootfs_updater.sh nv_ota_update_rootfs_in_recovery.sh ota_board_specs.conf
l4t_generate_ota_package.sh nv_ota_customer.conf nv_ota_run_tasks.sh nv_ota_update.sh ota_multi_board_specs.sh
l4t_gen_uefi_sb_overlay.sh nv_ota_decompress_package.sh nv_ota_start.sh nv_ota_validate.sh ota_validate_params.sh

curl --output ota_tools.tbz2 https://developer.download.nvidia.com/embedded/L4T/r36_Release_v4.3/release/ota_tools_R36.4.3_aarch64.tbz2

Thanks for the suggestion @whitesscott. This might help further down the line, but for now our issue is that flashing with flash.sh doesn’t always seem to replace the factory bootloader (v0.0.1), which means there isn’t a bootloader capable of handling the OTA update process.

Do you know if the OTA tools replace the bootloader before running the install? And if so, do they require a non-factory bootloader (i.e. something other than v0.0.1)?