[36.4.0] UEFI A/B capsule update bootloop after power loss

Dear NVIDIA support,

We encountered an issue during a UEFI capsule A/B update when the Jetson loses power during the bootloader update.
What we do during the update, and how to reproduce:

  • set_efi_var "RootfsStatus${other_slot}" "781e084c-a330-417c-b678-38e696380cb9" "\x00\x00\x00\x00"
  • Copy capsule file to /boot/efi/EFI/UpdateCapsule
  • set_efi_var "OsIndications" "8be4df61-93ca-11d2-aa0d-00e098032b8c" "\x04\x00\x00\x00\x00\x00\x00\x00"
  • reboot
  • cut power during bootloader update (-> other boot slot is corrupt now)
  • power on again
  • system rolls back to previous boot slot (which is ok)
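
The `set_efi_var` helper used in the steps above is not shown in this post. As a rough sketch only, assuming it writes through Linux efivarfs (each variable is a file named `<Name>-<GUID>` whose contents are a 4-byte little-endian attribute word followed by the data), the writes could be built as below. The function bodies, the `root` parameter, and the attribute value 0x7 are illustrative assumptions, not NVIDIA's implementation.

```python
import struct
from pathlib import Path

EFIVARFS = Path("/sys/firmware/efi/efivars")  # usual efivarfs mount point

# Attribute bits defined by the UEFI specification.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4
# Assumption: the variables are written as NV + BS + RT.
DEFAULT_ATTRS = (EFI_VARIABLE_NON_VOLATILE
                 | EFI_VARIABLE_BOOTSERVICE_ACCESS
                 | EFI_VARIABLE_RUNTIME_ACCESS)

EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"


def efivar_path(name, guid, root=EFIVARFS):
    """Return the efivarfs file path for a variable name and vendor GUID."""
    return Path(root) / f"{name}-{guid}"


def efivar_payload(data, attrs=DEFAULT_ATTRS):
    """Prefix the variable data with the 4-byte little-endian attribute word."""
    return struct.pack("<I", attrs) + bytes(data)


def set_efi_var(name, guid, data, root=EFIVARFS):
    """Hypothetical stand-in for the helper used in this post.

    Needs root on a real system; an existing efivarfs file may also be
    marked immutable and require `chattr -i` before it can be rewritten.
    """
    efivar_path(name, guid, root).write_bytes(efivar_payload(data))
```

For example, `efivar_payload(b"\x04" + b"\x00" * 7)` produces the 12 bytes that would be written for the OsIndications step. Passing a temporary directory as `root` lets the helper be exercised without touching real firmware variables.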

When we now reboot from software (sudo reboot), we end up in a continuous boot loop. Logs are attached.
How do we escape the bootloop after rollback?
What is the intended way to do a fail-safe update? Is our procedure correct?

We run our Orin on Jetson Linux 36.4.0. Our rootfs is on an external NVMe drive. Apart from the power-loss failure case, we can reliably perform A/B updates.

switching_boot_chain.log (103.7 KB)

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll address the questions and concerns raised by the user.

Issue Description:
The user is experiencing an issue with UEFI capsule A/B updates on Jetson AGX Orin. When the device loses power during the bootloader update, the other boot slot becomes corrupt. After a rollback, the system enters a continuous boot loop when rebooted from software.

Reproduction Steps:

  1. Set RootfsStatus${other_slot} to \x00\x00\x00\x00 using set_efi_var.
  2. Copy the capsule file to /boot/efi/EFI/UpdateCapsule.
  3. Set OsIndications to the specified value using set_efi_var.
  4. Reboot the device.
  5. Cut power during the bootloader update.
  6. Power on again, and the system rolls back to the previous boot slot.
  7. Reboot from software using sudo reboot, resulting in a continuous boot loop.

Escaping the Boot Loop:
To escape the boot loop after a rollback, you can try the following:

  1. Check the capsule update status using the nvbootctrl command. If the status does not indicate success (value 1), you may need to retry the update process.
  2. Verify that RootfsStatus is set to the correct value (for either the other slot or the current slot) using set_efi_var.
  3. Ensure that the OsIndications variable is set correctly to trigger the update process.
  4. If the issue persists, try resetting the OsIndications variable to its default value (0x00) and then retry the update process.
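
To make steps 3 and 4 concrete, the sketch below decodes an OsIndications value as read from efivarfs. It assumes the raw file contents are a 4-byte attribute word followed by the 64-bit little-endian value; the flag values come from the UEFI specification, while the function name and the shortened flag names are illustrative. The value `\x04\x00\x00\x00\x00\x00\x00\x00` from the original post corresponds to the file capsule delivery bit.

```python
import struct

# OsIndications bits from the UEFI specification (EFI_OS_INDICATIONS_* prefix dropped).
OS_INDICATION_FLAGS = {
    0x01: "BOOT_TO_FW_UI",
    0x02: "TIMESTAMP_REVOCATION",
    0x04: "FILE_CAPSULE_DELIVERY_SUPPORTED",
    0x08: "FMP_CAPSULE_SUPPORTED",
    0x10: "CAPSULE_RESULT_VAR_SUPPORTED",
    0x20: "START_OS_RECOVERY",
    0x40: "START_PLATFORM_RECOVERY",
    0x80: "JSON_CONFIG_DATA_REFRESH",
}


def decode_os_indications(raw):
    """Decode raw efivarfs contents: 4-byte attributes + 8-byte value.

    Returns (value, list_of_flag_names).
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    _attrs, value = struct.unpack("<IQ", raw)
    names = [name for bit, name in sorted(OS_INDICATION_FLAGS.items()) if value & bit]
    return value, names
```

If the decoded value still has the file capsule delivery bit set after the rollback, that would be one hint that the update request was never cleared; whether this is what drives the boot loop here is not confirmed by the logs in this thread.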

Intended Way to Perform Fail-Safe Update:
The provided context does not explicitly describe the intended way to perform a fail-safe update. However, based on the information provided, it appears that the user is following a correct approach:

  1. Setting RootfsStatus for the other slot to indicate that the update should be applied to the other boot slot.
  2. Copying the capsule file to the /boot/efi/EFI/UpdateCapsule directory.
  3. Setting OsIndications to trigger the update process.

However, it is essential to note that the user is using an external NVMe drive for the root file system, which may introduce additional complexity to the update process.

Comparison with the Provided Context:
The provided context describes the process for generating a specified partition BUP image and updating the bootloader using a capsule payload. The user’s approach seems to be consistent with this process, but the issue arises when the device loses power during the update process.

To minimize the risk of corruption and ensure a successful update, it is recommended to:

  1. Use a reliable power source during the update process.
  2. Verify the Capsule update status after the update process to ensure it was successful.
  3. Regularly back up the root file system to prevent data loss in case of a failure.
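
As an illustration of item 2, the sketch below extracts the capsule update status from the text output of `nvbootctrl dump-slots-info`. It assumes the output contains a line of the form `Capsule update status: <n>`; the exact format should be confirmed on the target release, and the function names are hypothetical.

```python
import re
import subprocess

# Assumed line format; verify against the real tool output on the target release.
_STATUS_RE = re.compile(r"^\s*Capsule update status:\s*(\d+)\s*$", re.MULTILINE)


def capsule_update_status(text):
    """Return the capsule update status as an int, or None if the line is absent."""
    match = _STATUS_RE.search(text)
    return int(match.group(1)) if match else None


def read_capsule_update_status():
    """Run nvbootctrl on the device and parse its output (typically requires root)."""
    result = subprocess.run(
        ["nvbootctrl", "dump-slots-info"],
        capture_output=True, text=True, check=False,
    )
    return capsule_update_status(result.stdout)
```

A post-update check could then compare the returned value against 1 (the success value mentioned above) before marking the update as complete.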

In summary, the user’s approach seems to be correct, but the issue arises due to the device losing power during the update process. To escape the boot loop, the user can try resetting the OsIndications variable and retrying the update process. To perform a fail-safe update, it is essential to follow the recommended process and take necessary precautions to minimize the risk of corruption.


Hi

Thanks for confirming we're on the right track. Is this a UEFI issue then?

hello markusta1,

the boot may be getting interrupted at the point where you see the message below for entering the UEFI menu.

  Jetson UEFI firmware (version v36.4.0 built on 2024-10-01T15:28:28+00:00)

I/TC: Reserved shared memory is disabled
I/TC: Dynamic shared memory is enabled
I/TC: Normal World virtualization support is disabled
I/TC: Asynchronous notifications are disabled
I/TC: WARNING: Test UEFI variable auth key is being used !
I/TC: WARNING: UEFI variable protection is not fully enabled !


Jetson System firmware version v36.4.0 date 2024-10-01T15:28:28+00:00
ESC   to enter Setup.
F11   to enter Boot Manager Menu.
Enter to continue boot.
......

Thanks for the response.

What kind of interrupt do you mean? Where and how should it be handled? This should all be done by UEFI, right? At this point we only see logs from there, and we're currently using the stock 36.4.0 UEFI version.

How exactly does the interrupt influence getting stuck in bootloop?

hello markusta1,

I mean the keyboard events for entering the UEFI menu.

The boot loop also happens when no serial console is connected. I only connected it to gather logs so that I would have a descriptive error to post here :)

@JerryChang any updates to this topic?

hello markusta1,

JetPack 6.2.2/r36.5.0 is available now; please move forward to the latest JP-6 release version for verification.
Besides, please set up a serial console to gather complete UART logs for details.
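
When reviewing a captured UART log, one quick check is to count how often the firmware banner appears, since each occurrence marks a fresh boot attempt. The sketch below assumes the banner format shown earlier in this thread (`Jetson UEFI firmware (version ... built on ...)`); the function name and the idea of using the banner as a boot marker are illustrative, not an NVIDIA tool.

```python
import re
from pathlib import Path

# Banner format taken from the log excerpt earlier in this thread.
BANNER_RE = re.compile(r"Jetson UEFI firmware \(version (\S+) built on")


def boot_attempts(log_text):
    """Return the firmware version seen at each boot banner in the log text."""
    return BANNER_RE.findall(log_text)


def boot_attempts_in_file(path):
    """Read a UART capture; undecodable bytes are replaced rather than failing."""
    return boot_attempts(Path(path).read_text(encoding="utf-8", errors="replace"))
```

For example, `len(boot_attempts_in_file("switching_boot_chain.log"))` gives the number of boot attempts captured in the attached log; a count that keeps growing without a login prompt in between is consistent with a boot loop.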