Possible UEFI memory leak and partition full

Hi All,

Some devices with AGX Xavier modules were unable to run OTA updates after some time. The first investigation was done in the thread "Redundant A/B rootfs not switching with set-active-boot-slot but working with set-SR-BR", where the problem was that the system (EDK2) was not switching slots.

After further investigation, it seems that the partition where the UEFI variables are stored fills up after the board has been rebooted around 370 times, just by running the reboot command repeatedly.
Every time the device boots, it creates two new variables, MTC and PlatformConfigData. It does not seem to remove or overwrite the old variables; instead it allocates new space to store the new values.

I started to debug EDK2 and EDK2-nvidia by adding a debug message at this line to print CommonVariableSpace, which is always 0x1FF9C, and CommonVariableTotalSize, which is zero after flashing the board over USB and increases on every boot.
When the value reaches 0x1FF9C, it sets VarErrorFlag to 0xEF. From then on the system (EDK2) never switches slots again: nvbootctrl sets the runtime variable, but EDK2 cannot perform any operation on the variables because saving fails when the memory region is full.
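As a rough cross-check of these numbers, here is a minimal sketch of the quota behaviour described above. It is a toy model, not the EDK2 code: it assumes the store grows linearly on every boot, and the per-boot growth is only an estimate derived from the ~370 reboots reported here. The flag names follow what I believe EDK2's VarErrorFlag.h defines (0xFF for no error, 0xEF for a system error); the function names are hypothetical.

```python
# Toy model of the variable-store quota; values come from this post, not from EDK2.
COMMON_VARIABLE_SPACE = 0x1FF9C      # printed by the debug message (130972 bytes)
VAR_ERROR_FLAG_NO_ERROR = 0xFF       # assumed to match EDK2's VarErrorFlag.h
VAR_ERROR_FLAG_SYSTEM_ERROR = 0xEF   # value observed once the store is full


def estimated_bytes_per_boot(space: int, boots_until_full: int) -> float:
    """Average growth per boot, assuming linear growth (an assumption)."""
    return space / boots_until_full


def simulate_boots(space: int, bytes_per_boot: int, boots: int):
    """Return (total_size, flag) after the given number of boots.

    Each boot appends bytes_per_boot; nothing is ever reclaimed.
    The first write that would exceed the quota is rejected and
    the error flag is set.
    """
    total = 0
    flag = VAR_ERROR_FLAG_NO_ERROR
    for _ in range(boots):
        if total + bytes_per_boot > space:
            flag = VAR_ERROR_FLAG_SYSTEM_ERROR
            break
        total += bytes_per_boot
    return total, flag


if __name__ == "__main__":
    per_boot = round(estimated_bytes_per_boot(COMMON_VARIABLE_SPACE, 370))
    print("estimated bytes per boot:", per_boot)
    print(simulate_boots(COMMON_VARIABLE_SPACE, per_boot, 370))
```

If the linear assumption holds, this points to roughly 350 bytes of new variable data per boot, which would be consistent with two small variables being rewritten each time without the old copies being reclaimed.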

Also, during my tests, MAYBE (I need your confirmation) it is okay to keep the old variables, since it marks them with State &= VAR_DELETED; in this file and later runs garbage collection by calling the Reclaim() function.
The issue is that, when FvbWrite() runs, it sets LbaBoundaryCrossed = TRUE and returns EFI_BAD_BUFFER_SIZE.
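To make the hypothesis concrete, below is a toy model of an append-only variable store in which old copies are only marked as deleted and space is recovered by a separate reclaim step. It is a sketch under assumptions, not the EDK2 implementation: the class, its methods, and the `reclaim_works` switch are hypothetical, and the state values (0x3F for an added variable, 0xFD as the deleted mask) are what I understand EDK2's variable format to use.

```python
# Toy append-only variable store; illustrates the suspected failure mode only.
VAR_ADDED = 0x3F     # assumed EDK2 state for a valid variable
VAR_DELETED = 0xFD   # assumed EDK2 mask, applied as State &= VAR_DELETED


class ToyVariableStore:
    def __init__(self, capacity: int, reclaim_works: bool = True):
        self.capacity = capacity
        self.reclaim_works = reclaim_works  # False models a failing Reclaim()
        self.records = []                   # each record: [name, size, state]

    def used(self) -> int:
        # Deleted records still occupy space until they are reclaimed.
        return sum(size for _, size, _ in self.records)

    def reclaim(self) -> None:
        """Garbage collection: keep only records that are still valid."""
        if self.reclaim_works:
            self.records = [r for r in self.records if r[2] == VAR_ADDED]

    def set_variable(self, name: str, size: int) -> bool:
        if self.used() + size > self.capacity:
            self.reclaim()
        if self.used() + size > self.capacity:
            return False  # store is full, the write is rejected
        for record in self.records:
            if record[0] == name and record[2] == VAR_ADDED:
                record[2] &= VAR_DELETED  # mark the old copy, do not erase it
        self.records.append([name, size, VAR_ADDED])
        return True


if __name__ == "__main__":
    healthy = ToyVariableStore(capacity=100)
    print([healthy.set_variable("MTC", 30) for _ in range(6)])
    broken = ToyVariableStore(capacity=100, reclaim_works=False)
    print([broken.set_variable("MTC", 30) for _ in range(6)])
```

In this model, keeping the deleted copies is harmless as long as reclaim succeeds. If reclaim fails (for example because the underlying write returns an error), every later write is rejected, which matches the behaviour seen after the store fills up.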

I enclose a full log with three boots: the first two booted fine, and on the last one the error started to occur. After that point EDK2 isn’t able to do any OTA update or even change any variable.
error_with_logs.txt (565.1 KB)

Do you have any idea about this? Is it a memory leak? Is it a garbage collector issue?
Thank you in advance for your help.

Hi diogojusten,

Are you using the devkit or custom board for AGX Xavier?
Which JetPack version are you using?
What’s the target version for OTA?

Do you mean that some AGX Xavier units work as expected while others do not?

Please share the steps how you perform OTA for us to verify on the devkit.

Hi @KevinFFF,

Answering your questions:

I have a custom board with exactly the 2888-400-0004-P.0-1-2 NVIDIA module.

I tested with JP5.1.2 and JP5.1.3. In both cases I had the issue.

What exactly is this “target version”? Would it be the module version 2888-400-0004-P.0-1-2?

The issue isn’t related to OTA. Everything works fine, including OTA updates (before the issue happens). After that, it no longer allows switching slots, and therefore OTA stops working. There are two ways to reproduce the issue:

  1. Rebooting the target around 370 times
  2. Running ~130 OTA updates

After that, EDK2 starts to set 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa-VarErrorFlag to 0xEF and EDK2 isn’t able to switch slots anymore.
In case you want to add the debug message to monitor CommonVariableTotalSize, I applied that DEBUG message as described in this link
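For anyone who wants to check the VarErrorFlag value from Linux without rebuilding EDK2, the following is a minimal sketch. It assumes efivarfs is mounted at /sys/firmware/efi/efivars and that the variable is exposed at runtime on this platform, which I have not confirmed. The helper names are hypothetical; the file layout (a 4-byte little-endian attributes field followed by the data) is the standard efivarfs format.

```python
import struct
from pathlib import Path

# Assumed mount point and file name (efivarfs uses <Name>-<GUID>).
EFIVARS_DIR = Path("/sys/firmware/efi/efivars")
VAR_ERROR_FLAG_FILE = "VarErrorFlag-04b37fe8-f6ae-480b-bdd5-37d98c5e89aa"


def parse_efivarfs(raw: bytes):
    """Split an efivarfs file into (attributes, data)."""
    if len(raw) < 4:
        raise ValueError("efivarfs entry is shorter than the attribute header")
    (attributes,) = struct.unpack_from("<I", raw, 0)
    return attributes, raw[4:]


def read_var_error_flag(efivars_dir: Path = EFIVARS_DIR) -> int:
    """Return the first data byte of VarErrorFlag (0xFF is expected when healthy)."""
    raw = (efivars_dir / VAR_ERROR_FLAG_FILE).read_bytes()
    _attributes, data = parse_efivarfs(raw)
    if not data:
        raise ValueError("VarErrorFlag has no data")
    return data[0]


if __name__ == "__main__":
    try:
        print(f"VarErrorFlag = 0x{read_var_error_flag():02X}")
    except OSError as exc:
        # Typically: efivarfs not mounted, variable not exposed, or no permission.
        print("could not read VarErrorFlag:", exc)
```

A value of 0xEF here would correspond to the state described above, where the variable store is full and slot switching no longer works.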

I use Mender for system updates: it writes the rootfs to the standby partition, copies the capsuleUpdate file into the ESP partition, and sets bit 2 using oe4t-set-uefi-OSIndications. Just to avoid any miscommunication: I see the issue simply by rebooting the board multiple times, even with ROOTFS_AB disabled.