Redundant A/B rootfs not switching with set-active-boot-slot but working with set-SR-BR

hello diogojusten,

is it possible to use an Over-the-Air update to move to the latest JP-5 release version?

Hi @JerryChang,

It is possible to ship a next image version based on a newer JP5, but for devices already in the field, reflashing via USB is our last option: they are in different countries, so it is not easy.

I enabled the EDK2 debug flag and ran my script, which reboots the device multiple times and performs an OTA update (capsule update) on every 5th boot. The problem happened again. What caught my eye is the ASSERT_EFI_ERROR:

Jetson UEFI firmware (version v35.4.1 built on 2023-08-04T23:11:08+00:00)
ESC   to enter Setup.
F11   to enter Boot Manager Menu.
Enter to continue boot.
**********************************
**  WARNING: Test Key is used.  **
**********************************
**  WARNING: Test Key is used.  **
......PROGRESS CODE: V03051007 I0

ASSERT_EFI_ERROR (Status = Bad Buffer Size)
ASSERT [VariableRuntimeDxe] /home/diogo/work/distro/build/tmp/work/jetson_agx_xavier-oe4t-linux/edk2-firmware-tegra/35.4.1-r0/edk2-tegra/edk2/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3255): !(((INTN)(RETURN_STATUS)(Status)) < 0)

Also from the logs:

RecordVarErrorFlag (0xEF) BootChainFwNext:781E084C-A330-417C-B678-38E696380CB9 - 0x00000007 - 0x60
CommonVariableSpace = 0x1FF9C - CommonVariableTotalSize = 0x1FF94
ProcessEspVariable: Failed to Set variable BootChainFwNext Bad Buffer Size
GetAndProcessEspVariables: Failed to process BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9

The efi variable VarErrorFlag that I mentioned before is set every boot:

RecordVarErrorFlag (0xEF) MTC:EB704011-1402-11D3-8E77-00A0C969723B - 0x00000007 - 0x48
CommonVariableSpace = 0x1FF9C - CommonVariableTotalSize = 0x1FF94
...
RecordVarErrorFlag (0xEF) PlatformConfigData:ED3374EF-767B-42FA-AF70-DB520A392822 - 0x00000003 - 0xEA
CommonVariableSpace = 0x1FF9C - CommonVariableTotalSize = 0x1FF94
PlatformConfigured: Error setting Platform Config data: Bad Buffer Size
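
For readers following the logs: the first field of each RecordVarErrorFlag line (0xEF here) is the flag value being recorded, and it is the same byte that efivar shows for VarErrorFlag. Below is a minimal sketch for decoding that byte. It assumes the flag values defined in EDK2's VarErrorFlag.h (0xFF no error, 0xEF system error, 0xFE user error, combined by AND); the function name is hypothetical and not part of any tool used in this thread.

```shell
# Hypothetical helper: map a VarErrorFlag byte (two hex digits) to a meaning.
# Flag values are assumed from EDK2's VarErrorFlag.h.
describe_var_error_flag() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        ff) echo "no error" ;;
        ef) echo "system error" ;;
        fe) echo "user error" ;;
        ee) echo "system and user error" ;;
        *)  echo "unknown" ;;
    esac
}
```

In the logs above the recorded value is 0xEF, i.e. the "system error" case, and it shows up together with the Bad Buffer Size failures.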

hello diogojusten,

let me reply to this…

actually no, it's possible to move to the newer release via OTA update, without a USB connection.

could you please also share the script, or, reproduce steps for us to test locally.

Hi @JerryChang,

Not really, as it isn’t switching slots when the problem occurs. It also isn’t updating the bootloader as EDK2 is “crashed”.

To reproduce the issue, I run shutdown to turn the board off, and the ignition signal turns it on again. On every 5th boot I run the OTA update with Mender. Basically, it writes the rootfs to the standby partition, copies the capsule update file to /boot/efi/EFI/UpdateCapsule/ (which is also mounted as /opt/nvidia/esp/EFI/UpdateCapsule/), and writes \x04\x00\x00\x00\x00\x00\x00\x00 to /boot/efi/EFI/NVDA/Variables/OsIndications-8be4df61-93ca-11d2-aa0d-00e098032b8c. This procedure works and the device gets updated. After a number of cycles, around 150 boots, the problem occurs.
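
The eight bytes written to the OsIndications file are the little-endian 64-bit value 0x4, which the UEFI specification names EFI_OS_INDICATIONS_FILE_CAPSULE_DELIVERY_SUPPORTED (capsule-on-disk delivery). Below is a minimal sketch of that single step. The function name and its argument are hypothetical, and this is not the Mender script used in the test.

```shell
# Hypothetical helper: write the 8-byte little-endian OsIndications value 0x4
# (capsule-on-disk request) to the file given as $1.
# Octal escapes are used because not every printf understands \x escapes.
write_os_indications() {
    printf '\004\000\000\000\000\000\000\000' > "$1"
}
```

On the board, the argument would be the OsIndications-8be4df61-93ca-11d2-aa0d-00e098032b8c file under the ESP Variables directory mentioned above.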

First I set up my PC to access the ECU via ssh without a password, then I run the script below. When VarErrorFlag is set to ef, the script stops:

#!/bin/bash
# Reboot/OTA stress loop; exits as soon as VarErrorFlag no longer reads ff.
TARGET_IP="${1-192.168.100.101}"
TARGET_USER="${2-user}"
TARGET_PASS="${3-password}"
count=0

TARGET="${TARGET_USER}@${TARGET_IP}"

# Block until the target stops answering pings.
wait_for_down() {
    while ping -w 1 -c 1 "${TARGET_IP}" > /dev/null 2>&1; do
        echo -n .
    done
    echo "disconnected"
}

while true; do
    count=$((count + 1))
    echo -n "Connecting (${count}) "
    repeat=true
    while ${repeat}; do
        ssh -o ConnectTimeout=5 "${TARGET}" 'efivar -p -n 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa-VarErrorFlag | grep ff' > /dev/null 2>&1
        result=$?
        case ${result} in
            0)
                # VarErrorFlag still reads ff: no error recorded
                repeat=false
                ;;
            255)
                # ssh could not connect yet: keep retrying
                ;;
            *)
                echo "Failed with code ${result} on count ${count}"
                exit ${result}
                ;;
        esac
    done
    echo "Passed iteration ${count}"

    if (( count % 5 == 1 )); then
        echo "Mender commit"
        ssh "${TARGET}" 'sudo -S mender-update commit' > /dev/null 2>&1 << EOF
${TARGET_PASS}
EOF
    fi

    if (( count % 5 == 0 )); then
        echo "Mender update"
        ssh "${TARGET}" 'sudo -S mender-update install /media/nvme/as-image-humble-jetson-agx-xavier.mender' > /dev/null 2>&1 << EOF
${TARGET_PASS}
EOF
        echo -n "Waiting for host to go down "
        ssh "${TARGET}" 'sudo -S reboot' > /dev/null 2>&1 << EOF
${TARGET_PASS}
EOF
        wait_for_down
    else
        sleep 40
        echo Shutdown
        ssh "${TARGET}" 'sudo -S shutdown now' > /dev/null 2>&1 << EOF
${TARGET_PASS}
EOF
        wait_for_down
    fi
done

hello diogojusten,

here are some points that need confirming:

  1. you're triggering the capsule update from the script after booting to the kernel:
    (1) Can you check that the esp is mounted at /opt/nvidia/esp every time before copying the capsule payload and writing the OsIndications variable?
    (2) Can you check whether the background service status is SUCCESS (sudo systemctl status nv-l4t-bootloader-config) before triggering a capsule update?

  2. for debugging, we need the full log of the issue; we cannot find any clue in the current log snippet.

  3. are you able to try the latest r35.6.0 to check whether the issue still exists? it's hard to update devices in the field, but you could test with a device at hand.

Hi @JerryChang,

I suppose it is always being mounted, I’ll add a check for it in my script.
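
A mount check along these lines could look like the sketch below. It only inspects a mounts table in /proc/mounts format. The function name is hypothetical; /opt/nvidia/esp is the mount point named in the question above.

```shell
# Hypothetical helper: succeed if $1 appears as a mount point in the mounts
# table $2 (defaults to /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Possible use before copying the capsule payload:
#   is_mounted /opt/nvidia/esp || { echo "ESP not mounted" >&2; exit 1; }
```

Taking the mounts table as an optional second argument keeps the check easy to try against a saved copy of /proc/mounts from the target.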

My image is based on Yocto build, so I don’t have nv-l4t-bootloader-config.service. My board is “eMMC only board”.

Here are the full UART logs. There are two boot logs; the second is from when the problem occurred:
logs_mender_update.txt (253.2 KB)

I’ll update to r35.5.0 as this is the latest kirkstone branch from meta-tegra and test again.


Is there a way to delete EFI variables from Linux (user space)? I suspect that instead of overwriting variables, it always creates a new one until the partition where the variables are stored is full. I mean this partition: efivarfs on /sys/firmware/efi/efivars type efivarfs (ro,nosuid,nodev,noexec,relatime).

From these logs:

RecordVarErrorFlag (0xEF) PlatformConfigData:ED3374EF-767B-42FA-AF70-DB520A392822 - 0x00000003 - 0xEA
CommonVariableSpace = 0x1FF9C - CommonVariableTotalSize = 0x1FF58

It is trying to write a variable of size 0xEA on top of a CommonVariableTotalSize of 0x1FF58, which adds up to more than CommonVariableSpace (0x1FF9C). But my question is: why is CommonVariableTotalSize already around 0x1FF58 when the error occurs?
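
The arithmetic behind that observation, as a small sketch. It assumes the firmware rejects a write when the used size plus the new record size exceeds the available space, which is how the log lines read. The function name is hypothetical.

```shell
# Hypothetical helper: succeed if a record of size $2 still fits when $1 bytes
# of the $3-byte variable space are already used. Arguments may be hex (0x...).
fits_in_variable_space() {
    [ $(( $1 + $2 )) -le $(( $3 )) ]
}

# Numbers taken from the log lines above.
printf 'needed: 0x%X of 0x%X\n' $(( 0x1FF58 + 0xEA )) $(( 0x1FF9C ))
printf 'free:   0x%X bytes\n' $(( 0x1FF9C - 0x1FF58 ))
if fits_in_variable_space 0x1FF58 0xEA 0x1FF9C; then
    echo "fits"
else
    echo "does not fit"
fi
```

With these values the write would need 0x20042 bytes in total while only 0x44 bytes are free, so the 0xEA-byte record does not fit.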

BTW,
the following log looks strange:
ProcessEspVariable: Failed to Set variable BootChainFwNext Bad Buffer Size
GetAndProcessEspVariables: Failed to process BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9

for the Xavier series, when we set a UEFI variable from the kernel, we write the variable file to the esp.
so, for the BootChainFwNext variable, nvbootctrl set-active-boot-slot 1/0 will write BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9 to the esp partition. did you call nvbootctrl set-active-boot-slot 1/0, or write the BootChainFwNext file to the esp, during the test flow?

When I run the OTA update, I use this script to copy the capsule update file to the esp partition and this script to set OsIndications-8be4df61-93ca-11d2-aa0d-00e098032b8c. As the next step I would expect EDK2 to perform the capsule update, but as mentioned it isn't happening.

I enclose two more logs. In the first, I'm just rebooting the device that has the slot switching issue.
The other log is from when I run nvbootctrl set-active-boot-slot 1. Only in that second log is the message about BootChainFwNext printed.
reboot.txt (126.9 KB)
set-active-boot-slot.txt (128.7 KB)

I just finished my test with 35.5.0 and the issue is happening too. Full logs with the issue:
logs35-5.txt (935.6 KB)

It seems the issue is about CommonVariableTotalSize: this value starts small, around 0xD0F4, and keeps being incremented until it reaches the space defined by CommonVariableSpace = 0x1FF9C. In that same state, if I try to change anything in the UEFI boot menu, it doesn't allow me to save any change. That makes me think again that the EFI variable space is getting full and EDK2 isn't able to change anything.

hello diogojusten,

we've tried to reproduce this issue locally, and it turns out the environment variable ROOTFS_AB is crucial for both image flashing and BUP generation.

let’s double confirm the steps.
for instance,
(1) you should flash Jetpack release on AGX-Xavier with Rootfs-AB enabled
$ sudo ROOTFS_AB=1 ./flash.sh jetson-xavier mmcblk0p1
(2) you should also confirm you’ve added “ROOTFS_AB=1” when generating the BUP
$ sudo ROOTFS_AB=1 ./l4t_generate_soc_bup.sh t19x

Hi @JerryChang,
Yes, I’ve dual boot enabled and it works, including bootloader update.

Let me share the latest test result. The issue doesn't require updating the system; it happens just by rebooting multiple times (around 370 times).
I added a debug message inside EDK2, in edk2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableNonVolatile.c (at r35.5.0-edk2-stable202208 in NVIDIA/edk2 on GitHub). Below are the changes: I added the two lines with "DIOGO14" and built with EDK2_BUILD_MODE:pn-edk2-firmware-tegra = "DEBUG":

  //
  // Parse non-volatile variable data and get last variable offset.
  //
  Variable = GetStartPointer (mNvVariableCache);
  while (IsValidVariableHeader (Variable, GetEndPointer (mNvVariableCache))) {
    NextVariable = GetNextVariablePtr (Variable, mVariableModuleGlobal->VariableGlobal.AuthFormat);
    DEBUG ((DEBUG_ERROR, "DIOGO14 Variable: %s\n", GetVariableNamePtr (Variable, mVariableModuleGlobal->VariableGlobal.AuthFormat)));
    VariableSize = (UINTN)NextVariable - (UINTN)Variable;
    if ((Variable->Attributes & (EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_HARDWARE_ERROR_RECORD)) == (EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_HARDWARE_ERROR_RECORD)) {
      mVariableModuleGlobal->HwErrVariableTotalSize += VariableSize;
    } else {
      mVariableModuleGlobal->CommonVariableTotalSize += VariableSize;
      DEBUG ((DEBUG_ERROR, "DIOGO14 CommonVariableSpace = 0x%x - CommonVariableTotalSize = 0x%x\n", mVariableModuleGlobal->CommonVariableSpace, mVariableModuleGlobal->CommonVariableTotalSize));

    }

    Variable = NextVariable;
  }

Every time the board reboots, it creates two new variables, MTC and PlatformConfigData, but it doesn't overwrite the old ones. I mean, it writes each new variable into a new allocation, and the system works until the free space is gone. See the UART logs below:
diogo14.txt (15.3 KB)
While CommonVariableTotalSize is smaller than CommonVariableSpace, everything works, but after ~370 reboots the free variable space is gone.
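
A rough back-of-the-envelope estimate of how long such an append-only store lasts, as a sketch under stated assumptions: the sizes in the RecordVarErrorFlag lines (0x48 for MTC, 0xEA for PlatformConfigData) are the full record sizes, exactly these two records are appended on every boot, and nothing is ever reclaimed. The function name is hypothetical.

```shell
# Hypothetical helper: number of boots until an append-only store is full.
# $1 total space, $2 bytes used at the start, $3 bytes appended per boot.
boots_until_full() {
    echo $(( ($1 - $2) / $3 ))
}

# 0x48 (MTC) + 0xEA (PlatformConfigData) per boot, starting from the 0xD0F4
# fill level quoted earlier in the thread.
boots_until_full 0x1FF9C 0xD0F4 $(( 0x48 + 0xEA ))   # prints 253
```

That is the same order of magnitude as the ~370 reboots observed. The starting fill level of that particular run is not stated here, so an exact match is not expected.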

hello diogojusten,

do I understand correctly that things have changed after adding the environment variable ROOTFS_AB to both image flashing and BUP generation?
for instance, it's now able to use nvbootctrl set-active-boot-slot 1/0 for switching root file system slots.

according to the above, it looks like a new issue: a memory leak.
please create a new thread (with a specific topic), and let's follow up in the new discussion thread.
you may link to this topic as see-also.
thanks

Hi @JerryChang,
The issue isn't related to ROOTFS_AB; it happens even when running with a single partition.

I created the new topic Possible UEFI memory leak and partition full and included the details there.
Once again, thank you for your help.
