Redundant A/B rootfs not switching with set-active-boot-slot but working with set-SR-BR

Hi All,
I’m running an AGX Xavier board, and the system has randomly ended up in a strange state. Currently the rootfs is running on Slot 0:

root@jetson-agx-xavier:/# nvbootctrl  get-current-slot
0

I’m able to update the standby slot properly, but the system doesn’t boot from Slot 1:

root@jetson-agx-xavier:/# nvbootctrl set-active-boot-slot 1
root@jetson-agx-xavier:/# reboot 
...
root@jetson-agx-xavier:/# nvbootctrl  get-current-slot  
0

As you can see, it keeps booting from Slot 0. When I use set-SR-BR, however, the board boots properly from Slot 1, although it rolls back to Slot 0 after a cold reboot.

root@jetson-agx-xavier:/# nvbootctrl set-SR-BR 1
root@jetson-agx-xavier:/# reboot 
...
root@jetson-agx-xavier:/# nvbootctrl  get-current-slot
1

Do you have any idea what could be happening in this system? If I flash the same image again over the USB cable, everything works as expected, and I don’t know what leads to this condition. It isn’t the first time this has happened, and I haven’t been able to track down why yet. No config changes were made, only multiple shutdowns via the command line.
Thank you in advance.
BR, Diogo

Hello @diogojusten,

What do you get when you request the slot info?

nvbootctrl dump-slots-info

regards,
Andrew
Embedded Software Engineer at ProventusNova

This is what I get:

jetson-agx-xavier:/# nvbootctrl dump-slots-info
Current version: 35.4.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal
jetson-agx-xavier:/# nvbootctrl dump-slots-info -t rootfs
Current rootfs slot: A
Active rootfs slot: A
num_slots: 2
slot: 0,             retry_count: 3,             status: normal
slot: 1,             retry_count: 3,             status: normal

After set-active-boot-slot 1, the dump shows B as the active slot, but after rebooting it goes back to A.
After set-SR-BR 1 and a reboot, the dump correctly shows slot B as both current and active. Again, this only holds across warm reboots; after a cold reboot, the system rolls back to slot A.
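
To make the "active but not current" state easier to spot across many reboots, here is a minimal sketch that reads dump-slots-info output on stdin and reports whether a switch is pending. The function name slot_switch_state is hypothetical (it is not part of nvbootctrl), and it only assumes the "Current ... slot:" / "Active ... slot:" lines shown above.

```shell
#!/bin/sh
# slot_switch_state: hypothetical helper, not part of nvbootctrl.
# Reads `nvbootctrl dump-slots-info [-t rootfs]` output on stdin and
# compares the "Current ... slot" line with the "Active ... slot" line.
slot_switch_state() {
    awk -F': ' '
        /^Current (bootloader|rootfs) slot:/ { current = $2 }
        /^Active (bootloader|rootfs) slot:/  { active = $2 }
        END {
            # Strip trailing whitespace or carriage returns from the values.
            sub(/[ \t\r]+$/, "", current)
            sub(/[ \t\r]+$/, "", active)
            if (current == "" || active == "") { print "unknown"; exit 1 }
            if (current == active)
                print "no switch pending (" current ")"
            else
                print "switch pending: " current " -> " active
        }'
}
```

On the board it would be used as: nvbootctrl dump-slots-info -t rootfs | slot_switch_state, once before the reboot and once after.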

@diogojusten,

Thanks for getting back with more details.

That is certainly interesting.

Would it be possible for you to connect to the board through UART and see logs while booting after changing the active slot?

regards.
Andrew
Embedded Software Engineer at ProventusNova

Unfortunately, I only have remote access to that specific device, as it is in the field.
The only thing that catches my eye is the VarErrorFlag UEFI variable, which has the value ef, while all the other devices I have show the value ff. Honestly, I don’t know if it has any relation to the slot switching.

jetson-agx-xavier:/# efivar  -p -n 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa-VarErrorFlag
GUID: 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa
Name: "VarErrorFlag"
Attributes:
        Non-Volatile
        Boot Service Access
        Runtime Service Access
Value:
00000000  ef                                                |.               |

Bests,
Diogo

@diogojusten,

Got it, yeah, that makes it a bit more complex to debug.

I was reading through the documentation and according to the following diagram, it seems like when the change is applied to the scratch register, it works, but not when written into BR-BCT.

There is also an update BR-BCT flag in the scratch register, but no idea on how to set that up:

Maybe you can try booting into slot 1 using set-SR-BR and then marking it as successful. Or maybe marking slot 0 as unbootable?

regards,
Andrew
Embedded Software Engineer at ProventusNova

Hi @proventusnova, thank you for sharing your thoughts.
I don’t think it would be an issue with cold boot specifically because I can do the following:

nvbootctrl set-SR-BR 1
reboot
# system boots from correct Slot1

or

nvbootctrl set-active-boot-slot 1
reboot
# system boots from WRONG Slot0

In both cases I’m doing a warm reboot.

I’m not trying to set any partition manually as unbootable in order to avoid bricking the device.


Another observation about nvbootctrl: set-SR-BR writes to /dev/mem, while set-active-boot-slot creates the file /opt/nvidia/esp//EFI/NVDA/Variables/BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9

strace nvbootctrl set-active-boot-slot 1
....
faccessat(AT_FDCWD, "/opt/nvidia/esp//EFI/NVDA/Variables", R_OK) = 0
faccessat(AT_FDCWD, "/opt/nvidia/esp//EFI/NVDA/Variables/BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/nvidia/esp//EFI/NVDA/Variables/BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
write(3, "\7\0\0\0\1\0\0\0", 8)         = 8
...


####


strace nvbootctrl set-SR-BR 1
...
openat(AT_FDCWD, "/etc/nv_boot_control.conf", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=240, ...}, AT_EMPTY_PATH) = 0
read(3, "TNSPEC 2888-400-0004-P.0-1-2-sys"..., 4096) = 240
close(3)                                = 0
openat(AT_FDCWD, "/dev/mem", O_RDWR|O_SYNC) = 3
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0xc390000) = 0xffffa4a3d000
...

It seems like MB1 isn’t able to read the UEFI variable, but I can’t confirm it without having access to the UART logs.
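
For reference, the 8 bytes written by set-active-boot-slot in the strace above can be decoded with a small sketch like the one below. It assumes the file uses the same layout as efivarfs (a 4-byte little-endian attribute word followed by the payload) and that the payload is a 4-byte little-endian integer; both are assumptions on my side that happen to fit the observed bytes, not something confirmed by NVIDIA documentation. The function name decode_efivar_blob is hypothetical.

```shell
#!/bin/sh
# decode_efivar_blob: hypothetical helper. Reads exactly 8 raw bytes on stdin.
# Assumed layout: 4-byte little-endian attributes, then a 4-byte
# little-endian value (this matches the bytes seen in the strace).
decode_efivar_blob() {
    # od prints each byte as an unsigned decimal; word splitting turns
    # them into positional parameters $1..$8.
    set -- $(od -An -v -t u1)
    [ "$#" -eq 8 ] || { echo "expected 8 bytes, got $#" >&2; return 1; }
    attrs=$(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
    value=$(( $5 + ($6 << 8) + ($7 << 16) + ($8 << 24) ))
    names=""
    # Standard UEFI variable attribute bits.
    [ $(( attrs & 1 )) -ne 0 ] && names="$names NV"   # Non-Volatile
    [ $(( attrs & 2 )) -ne 0 ] && names="$names BS"   # Boot Service Access
    [ $(( attrs & 4 )) -ne 0 ] && names="$names RT"   # Runtime Service Access
    printf 'attributes=0x%x (%s) value=%d\n' "$attrs" "${names# }" "$value"
}
```

Feeding it the bytes from the strace, printf '\7\0\0\0\1\0\0\0' | decode_efivar_blob, prints attributes=0x7 (NV BS RT) value=1, which would be consistent with "next boot chain = 1" under the assumptions above.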

hello diogojusten,

let me double confirm which Jetpack/L4T release version you’re working with,
you may check release tag for confirmation, i.e. $ cat /etc/nv_tegra_release

besides,
did you try to corrupt the root file system for testing? what are your detailed steps for switching the RootFS slot?
you may also gather complete booting logs for cross-checking, thanks

Hi @JerryChang , thank you for your message.
I’m using JetPack 5.1.2, L4T release 35.4.1.
I didn’t try to corrupt the file system, as the device is in the field and I don’t have easy physical access to it, which also makes it difficult to get the UART logs.

To switch Slots I’ve tried three different ways:

  • nvbootctrl set-active-boot-slot N && reboot. This doesn’t switch slots. The first device is running Slot A and I’m trying to switch to B, but it stays on A. The second device is running Slot B, and the same thing happens: it keeps running Slot B after nvbootctrl set-active-boot-slot 0 && reboot. After set-active-boot-slot, the output of dump-slots-info is correct and reports the other slot as active.
  • nvbootctrl set-SR-BR N && reboot. This is the only way I’m able to switch slots, and only with a warm reboot. With this command the board boots from the new slot and runs properly. If a cold reboot happens, it rolls back to the previous slot.
  • A capsule update: copying the capsule to /opt/nvidia/esp/EFI/UpdateCapsule/TEGRA_BL.Cap and setting the EFI runtime variable OsIndications-8be4df61-93ca-11d2-aa0d-00e098032b8c to "\x07\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00". After rebooting, the board boots from the same slot, and the dump still shows Capsule update status: 0.

I currently have two devices with this issue. The first one was flashed once over micro-USB and ran for a few weeks without any issue, but now it doesn’t switch rootfs slots. This device is currently running slot A.
The second device was flashed over micro-USB, received a firmware update, and was able to update and switch slots without any issue. It is currently unable to switch slots. This device is running slot B.

The only thing I see in common on both devices is 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa-VarErrorFlag showing ef, as described in post #5 of this thread.
According to the EDK2 source code (edk2/MdeModulePkg/Include/Guid/VarErrorFlag.h at commit 21e8a85653e104385bfb8218fe22a72053bd3d2d in tianocore/edk2), this value is VAR_ERROR_FLAG_SYSTEM_ERROR. Is this variable set on every boot, or does it need to be cleared manually once it has been set? And what could cause the system to set it?
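
For anyone comparing devices, a minimal sketch that maps the byte printed by efivar to the three names defined in VarErrorFlag.h (0xFF no error, 0xEF system error, 0xFE user error). The function name describe_var_error_flag is hypothetical, and the sketch deliberately treats any other value as unknown rather than guessing at its meaning.

```shell
#!/bin/sh
# describe_var_error_flag: hypothetical helper.
# Takes the one-byte VarErrorFlag value as two hex digits (e.g. "ef") and
# prints the matching name from EDK2's VarErrorFlag.h.
describe_var_error_flag() {
    case $(printf '%s' "$1" | tr 'A-F' 'a-f') in
        ff) echo "VAR_ERROR_FLAG_NO_ERROR" ;;
        ef) echo "VAR_ERROR_FLAG_SYSTEM_ERROR" ;;
        fe) echo "VAR_ERROR_FLAG_USER_ERROR" ;;
        *)  echo "unknown flag value: $1"; return 1 ;;
    esac
}
```

For the affected devices, describe_var_error_flag ef prints VAR_ERROR_FLAG_SYSTEM_ERROR, while the healthy ones (ff) report VAR_ERROR_FLAG_NO_ERROR.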
Thank you for your help in advance.
Bests,
Diogo

hello diogojusten,

you should also specify -t option for switching RootFS slots, otherwise it’ll switch bootloader slots (default settings)
for instance, $ sudo nvbootctrl -t rootfs set-active-boot-slot <slot>
besides, could you please also call nvbootctrl verify, which verifies the bootloader and the rootfs boot.

BTW,
is it possible to upgrade to the latest JP-5? we’ve recently made the JetPack 5.1.4 release public.

If I’m not mistaken, setting it without -t rootfs switches both, and set-SR-BR switches both as well. Anyhow, the behavior is the same:

jetson-agx-xavier:/# nvbootctrl  dump-slots-info
Current version: 35.4.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

jetson-agx-xavier:/# nvbootctrl set-active-boot-slot 1 -t rootfs

jetson-agx-xavier:/# nvbootctrl  dump-slots-info
Current version: 35.4.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: B
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

jetson-agx-xavier:/# nvbootctrl  dump-slots-info -t rootfs
Current rootfs slot: A
Active rootfs slot: B
num_slots: 2
slot: 0,             retry_count: 3,             status: normal
slot: 1,             retry_count: 3,             status: normal

jetson-agx-xavier:/# reboot

...

jetson-agx-xavier:/# nvbootctrl  dump-slots-info -t rootfs
Current rootfs slot: A
Active rootfs slot: A
num_slots: 2
slot: 0,             retry_count: 3,             status: normal
slot: 1,             retry_count: 3,             status: normal

jetson-agx-xavier:/# nvbootctrl  dump-slots-info
Current version: 35.4.1
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

jetson-agx-xavier:/# nvbootctrl verify
Info: variable BootChainFwStatus is not found.

jetson-agx-xavier:/# echo $?
0

Updating to JP 5.1.4 could be an option for new devices, but for those in the field it would require flashing locally, which isn’t an option.

hello diogojusten,

we’ve tested and confirmed that Rootfs-A/B switching works correctly on JetPack 5.1.4.

this should be a failure on r35.4.1 only; we don’t usually back-port fixes.
please upgrade Jetpack version (if that’s possible) to the latest JP-5 release version,
thanks

Hi @JerryChang,
thank you again for your reply.

One additional piece of information: if I re-flash the device with the same image, everything works as expected and I can switch slots. I still don’t understand why it ends up in this bad state after some time.

Do you have anything to add about the VarErrorFlag?

jetson-agx-xavier:/# efivar  -p -n 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa-VarErrorFlag
GUID: 04b37fe8-f6ae-480b-bdd5-37d98c5e89aa
Name: "VarErrorFlag"
Attributes:
        Non-Volatile
        Boot Service Access
        Runtime Service Access
Value:
00000000  ef                                                |.               |

Is there a way to reset the EFI variables to their defaults? Not the runtime EFI variables, but the persistent ones?

Sharing the way I’m currently updating the device:

  • System running Slot0. Switching slot with nvbootctrl set-SR-BR 1
  • System booting from the other slot
  • System running Slot1
  • Running the update (updating Slot0)
  • Cold reboot
  • System running Slot0 and updated (but I can’t make Slot1 active).
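
Since the reboot steps above can silently land in the wrong slot, a small post-reboot check can help. This is only a sketch: check_booted_slot is a hypothetical helper name, and it assumes nvbootctrl get-current-slot prints just the slot number (0 or 1), as in the output shown earlier in this thread.

```shell
#!/bin/sh
# check_booted_slot EXPECTED [ACTUAL]
# Hypothetical helper: compares the slot that was requested with the slot
# the system is actually running. ACTUAL defaults to the output of
# `nvbootctrl get-current-slot`; passing it explicitly allows offline use.
check_booted_slot() {
    expected=$1
    actual=${2:-$(nvbootctrl get-current-slot)}
    if [ "$actual" = "$expected" ]; then
        echo "ok: running slot $actual"
    else
        echo "mismatch: requested slot $expected, running slot $actual"
        return 1
    fi
}
```

After nvbootctrl set-SR-BR 1 and a reboot, running check_booted_slot 1 on the board should succeed; a non-zero exit status would mean the update must not proceed.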

Thank you in advance.
Bests,
Diogo

hello diogojusten,

may I also confirm your kernel command line within extlinux.conf, for instance, there should be "root=PARTUUID=xxxx", right?
besides, please gather the kernel messages (i.e. $ dmesg | grep "Kernel command line") when booting from slot-A/B for comparison.

Hi @JerryChang,

jetson-agx-xavier:/# cat /proc/cmdline 
mminit_loglevel=4 console=ttyTCU0,115200 fbcon=map:0 video=efifb:off quiet pcie_aspm=off kpti=0 nospectre_v1 nospectre_v2 nospectre_bhb 

jetson-agx-xavier:/# dmesg | grep "Kernel command line"
[    0.000000] Kernel command line: mminit_loglevel=4 console=ttyTCU0,115200 fbcon=map:0 video=efifb:off quiet pcie_aspm=off kpti=0 nospectre_v1 nospectre_v2 nospectre_bhb 
[    7.521148] systemd[1]: Starting Generate network units from Kernel command line...
[    7.542014] systemd[1]: Finished Generate network units from Kernel command line.

The rootfs is mounted by /init (UART logs from my local device without slot switch issue):

...
[    8.129793] Run /init as init process
Mounting /dev/mmcblk0p2...
[    8.618127] EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
Switching to rootfs on /dev/mmcblk0p2...
...

BR,
Diogo

BTW,

it’s VAR_ERROR_FLAG_SYSTEM_ERROR.

Yes, I mentioned this in a previous comment (post #10 of this thread).

Let’s go back to the main question of this topic: why does nvbootctrl set-SR-BR 1 work while nvbootctrl set-active-boot-slot 1 doesn’t?

Hi All,
After almost 200 reboots and 40 OTA update runs, I was able to reproduce the issue on a device where I have UART console access.
It seems edk2 isn’t reading the EFI variables and therefore isn’t triggering the slot switch.
Over the next few days, I’ll enable edk2 debug output to better understand what is happening.

hello diogojusten,

FYI,
set-SR-BR uses a scratch register to switch the slot, so it only works across a warm reboot.
set-active-boot-slot uses UEFI variables to switch the slots; it can update BRBCT, so it also works across a cold boot.

according to the above,
on AGX Xavier, the nvbootctrl set-active-boot-slot option sets a UEFI variable via the ESP partition to switch the boot slot.
please check whether the ESP partition is mounted at the /opt/nvidia/esp directory when triggering the capsule update with the above commands.

Once again, thank you for your reply @JerryChang.

ESP partition is mounted properly:

jetson-agx-xavier:/# lsblk  |grep esp
|-mmcblk0p41 259:11   0    64M  0 part /opt/nvidia/esp
jetson-agx-xavier:/# fdisk -l |grep -i efi
/dev/mmcblk0p41 60073216 60204287   131072    64M EFI System
jetson-agx-xavier:/# mount |grep efi
efivarfs on /sys/firmware/efi/efivars type efivarfs (ro,nosuid,nodev,noexec,relatime)
/dev/mmcblk0p41 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
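
Note that lsblk reports /opt/nvidia/esp while mount | grep efi reports /boot/efi for the same partition; the grep cannot match a /opt/nvidia/esp line because that path does not contain "efi". To list every mount point of the partition in one go, here is a minimal sketch that reads a mount table in /proc/mounts format. The function names mountpoints_of and esp_mounted_at are hypothetical.

```shell
#!/bin/sh
# mountpoints_of DEVICE [MOUNT_TABLE]
# Hypothetical helper: prints every mount point of DEVICE, one per line.
# MOUNT_TABLE defaults to /proc/mounts (fields: device, mount point, ...).
mountpoints_of() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}

# esp_mounted_at DEVICE DIRECTORY [MOUNT_TABLE]
# Succeeds only if DEVICE is mounted exactly at DIRECTORY.
esp_mounted_at() {
    mountpoints_of "$1" "${3:-/proc/mounts}" | grep -Fxq "$2"
}
```

On the board, esp_mounted_at /dev/mmcblk0p41 /opt/nvidia/esp && echo mounted would confirm the mount that nvbootctrl relies on, independently of how many other places the partition is mounted.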

Unfortunately, the UART output didn’t give me much information, as debug messages are off by default. The only relevant information is an ASSERT in VariableRuntimeDxe:

...
Jetson UEFI firmware (version v35.4.1 built on 2023-08-04T23:11:08+00:00)
ESC   to enter Setup.
F11   to enter Boot Manager Menu.
Enter to continue boot.
**  WARNING: Test Key is used.  **
......ASSERT [VariableRuntimeDxe] /home/diogo/work/distro/build/tmp/work/jetson_agx_xavier-oe4t-linux/edk2-firmware-tegra/35.4.1-r0/edk2-tegra/edk2/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3255): !(((INTN)(RETURN_STATUS)(Status)) < 0)

L4TLauncher: Attempting Direct Boot
I/TC: Secondary CPU 1 initializing
...