Need Help in Understanding Failover in RootFS A/B redundancy

sanaurrehman · March 17, 2023, 9:35am

Hi.

I need help understanding the root file system A/B redundancy feature. After reading the NVIDIA documentation and browsing several guides and posts about rootfs A/B redundancy, I developed the following understanding of it:

In rootfs A/B redundancy, the system contains two root file systems. If one of these two ‘slots’ is corrupted or not bootable, the system can fail over to the other slot (thus increasing the reliability of the Jetson in a production environment).

In order to implement this feature, I flashed my Jetson-AGX-Xavier with rootfs A/B redundancy using Jetpack 4.6. Then, to check whether this would work in a real data corruption scenario, I booted into slot A, corrupted the same slot (slot A), and then hard rebooted the system (by power cycling it). The thinking behind this was that in a real application we would be using one slot (say slot A), and if that slot were corrupted by, say, a sudden power failure, the system should boot into slot B.

However, the above-mentioned test failed. The system stayed stuck forever trying to boot from slot A (even though slot A was clearly corrupted) and would not fail over to slot B.

I carried out the same procedure to test the A/B redundancy feature on Jetpack 5.0.2 and on Jetpack 5.1 as well. However, every time I ran into the same outcome: the system would attempt to boot from the corrupted slot, be stuck there forever, and not boot from the other slot.

Finally, on Jetpack 5.0.2, I tried booting into slot B, then mounting slot A and removing its file system, and then using nvbootctrl to boot to slot A. In this case, the system does indeed fail over to slot B after failing to boot from slot A.

My question is this: Is the first method (working in slot A and corrupting/removing the file system of slot A) the correct method for testing rootfs A/B redundancy? Isn’t this the most accurate representation of real-life data corruption? Or was my test method completely wrong because I misunderstood things, and the second method (using slot B to remove slot A, then booting to slot A) is the correct way?

Any guidance on this would be appreciated. Thanks.

JerryChang · March 20, 2023, 3:12am

hello sanaurrehman,

may I confirm which Jetpack release version you’re working with?
or would you just like to confirm the test procedure for RootfsA/B redundancy?

here are sample test steps for your reference:

1. please follow [Flashing the Target Board with a Redundant Root File Systems] to enable Rootfs redundancy.
2. please check that both RootfsA/B slots are available with: $ sudo nvbootctrl -t rootfs dump-slots-info
3. try to corrupt the current file system (slot-A); you may delete the whole content: $ sudo rm -rf /*
4. reboot.
5. it should attempt to boot from slot-A 3 times in total, then eventually boot from slot-B.

it’s the WDT (i.e. watchdog) that triggers the system reset. the WDT triggers a warm reboot; the system retries rootfs slot-A and switches to rootfs slot-B after 3 retries.
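
To make the described sequence concrete, here is a toy model of the retry logic in Python. It is only a sketch of the behaviour described above (three attempts on the active slot, then a switch to the other slot), not NVIDIA's bootloader code; the `Slot` class, the `boot` function, and the `watchdog_fires` flag are hypothetical names made up for illustration.

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # from the thread: three attempts before switching slots


@dataclass
class Slot:
    name: str
    healthy: bool                    # True if the rootfs on this slot can boot
    retry_count: int = MAX_RETRIES
    status: str = "normal"           # "normal" or "unbootable"


def boot(slots, active, watchdog_fires=True):
    """Toy model: return (name of the slot that booted or None, attempted slot names)."""
    attempts = []
    # Try the active slot first, then the remaining slot(s).
    order = [active] + [i for i in range(len(slots)) if i != active]
    for idx in order:
        slot = slots[idx]
        while slot.status == "normal" and slot.retry_count > 0:
            slot.retry_count -= 1
            attempts.append(slot.name)
            if slot.healthy:
                slot.retry_count = MAX_RETRIES  # successful boot restores the counter
                return slot.name, attempts
            if not watchdog_fires:
                # The boot hangs and nothing resets the board: no further attempts.
                return None, attempts
        slot.status = "unbootable"  # retries exhausted (or slot already unbootable)
    return None, attempts


if __name__ == "__main__":
    slots = [Slot("A", healthy=False), Slot("B", healthy=True)]
    print(boot(slots, active=0))
    print(slots[0])
```

With `watchdog_fires=False` the model stops after a single hung attempt, which is one way to picture a boot that never fails over because nothing resets the board.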

when a rootfs slot is corrupted, for example slot-A, you’ll see that slot-A is marked as “unbootable”.
assuming the corrupted file system has been repaired, you can set the OS chain A status back to “normal” in the UEFI menu,
i.e. UEFI Menu → Device Manager → NVIDIA Configuration → L4T Configuration → OS chain A status.

you may also refer to the forum topics below:
Setting bootable and unbootable AB rootFS slots for L4T r35.1
Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System

sanaurrehman · March 20, 2023, 8:46am

Thank you for the reply @JerryChang.

I am currently using Jetpack 5.0.2 (However, as mentioned, I have tested the RootfsA/B redundancy on Jetpack 4.6 and Jetpack 5.1 as well).

My question was specific to corrupting slot A from slot A: why won’t the system fail over if I use slot A to corrupt slot A itself and then reboot? Currently I am guessing that this somehow interferes with the failover mechanism, but I am still waiting for a solid answer from someone at NVIDIA (or anyone else). Could you help me understand why the failover is not working in this specific scenario?

Additionally, I also need help regarding recovering slot status once the failover mechanism takes place. I have taken a look at the following thread:

Even though the thread is for Xavier-NX, I assumed that the process would be similar for Jetson-AGX-Xavier, and tried to run the following commands in an attempt to recover boot status of unbootable slot.

# cd /sys/firmware/efi/efivars/
# printf "\x07\x00\x00\x00" > /tmp/var_tmp.bin
# printf "\x3c\xc0\x01\x00" >> /tmp/var_tmp.bin
# chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# dd if=/tmp/var_tmp.bin of=RootfsInfo-781e084c-a330-417c-b678-38e696380cb9; sync
# chattr +i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# reboot

However, when I run the first chattr command (chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9), I get the message:
chattr: Read-only file system while setting flags on RootfsInfo-781e084c-a330-417c-b678-38e696380cb9.

Is the above-mentioned method supported for Jetson-AGX-Xavier as well? If not, then what method could be used to set the status of the unbootable slot back to bootable again?

If the method is supported for AGX-Xavier as well, then how can I change the attributes of the read-only file RootfsInfo-781e084c-a330-417c-b678-38e696380cb9, as the chattr command mentioned above doesn’t seem to be working?
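
For readers following along, the two printf commands above build an 8-byte file. In Linux efivarfs, the first four bytes of a variable file hold the UEFI attribute mask (little-endian) and the remaining bytes are the variable's data. The sketch below only reproduces and splits those bytes; the meaning of the individual bits in the RootfsInfo data word is NVIDIA-specific and is not decoded here. The function names are hypothetical.

```python
import struct

# Standard UEFI variable attribute bits.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4


def build_payload():
    """Same bytes as the two printf commands in the post, in the same order."""
    attributes = bytes([0x07, 0x00, 0x00, 0x00])  # printf "\x07\x00\x00\x00"
    data = bytes([0x3C, 0xC0, 0x01, 0x00])        # printf "\x3c\xc0\x01\x00"
    return attributes + data


def split_payload(payload):
    """Split an efivarfs-style payload into (attribute mask, 32-bit data word)."""
    attributes, value = struct.unpack("<II", payload)
    return attributes, value


if __name__ == "__main__":
    payload = build_payload()
    attributes, value = split_payload(payload)
    print(payload.hex())
    print(hex(attributes), hex(value))
```

The attribute mask 0x07 is the combination of the three flags defined at the top of the block, which is the usual set for a persistent variable that is also accessible at runtime.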

sanaurrehman · March 20, 2023, 12:45pm

Update: I managed to resolve the earlier issue of Read-only file system with the chattr command by remounting the /sys/firmware/efi/efivars directory as read-write using the command: mount -o remount,rw /sys/firmware/efi/efivars.

Now, I am able to execute the commands up to the dd command. When I enter the dd command, the system goes into a kernel panic.

How can I resolve this issue so that I can mark slot A as bootable again?

For reference, here are the steps that I followed:

1. Flashed RootfsA/B. Confirmed that it was working correctly using nvbootctrl.
2. Booted to slot B. Corrupted slot A from slot B, and then rebooted to slot A using nvbootctrl.
3. The failover to slot B takes place as slot A is corrupted. dump-slots-info shows slot A as unbootable, with retry count = 0.
4. In slot B, I copy the rootfs of slot A into mmcblk0p1 (APP partition).
5. Run the above-mentioned commands to attempt to set slot A as bootable again. However, running the dd command sends the system into a kernel panic.

Note: I am using Jetpack 5.0.2. I tried reflashing and trying again, but the issue is still the same. The objective here is to recover slot A after it has been corrupted.
Any help would be much appreciated. Thanks!
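
The slot check in step 3 can be scripted. The sketch below parses slot lines from text shaped like the output of `sudo nvbootctrl -t rootfs dump-slots-info`. The `SAMPLE` text is an assumption about that output format, written to match the state described above (slot A unbootable with retry count 0); the exact layout may differ between releases, so treat the regular expression as a starting point. The function names are hypothetical.

```python
import re

# ASSUMED sample output; the real layout may differ between L4T releases.
SAMPLE = """\
Current rootfs slot: B
Active rootfs slot: B
num_slots: 2
slot: 0,               retry_count: 0,         status: unbootable
slot: 1,               retry_count: 3,         status: normal
"""

# Matches only the per-slot lines, which carry a numeric slot index.
SLOT_RE = re.compile(r"slot:\s*(\d+),\s*retry_count:\s*(\d+),\s*status:\s*(\w+)")


def parse_slots(text):
    """Return {slot index: {"retry_count": int, "status": str}}."""
    return {
        int(index): {"retry_count": int(retries), "status": status}
        for index, retries, status in SLOT_RE.findall(text)
    }


def unbootable_slots(slots):
    """Return the sorted indices of slots whose status is not "normal"."""
    return sorted(i for i, info in slots.items() if info["status"] != "normal")


if __name__ == "__main__":
    slots = parse_slots(SAMPLE)
    print(slots)
    print(unbootable_slots(slots))
```

On a device, the text would come from running the nvbootctrl command mentioned in the thread instead of from `SAMPLE`.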

JerryChang · March 21, 2023, 5:43am

hello sanaurrehman,

please refer to Topic 245215 for the device tree changes that enable the watchdog to trigger a system reset.
as you know, we’ve had several forum posts about RootfsA/B redundancy. we’re also checking this internally for solid test approaches, documentation refinement, etc.

sanaurrehman · March 21, 2023, 5:48am

@JerryChang, I have seen the mentioned topic. It is for Jetpack 5.1, in which the watchdog is disabled by default. I am currently using Jetpack 5.0.2, which does not have any watchdog issue.

Also, for me, running “cat /proc/device-tree/watchdog@30c0000/status” reports the watchdog status as okay. So there is no issue with the watchdog.
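
The same check can be done from a script. This is a minimal sketch, assuming the node path quoted above; device tree string properties are NUL-terminated, and by device tree convention a node without a `status` property counts as enabled. The function names are hypothetical. Note that an "okay" status only shows that the node is enabled in the device tree; it does not by itself prove that the watchdog will reset the board after a kernel panic.

```python
from pathlib import Path

# Node path taken from the post above; it may differ on other boards or releases.
WATCHDOG_NODE = "/proc/device-tree/watchdog@30c0000"


def read_dt_string(path):
    """Read a device tree string property and strip its trailing NUL byte(s)."""
    return Path(path).read_bytes().rstrip(b"\x00").decode("ascii")


def watchdog_enabled(node=WATCHDOG_NODE):
    """Return True if the node exists and its status is okay (or absent)."""
    node = Path(node)
    if not node.is_dir():
        return False  # no such node in the running device tree
    status = node / "status"
    if not status.exists():
        return True   # device tree convention: a missing status means "okay"
    return read_dt_string(status) in ("okay", "ok")


if __name__ == "__main__":
    print("watchdog node enabled:", watchdog_enabled())
```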

However, I am currently having trouble recovering the status of slot A after it has been corrupted. How can I reset its status (in nvbootctrl) from unbootable back to normal?

Regards,
Sana Ur Rehman

JerryChang · March 21, 2023, 6:22am

you should update the slot settings via UEFI menu.
i.e. UEFI Menu → Device Manager → NVIDIA Configuration → L4T Configuration → OS chain A status.

sanaurrehman · March 21, 2023, 6:23am

@JerryChang , once again, the method you mention is for Jetpack 5.1. How can I do the same in Jetpack 5.0.2?
(Kindly read my third comment again)

When I try to run the following commands, I get a kernel panic (please see the attached file for more details):
KernelPanic.txt (3.5 KB)

# cd /sys/firmware/efi/efivars/
# printf "\x07\x00\x00\x00" > /tmp/var_tmp.bin
# printf "\x3c\xc0\x01\x00" >> /tmp/var_tmp.bin
# chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# dd if=/tmp/var_tmp.bin of=RootfsInfo-781e084c-a330-417c-b678-38e696380cb9; sync
# chattr +i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# reboot

Note: Removing the file RootfsInfo-781e084c-a330-417c-b678-38e696380cb9 also results in kernel panic.

JerryChang · March 23, 2023, 2:50pm

hi all,

we’ve tested this locally; so far, only Xavier-NX-eMMC with l4t-r35.2.1 is able to perform RootfsA/B redundancy successfully.

there’s an issue with running RootfsA/B redundancy on AGX Xavier.
we’re tracking this internally and will update the status once we reach a conclusion.

sanaurrehman · March 24, 2023, 4:37am

Thanks for the update @JerryChang . Hoping for a quick solution.

Regards,
Sana Ur Rehman

sjj · October 4, 2024, 12:04pm

Hi @JerryChang, has this issue been resolved? It has been 18 months since this post.

Thanks.

JerryChang · October 7, 2024, 3:13am

no, Rootfs-A/B redundancy doesn’t work for all scenarios of root file system corruption.
for example, Rootfs-A/B failover fails when $ sudo rm -rf /* is used to simulate a slot corruption.
the issue is that the kernel panic doesn’t trigger the watchdog to warm reboot the device.

sanaurrehman · October 7, 2024, 4:09am

Hi @sjj . The method defined by Kevin in this post works for AGX Xavier using Jetpack 5.1.2:

It doesn’t work for earlier Jetpack versions. (The endless reboot issue mentioned here was resolved in Jetpack 5.1.2)

system · November 6, 2024, 3:32am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
