Need Help in Understanding Failover in RootFS A/B redundancy

Hi.

I need help understanding the root file system A/B redundancy feature. After reading the NVIDIA documentation and browsing several guides and posts about rootfs A/B redundancy, I developed the following understanding of it:

In rootfs A/B redundancy, the system contains two root file systems. These two ‘slots’ are useful so that if one of the slots is corrupted/not bootable, the system can failover to the other slot (thus increasing the reliability of the Jetson in a production environment).

To implement this feature, I flashed my Jetson-AGX-Xavier with rootfs A/B redundancy using Jetpack 4.6. Then, to check whether this would work in a real data corruption scenario, I booted into slot A, corrupted that same slot (slot A), and then hard rebooted the system (by power cycling it). My reasoning was that in a real application we would be running from one slot (say slot A), and if that slot were corrupted, for example by a sudden power failure, the system should boot into slot B.

However, the above mentioned test failed. The system would be forever stuck trying to boot from slot A (when slot A is clearly corrupted), and would not failover to slot B.

I carried out the same procedure to test the A/B redundancy feature on Jetpack 5.0.2 and on Jetpack 5.1 as well. However, every time I ran into the same outcome: the system would attempt to boot from the corrupted slot, stay stuck there forever, and never boot from the other slot.

Finally, on Jetpack 5.0.2, I tried booting into slot B, then mounting and removing slot A file system, and then using nvbootctrl to boot to slot A. In this case, the system does indeed failover to slot B after failing to boot from slot A.

My question is this: Is the first method (working in slot A and corrupting/removing the file system of slot A) the correct way to test rootfs A/B redundancy? Isn’t this the most accurate representation of real-life data corruption? Or was my test method wrong because I misunderstood something, and the second method (using slot B to remove slot A, then booting to slot A) is the correct way?

Any guidance on this would be appreciated. Thanks.

hello sanaurrehman,

may I confirm which Jetpack release version you’re working with?
or… would you just like to confirm the test procedure for RootfsA/B redundancy?

here’re sample test steps for your reference,

  1. please follow [Flashing the Target Board with a Redundant Root File Systems] to enable Rootfs redundancy.
  2. please check that both RootfsA/B slots are available with… $ sudo nvbootctrl -t rootfs dump-slots-info
  3. Try to corrupt the current file system (slot-A); you may delete its whole content with $ sudo rm -rf /*
  4. Reboot
  5. It should attempt to boot from slot-A a total of 3 times, then eventually boot from slot-B.

it’s the WDT (i.e. watchdog) that triggers the system reset. the WDT triggers a warm reboot; the system retries rootfs slot-A and switches to rootfs slot-B after 3 failed retries.
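
The retry sequence described above can be pictured with a small toy model. This is only an illustration of the stated behaviour (three attempts on slot A, then a switch to slot B); it is not NVIDIA's bootloader logic, and the function name `simulate_failover` and its arguments are hypothetical.

```shell
#!/bin/sh
# Toy model of the described failover: each failed boot (ended by a
# watchdog reset) uses up one retry; when none are left, the slot is
# treated as unbootable and the other slot is tried.
# simulate_failover is a hypothetical helper, not an NVIDIA tool.
simulate_failover() {
    max_retries=$1      # attempts allowed on slot A (3 in the reply above)
    slot_a_boots=$2     # "yes" if slot A's rootfs can boot, else "no"
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        if [ "$slot_a_boots" = "yes" ]; then
            echo "booted slot A on attempt $attempt"
            return 0
        fi
        echo "slot A attempt $attempt failed, watchdog reset"
        attempt=$((attempt + 1))
    done
    echo "slot A marked unbootable, booting slot B"
}

simulate_failover 3 no
```

The model assumes that every failed boot ends in a watchdog reset. If a failed boot hangs without a reset, the loop above never advances, which would match a system that stays stuck on the corrupted slot.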

when a rootfs slot is corrupted, for example slot-A, you’ll see that slot-A is marked as “unbootable”.
assuming the corrupted file system has been repaired, you can set the OS chain A status back to “normal” in the UEFI menu.
i.e. UEFI Menu → Device Manager → NVIDIA Configuration → L4T Configuration → OS chain A status.

you may also refer to the forum topics below.
Setting bootable and unbootable AB rootFS slots for L4T r35.1
Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System

Thank you for the reply, @JerryChang.

I am currently using Jetpack 5.0.2 (However, as mentioned, I have tested the RootfsA/B redundancy on Jetpack 4.6 and Jetpack 5.1 as well).

My question was specific to corrupting slot A from within slot A: why won’t the system fail over if I use slot A to corrupt slot A itself and then reboot? Currently I am guessing that this interferes with the failover mechanism somehow, but I am still waiting for a solid answer from someone at NVIDIA (or anyone else). Could you help me understand why the failover is not working in this specific scenario?

Additionally, I also need help regarding recovering slot status once the failover mechanism takes place. I have taken a look at the following thread:

Even though the thread is for Xavier-NX, I assumed that the process would be similar for Jetson-AGX-Xavier, and tried to run the following commands in an attempt to recover boot status of unbootable slot.

# cd /sys/firmware/efi/efivars/
# printf "\x07\x00\x00\x00" > /tmp/var_tmp.bin
# printf "\x3c\xc0\x01\x00" >> /tmp/var_tmp.bin
# chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# dd if=/tmp/var_tmp.bin of=RootfsInfo-781e084c-a330-417c-b678-38e696380cb9; sync
# chattr +i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# reboot
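
Before writing to the variable, it may help to build the 8-byte payload separately and inspect it. A minimal sketch, assuming the usual efivarfs layout in which the first four bytes of the file are the UEFI variable attributes (0x07 here, i.e. non-volatile, boot-service access and runtime access) and the remaining bytes are the variable data; the meaning of the individual data bits is not covered here. Octal escapes are used because `\x` escapes are not part of POSIX printf, and the scratch file path is arbitrary.

```shell
#!/bin/sh
# Build the same 8 bytes as the printf commands above, using octal
# escapes so the result does not depend on \x support in printf.
payload=$(mktemp)                              # arbitrary scratch file
printf '\007\000\000\000' >  "$payload"        # 4-byte attributes: 0x00000007
printf '\074\300\001\000' >> "$payload"        # 4-byte data, little-endian: 0x0001c03c

# Dump the bytes as hex so they can be checked before any write.
payload_hex=$(od -An -v -tx1 "$payload" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
payload_size=$(wc -c < "$payload" | tr -d ' ')
echo "bytes: $payload_hex"
echo "size:  $payload_size"
rm -f "$payload"
```

If the dump does not show exactly eight bytes starting with `07 00 00 00`, the shell's printf did not interpret the escapes as intended.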

However, when I run the first chattr command (chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9), I get the message:
chattr: Read-only file system while setting flags on RootfsInfo-781e084c-a330-417c-b678-38e696380cb9.

Is the above mentioned method supported for Jetson-AGX-Xavier as well? If not, then what method could be used to set the status of the unbootable slot back to bootable again?

If the method is supported on AGX-Xavier as well, then how can I change the attribute of the read-only file RootfsInfo-781e084c-a330-417c-b678-38e696380cb9? The chattr command shown above doesn’t seem to work.

Update: I managed to resolve the earlier issue of Read-only file system with the chattr command by remounting the /sys/firmware/efi/efivars directory as read-write using the command: mount -o remount,rw /sys/firmware/efi/efivars.
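
To confirm that the remount took effect, the mount options can be read back. A minimal sketch, assuming a Linux system where `/proc/mounts` lists the efivarfs mount; `efivarfs_mode` is a hypothetical helper, and the mounts file is a parameter so the function can also be tried on sample input.

```shell
#!/bin/sh
# Print "rw" or "ro" for the efivarfs mount listed in a mounts-format
# file (default /proc/mounts); print "absent" if it is not listed.
# efivarfs_mode is a hypothetical helper name.
efivarfs_mode() {
    awk -v mp="/sys/firmware/efi/efivars" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = "ro"
            for (i = 1; i <= n; i++) if (opts[i] == "rw") mode = "rw"
            found = 1
        }
        END { print (found ? mode : "absent") }
    ' "${1:-/proc/mounts}"
}

if [ -r /proc/mounts ]; then
    efivarfs_mode
fi
```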

Now I am able to execute the commands up to the dd command. When I enter the dd command, the system goes into a kernel panic.

How can I resolve this issue so that I can mark slot A as bootable again?

For reference, here are the steps that I followed:

  1. Flashed RootfsA/B. Confirmed that it was working correctly using nvbootctrl.
  2. Booted to slot B. Corrupted slot A from slot B, and then rebooted to slot A using nvbootctrl.
  3. The failover to slot B took place because slot A was corrupted. dump-slots-info showed slot A as unbootable, with retry count = 0.
  4. In slot B, copied the slot A rootfs onto mmcblk0p1 (the APP partition).
  5. Ran the commands above to try to set slot A as bootable again. However, running the dd command sends the system into a kernel panic.

Note: I am using Jetpack 5.0.2. I tried reflashing and repeating the steps, but hit the same issue. The objective here is to recover slot A after it has been corrupted.
Any help would be much appreciated. Thanks!

hello sanaurrehman,

please refer to Topic 245215 for the device tree changes that enable the watchdog to trigger a system reset.
as you know… we’ve had several forum posts about RootfsA/B redundancy. we’re also checking this internally for solid test approaches, documentation refinement, etc.

@JerryChang , I have seen the mentioned topic. It is for Jetpack 5.1, in which watchdog is disabled by default. I am currently using Jetpack 5.0.2, which does not have any watchdog issue.

Also, for me, running “cat /proc/device-tree/watchdog@30c0000/status” reports the watchdog status as okay, so there is no issue with the watchdog.
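
The same check can be wrapped so that the trailing NUL byte of the device-tree string does not get in the way of a comparison. A minimal sketch; `wdt_status` is a hypothetical helper, and the default path is the one quoted above, which may differ on other boards or releases.

```shell
#!/bin/sh
# Print a device-tree "status" property without its trailing NUL byte.
# String properties under /proc/device-tree are NUL-terminated.
# wdt_status is a hypothetical helper name.
wdt_status() {
    node=${1:-/proc/device-tree/watchdog@30c0000/status}
    if [ -r "$node" ]; then
        tr -d '\000' < "$node"
        echo
    else
        echo "missing"
    fi
}

if [ "$(wdt_status)" = "okay" ]; then
    echo "watchdog node is enabled in the device tree"
else
    echo "watchdog node is not reported as okay"
fi
```

An "okay" status only shows that the node is enabled in the device tree; it does not by itself show that a kernel panic leads to a watchdog reset.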

However, I am currently having trouble recovering the status of slot A after it has been corrupted. How can I reset its status from unbootable back to normal (as reported by nvbootctrl)?

Regards,
Sana Ur Rehman

you should update the slot settings via the UEFI menu.
i.e. UEFI Menu → Device Manager → NVIDIA Configuration → L4T Configuration → OS chain A status.

@JerryChang , once again, the method you mention is for Jetpack 5.1. How can I do the same in Jetpack 5.0.2?
(Kindly read my third comment again)

When I try to run the following commands, I get a kernel panic (please see the attached file for more details):
KernelPanic.txt (3.5 KB)

# cd /sys/firmware/efi/efivars/
# printf "\x07\x00\x00\x00" > /tmp/var_tmp.bin
# printf "\x3c\xc0\x01\x00" >> /tmp/var_tmp.bin
# chattr -i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# dd if=/tmp/var_tmp.bin of=RootfsInfo-781e084c-a330-417c-b678-38e696380cb9; sync
# chattr +i RootfsInfo-781e084c-a330-417c-b678-38e696380cb9
# reboot

Note: Removing the file RootfsInfo-781e084c-a330-417c-b678-38e696380cb9 also results in kernel panic.

hi all,

we’ve tested this locally. so far, only Xavier-NX-eMMC with l4t-r35.2.1 is able to perform RootfsA/B redundancy successfully.

there’s an issue with running RootfsA/B redundancy on AGX Xavier.
we’re tracking this internally and will update the status once we reach a conclusion.

Thanks for the update @JerryChang . Hoping for a quick solution.

Regards,
Sana Ur Rehman

Hi @JerryChang, has this issue been resolved? It has been 18 months since this post.

Thanks.

no, Rootfs-A/B redundancy doesn’t work for all scenarios of root file system corruption.
for example, Rootfs-A/B failover fails when $ sudo rm -rf /* is used to simulate a slot corruption.
the issue is that the kernel panic doesn’t trigger the watchdog to warm reboot the device.

Hi @sjj . The method defined by Kevin in this post works for AGX Xavier using Jetpack 5.1.2:

It doesn’t work for earlier Jetpack versions. (The endless reboot issue mentioned here was resolved in Jetpack 5.1.2)
