Hi,
We are attempting to implement A/B rootfs redundancy on the Jetson AGX Xavier, so that if the root file system in slot A gets corrupted, the bootloader falls back to slot B, and we can then use fsck or something similar to get the rootfs in slot A running again.
For this purpose, we used Jetpack 4.6 to flash rootfs A/B. The flashing is successful, and after flashing and adding user account details, we could switch between the two rootfs slots using the nvbootctrl utility.
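For reference, the slot switching was done with the standard nvbootctrl sub-commands; the invocations below are a sketch and may differ slightly between Jetpack releases:
sudo nvbootctrl dump-slots-info         # show both slots, the active slot and the retry counts
sudo nvbootctrl set-active-boot-slot 1  # mark slot B (slot 1) as the slot to boot from next
sudo nvbootctrl get-current-slot        # confirm which slot the running system booted from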
However, the whole point of using rootfs redundancy is that in a real scenario, when the rootfs in slot A is unable to boot for whatever reason, slot B can be used instead. To mimic such a scenario, we used the dd command to write zeros to the current slot, slot A (the active slot was also slot A at the time). The dd command we used was as follows:
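A minimal sketch of such a command, assuming the slot A rootfs is on the APP partition mmcblk0p1 (as noted later in this thread), is:
sudo dd if=/dev/zero of=/dev/mmcblk0p1 bs=1k seek=10 count=4k   # write 4 MB of zeros a short distance into the slot A rootfs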
We assumed that this would corrupt the rootfs in slot A, and that the bootloader would eventually boot from slot B after failing three times (the retry counts are at their default values). However, after running the dd command and then rebooting using “sudo shutdown -r now”, the bootloader does not boot from slot B. It tries to boot from slot A and fails, because the rootfs in slot A is corrupted, but it never falls back to slot B even after waiting for up to 15 minutes.
Manually rebooting the board (multiple times) by power cycling it gives a similar result: the system is unable to boot from slot A and does not attempt to boot from slot B.
Could anybody point out what I am doing wrong? According to my understanding, when rootfs redundancy is in place and a slot is not bootable, the reserve/backup/second slot is used to boot the system. (I am fairly new to all of this, so any details would be very much appreciated. Thanks!) (We have a unified bootloader, and bootloader autosync is disabled.) UART log.txt (49.7 KB)
The UART log is in the attached file. Kindly let me know if any more info is required.
Thank you for the reply @JerryChang . I have seen Topic 243516.
However, in topic 243516, the user is able to boot to slot B when slot A is not functional, as can be seen in the first comment by waldman:
“We have already written zeros to A_mb1 and demonstrated the failover to slot B. Worked like a charm!”
However, in our case, it is not working like a charm. In fact, it is not working at all.
Now, as mentioned in my first comment, I tried to test the failover to slot B by:
1. Booting into slot A.
2. Setting the active slot to slot A (so the current and active slot were both slot A; note that slot B was perfectly functional and was marked bootable according to sudo nvbootctrl dump-slots-info).
3. Using the dd command to corrupt a portion of the rootfs in slot A (the partition mmcblk0p1, since the rootfs for slot A is stored in the APP partition, which is mmcblk0p1).
4. Rebooting the system with the command sudo shutdown -r now to see whether the failover mechanism works and the system boots into slot B (as the slot A rootfs should now be corrupted).
However, after rebooting, the system does not boot using slot B, even though it fails to boot using slot A (as expected, since the slot A rootfs was intentionally corrupted). The UART log for this test is in the file “UART log.txt” in the first comment.
Since waldman in Topic 243516 did the failover testing by corrupting the “A_mb1” partition (which I assume is the partition for MB1), I decided to do the same to reproduce waldman’s results. However, I cannot find any partition with the label mb1 when I use the command “ls -al /dev/disk/by-partlabel”. The result of “ls -al /dev/disk/by-partlabel” for me is:
As you can see, there is no partition with the label mb1, or anything similar.
I have since browsed the NVIDIA documentation, and also read the following threads:
where the user seeky15 performs failover to slot B by wiping the entire rootfs in slot A.
Reading these threads and the documentation, it occurred to me that maybe the failover to slot B did not happen in my case because the “extlinux.conf” file in /boot/extlinux was still intact. Additionally, the “kernel” partition, and all other partitions (with the exception of the APP partition, of course), were all OK. It is therefore possible that the bootloader did not find anything wrong with slot A (since the boot flow was fine) and proceeded to boot from slot A. I therefore decided to interrupt the boot flow by removing the “/boot/Image” file, and also corrupting the “kernel” partition using the command:
sudo dd if=/dev/zero of=/dev/mmcblk0p35 bs=1k seek=10 count=4k (The kernel partition is mmcblk0p35)
However, this did not work either. The UART log for this test is attached in the file “UART log after crashing kernel partition with 4MB of zeros.txt”.
Next, I decided to crash the whole “kernel” partition. The size of this partition is 80 MB, so I wrote 80 MB of zeros using the dd command:
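A minimal sketch of such a command, assuming the kernel partition is mmcblk0p35 (as mentioned above) and is 80 MB in size, would be:
sudo dd if=/dev/zero of=/dev/mmcblk0p35 bs=1M count=80   # overwrite the entire 80 MB kernel partition with zeros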
Then, I rebooted using the command “sudo shutdown -r now”. However, this did not work either, with the system still not booting from slot B. The UART log for this test is in the attached file “UART log after crashing kernel partition with 80MB of zeros.txt”.
Finally, since seeky15 wiped the whole rootfs, I corrupted the whole APP partition using the dd command in a similar fashion as described above. However, this did not work either, with the system still not booting from slot B. The log for this test is in the file “UART log after crashing rootfsA partition with 14GB of zeros.txt”.
For all of these tests, I tried waiting up to 15 minutes to see if the system switches to slot B, and also tried multiple boots by power cycling the board.
The answer given by @JerryChang is a bit misleading for your issue.
What he is talking about is Jetpack version 5.1. The threads you found are all about Jetpack 5.x.
You have an issue with 4.x, and I think that should be addressed differently. I am not aware of any failover issues for 4.x, since I have only worked with 5.x.
4.x uses Cboot, 5.x uses UEFI. So the two cannot be compared and require different solutions, I am afraid.
Thank you for this piece of info @seeky15 . Really appreciate it. Hopefully @JerryChang or someone else can give me some clue as to what I am doing wrong.
Update: Tried A/B redundancy with Jetpack 5.1. The flashing goes fine, and after flashing, I can easily switch between the two slots using the nvbootctrl utility.
Next, I tried corrupting the entire rootfs A slot (APP partition) using the dd command, and then rebooted. However, still no luck. The system fails to boot from slot A, and does not fail over to slot B no matter how long I wait or how many times I reboot the system.
The UART log for this test is attached.
Note: Tried Jetpack 5.1 only as a test to see if the method I was using was ok. We would like to stay with Jetpack 4.6 if possible. However, if there is no other choice then we may switch to Jetpack 5.1, but the preference is to get this to work with Jetpack 4.6.
Normally, if one rootfs slot is unbootable, the kernel watchdog will reboot the device, and if it fails to boot into that slot three consecutive times, UEFI will try to boot from the other rootfs slot.
The logic is that the failed boot status is saved in a scratch register; once it reaches 3, UEFI switches the slot and updates the slot status in the UEFI variables, and then the device can boot from the new slot.
However, if you use dd to corrupt the file system, it is not certain that the device will be rebooted by the watchdog. You can also enter the UEFI menu to select the rootfs slot manually.
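For example, the current rootfs slot and its remaining retry count can be inspected from a booted system with the nvbootctrl commands already mentioned in this thread (the exact output fields vary between releases):
sudo nvbootctrl get-current-slot   # which slot the system booted from
sudo nvbootctrl dump-slots-info    # per-slot status, including the retry count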
@JerryChang , I am running Jetpack 4.6. So far, it seems all the answers you have provided are for Jetpack 5.x, which are not helping me. (While I appreciate the help, my current system is on Jetpack 4.6, and I would like to stay on Jetpack 4.6. Otherwise, I would have to reconfigure everything from scratch if I go to some other version of Jetpack).
However, for the sake of testing, I switched to Jetpack 5.1. But the A/B redundancy does not fail over to slot B in Jetpack 5.1 either (as mentioned in my previous comment).
However, in Jetpack 5.1, when connecting a display, we get a shell prompt where we can enter commands (I assume the system booted into recovery mode instead of booting from slot B).
In this shell, I rebooted manually 3 times by entering the “reboot” command, after which the system eventually booted from slot B.
My question is this: why is this not happening automatically, as it is supposed to? If the switch to the backup slot does not happen automatically, then this feature is not useful at all, since in an embedded environment we do not expect to interact with the Jetson. Human intervention is not an option; we need it to fail over to slot B automatically, as it should.
Regarding your last reply about the results being unpredictable when using the dd command: I have also seen the following post by @seeky15 , who is NOT using the dd command but removing the rootfs, and the failover is not working for seeky15 either.
Note that this process of removing the rootfs was working for @seeky15 in Jetpack 5.0.2, as can be seen in the following post:
What changed in Jetpack 5.1? Why does it not work anymore? Additionally, if the dd command method is wrong, then please guide me on how I can test whether the failover mechanism is working (since removing the rootfs does not work either, as evidenced by seeky15’s testing).
And how can we expect it to work in a real environment, where the nature of data corruption is random?
I have also tried corrupting the slot A rootfs starting from about 1 MB in (I was previously corrupting from 10 kB into the APP partition), using sudo dd if=/dev/zero of=/dev/mmcblk0p1 bs=1k seek=1024 count=4k, but with the same result. It still does not fail over to slot B.
Additionally, where is the kernel watchdog stored? Where is the scratch register stored? In the rootfs partition, or in some other partition? Please advise on how to proceed. How do I test this feature if not by using the dd command? (Note: any help would be welcome, whether it is for Jetpack 4.6 or Jetpack 5.x, as I would like to successfully implement this feature first and foremost, even though staying on Jetpack 4.6 is preferred.)
Sorry @sanaurrehman for dragging this into your thread, but I’ve got to respond to this bull**** that applies to 4.x and 5.x.
@JerryChang If the failover mechanism does not work when removing the rootfs, in what case should it work? Is that not the idea of the whole thing? To get your system to boot again if someone turned off the power right after your update mechanism wiped your whole rootfs? Can you give any example of what SHOULD work?
Additionally, you mention 5.x again, despite @sanaurrehman asking for help with 4.x… There is no way to select a boot slot in the UEFI of 5.1? Are we talking about the same OS?
@seeky15 , I agree with your assessment of the situation.
My point exactly.
To select the boot slot in the UEFI of 5.1, you will need to connect an HDMI display and a keyboard. Then, on power-up, when the NVIDIA logo appears, press Esc to enter the UEFI menu. Then go to Device Manager > NVIDIA Configuration > L4T Configuration. From there you can change the boot slot. I haven’t tested this myself, but I think you can boot into the backup slot from there.
However, this human intervention defeats the purpose of having a backup file system in the first place. (The system should fail over to the backup slot automatically, without human intervention. Otherwise, we can’t really use it in a real-life system.)
Please ignore UEFI; that applies only to JP-5.0.1 or later.
And… sorry about the confusion. You know, there are too many Jetpack release versions supported right now.
Let’s focus on rootfs A/B with the JP-4.6 release version.
You should refer to this documentation: Root File System Redundancy.
Cboot will decrease the retry_count of the current rootfs slot.
When the device boots into the rootfs, a background service, nv_update_verifier.service, runs; this service triggers l4t-rootfs-validation-config.service to determine whether the boot was successful.
However, if the validation script does not exist or returns true, that also counts as a successful rootfs boot.
If the rootfs validation passes, then nv_update_verifier.service runs “/usr/sbin/nv_update_engine --verify”; nv_update_engine will then increase the retry_count and update the slot status.
It is nvbootctrl that switches the rootfs slot, and l4t-rootfs-validation-config.service that validates the rootfs.
So, you should also have a script, l4t-rootfs-validation-config.sh, to handle it.
You may find this sample within the release image, i.e. ./rootfs/opt/nvidia/l4t-rootfs-validation-config/l4t-rootfs-validation-config.sh.
Please have a try; looking forward to your test results.
See also these topics: Topic 198703, Topic 197124.
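For example, once the device has booted, you can check whether the verification ran and trigger the verify step manually with the commands referenced above (the exact behaviour may differ between releases):
systemctl status nv_update_verifier.service   # check whether the verifier service ran at boot
sudo /usr/sbin/nv_update_engine --verify      # manually run the verify step
sudo nvbootctrl dump-slots-info               # confirm the slot status / retry count was updated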
The documentation you referred to states that the customer-provided script /usr/sbin/user_rootfs_validation.sh should return zero (or not be defined) for the boot to be marked successful.
You, however, stated that it needs to return true to mark the boot successful. Which one is it? For now, I am assuming the documentation is the one that is correct.
Based on that, I carried out a test. My validation script is this:
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RUNNING THE USER VALIDATION SCRIPT"
echo "RETURNING ZERO, WHICH WILL INDICATE A SUCCESSFUL BOOT"
touch /home/rwra/Desktop/scriptruns
return 0
This script should always return 0, meaning that the boot should be marked successful.
However, this script is not running at all. I flashed the system with rootfs redundancy using Jetpack 4.6, and added the above script to rootfs A as /usr/sbin/user_rootfs_validation.sh.
Then I rebooted the system. The system goes straight to login, but I see no indication that the script runs in the background. There are no “RUNNING THE USER VALIDATION SCRIPT” prints on the terminal, or on the monitor when connecting a display, and no file is created on the Desktop, which means the script is clearly not running. I also logged in and waited a few minutes to see if the script runs after login, but it does not. Why is that? When do these services (nv_update_verifier.service and l4t-rootfs-validation-config.service) run? If these services are running, then why is the script not running?
Update: I manually tried to run nv_update_verifier.service using the command: systemctl start nv_update_verifier.service
The system asks for a password, after which the service runs. However, even after manually running the service, I cannot see any prints, or any file being created on the Desktop. Therefore, the validation script is still not running. Is it possible that l4t-rootfs-validation-config.service is not running? If that is not the cause, what else could be the reason?
Note: Using the command “systemctl list-unit-files --type service --all” to list all the services shows nv_update_verifier.service in the list of available services, but does not show l4t-rootfs-validation-config.service.
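For reference, the state of the individual units and what the verifier logged during boot can be checked with standard systemd commands (the unit names are the ones discussed above):
systemctl status l4t-rootfs-validation-config.service   # shows whether the unit exists and its state
journalctl -b -u nv_update_verifier.service             # boot-time log of the verifier service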
We have tested this locally; so far, only Xavier NX eMMC with l4t-r35.2.1 is able to perform rootfs A/B redundancy successfully.
We can also reproduce the failure with AGX Xavier on r32.7.3 and also on r35.2.1.
This issue is now being tracked internally. We will update the status once we reach a conclusion.