Rootfs slot falls to unbootable status

JetPack version is 6.0; Jetson Linux version is r36.3.
I enabled Rootfs A/B on a Jetson AGX Orin. The system runs on Slot A by default.
During use, after one particular reboot, I found that the system had switched to Slot B and that the status of Slot A had become unbootable.

:~$ sudo nvbootctrl -t rootfs dump-slots-info
[sudo] password for pilot:
Current rootfs slot: B
Active rootfs slot: B
num_slots: 2
slot: 0,             retry_count: 0,             status: unbootable
slot: 1,             retry_count: 3,             status: normal

Even when manually setting Slot A as the active boot slot using:

sudo nvbootctrl set-active-boot-slot 0  

the system still could not switch back to Slot A.

I mounted the APP partition (mmcblk0p1) of Slot A under /tmp and checked the following files:

boot/Image
boot/extlinux/extlinux.conf
boot/tegra234-p3737-0000+p3701-0004-*.dtb

All necessary files exist, and their MD5 checksums are correct.
The extlinux.conf file contains root=PARTUUID=2534d15b-e6fb-4265-b913-e042ab311f4f, which correctly matches the PARTUUID of the APP partition.
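
The PARTUUID comparison described above can be scripted so it is repeatable. This is a minimal sketch, not an NVIDIA tool: the helper names `extlinux_root_partuuid` and `partuuid_matches` are hypothetical, and it assumes the value appears as `root=PARTUUID=...` on a non-comment line of extlinux.conf.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Jetson Linux).

# Print the PARTUUID from the first non-comment line that contains
# root=PARTUUID=..., or print nothing if there is no such line.
extlinux_root_partuuid () {
        sed -n -e '/^[[:space:]]*#/d' \
               -e 's/.*root=PARTUUID=\([0-9A-Fa-f-]*\).*/\1/p' "$1" | head -n 1
}

# $1: path to extlinux.conf, $2: expected PARTUUID.
# Returns 0 when the value in the file equals the expected one, 1 otherwise.
partuuid_matches () {
        found=$(extlinux_root_partuuid "$1")
        [ -n "$found" ] && [ "$found" = "$2" ]
}
```

On the target, assuming Slot A's APP partition is mounted at a hypothetical /tmp/slot_a, it could be used as `partuuid_matches /tmp/slot_a/boot/extlinux/extlinux.conf "$(sudo blkid -s PARTUUID -o value /dev/mmcblk0p1)"`. A match only confirms the configuration file; it does not prove the slot can boot.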

Questions:

  1. If the kernel files are correct, why can’t the system boot into Slot A?
  2. Under what conditions would a slot be marked as unbootable?

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll provide a comprehensive answer to the question.

Issue Summary

The Jetson AGX Orin system, running JetPack version 6.0 and Jetson Linux version r36.3, has Rootfs A/B enabled. The system initially boots into Slot A by default. However, after a certain reboot, the system switches to Slot B, and Slot A becomes inactive. Despite manually setting Slot A as the active boot slot, the system cannot switch back to Slot A. The necessary files, including kernel files, exist and have correct checksums.

Analysis

The issue seems to be related to the Rootfs A/B redundancy feature, which allows the system to switch between two root file systems (Slot A and Slot B) in case of a failure. In this scenario, the system has switched to Slot B, and Slot A is marked as inactive.

Possible Causes

  1. Corrupted or incomplete kernel files: Although the necessary files exist and have correct checksums, there might be a corruption or incompleteness in the kernel files that prevents the system from booting into Slot A.
  2. Incorrect PARTUUID: The extlinux.conf file contains the correct PARTUUID, but there might be an issue with the partition table or the file system that prevents the system from recognizing the correct partition.
  3. Bootloader configuration: The bootloader configuration might be incorrect or corrupted, preventing the system from booting into Slot A.
  4. Hardware or firmware issue: There might be a hardware or firmware issue that prevents the system from booting into Slot A.

Conditions for a slot to be marked as inactive

A slot can be marked as inactive due to various reasons, including:

  1. Boot failure: If the system fails to boot from a slot, it might mark that slot as inactive.
  2. Corrupted or incomplete files: If the necessary files, including kernel files, are corrupted or incomplete, the system might mark the slot as inactive.
  3. Incorrect bootloader configuration: An incorrect bootloader configuration can prevent the system from booting into a slot, leading to it being marked as inactive.
  4. Hardware or firmware issue: A hardware or firmware issue can prevent the system from booting into a slot, causing it to be marked as inactive.

Recommendations

  1. Verify the kernel files: Double-check the kernel files for any corruption or incompleteness.
  2. Check the partition table and file system: Verify the partition table and file system for any issues that might prevent the system from recognizing the correct partition.
  3. Reconfigure the bootloader: Try reconfiguring the bootloader to see if it resolves the issue.
  4. Run a hardware diagnostic: Run a hardware diagnostic to identify any potential hardware or firmware issues.

By following these recommendations, you should be able to identify and resolve the issue preventing the system from booting into Slot A.


hello jingsusu,

this follows the developer guide; see Fail-over Rootfs Slot Switching.
you may also refer to the auto-reply above for the conditions under which a slot is marked as inactive.

also see Managing Rootfs Slots with nvbootctrl.
by default, nvbootctrl operates on the bootloader slots, so you should add the -t option to target the rootfs slots.
for instance, $ sudo nvbootctrl -t rootfs set-active-boot-slot <slot>
besides, please also set up a serial console to gather the complete boot logs for reference.
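
To check the rootfs slot state before and after switching, the dump-slots-info output can be filtered. A minimal sketch, assuming the output keeps the `slot: N, retry_count: N, status: ...` layout shown earlier in this thread; `list_unbootable_slots` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `nvbootctrl -t rootfs dump-slots-info` text on
# stdin and prints the number of every slot whose status is "unbootable".
list_unbootable_slots () {
        awk '/^slot:/ && /status: unbootable/ { sub(/,/, "", $2); print $2 }'
}
```

On the target this would be run as `sudo nvbootctrl -t rootfs dump-slots-info | list_unbootable_slots`. With the output posted above it prints 0, i.e. Slot A.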

hello JerryChang,

The auto-reply answer wasn’t clear enough. For instance, how to verify if bootloader configuration is incorrect?

Is there more detailed documentation on the mechanism for marking a slot as unbootable? The provided guide only mentions the number of retries but doesn’t specify the exact conditions (e.g., which file states or checks) that lead to an unbootable status.

Where is the code that deals with unbootable state detection located in jetson_linux_r36.3?

hello jingsusu,

there are background services that validate the bootloader/rootfs boot status,
for instance,
/etc/systemd/system/l4t-rootfs-validation-config.service
/etc/systemd/system/nv-l4t-bootloader-config.service

hello JerryChang,

I found that l4t-rootfs-validation-config.service comes from nvidia-l4t-init_36.3.0-20240506102626_arm64.deb.

l4t-rootfs-validation-config.service executes /opt/nvidia/l4t-rootfs-validation-config/l4t-rootfs-validation-config.sh during validation.

In /opt/nvidia/l4t-rootfs-validation-config/l4t-rootfs-validation-config.sh:

# This script runs customer specific rootfs validation function to check
# if the root filesystem boots successfully or not.

echo "Checking if the root filesystem boots successfully"

# If the root filesystem fails to boot, make sure to reboot the device,
# so that the nv_update_engine will not update boot status to successful.


# The user-provied rootfs validation script
# Fixed script location and name.
user_rootfs_validation="/usr/sbin/user_rootfs_validation.sh"

# Return:
#  0: success
#  1: failed
#
rootfs_validation ()
{
        if [ -f "${user_rootfs_validation}" ];then
                if "${user_rootfs_validation}"; then
                        # rootfs validate success.
                        return 0
                else
                        # rootfs validate failed.
                        return 1
                fi
        else
                # user rootfs validation script doesn't exist
                return 0
        fi
}

#
# Call rootfs validation function. If failed,
# trigger device reboot
#
if ! rootfs_validation; then
        echo "The root filesystem failed to boot. Will reboot the device."
        reboot
        while true; do sleep 1; done
fi

exit 0

The validation seems to be performed by /usr/sbin/user_rootfs_validation.sh if this file exists; otherwise the validation is skipped.

But I can’t find this file in either tegra_linux_sample-root-filesystem_r36.3.0_aarch64.tbz2 or nvidia-l4t-init_36.3.0-20240506102626_arm64.deb, which means validation is skipped on Orin by default.

So how does l4t-rootfs-validation-config.service change the slot status to unbootable?

hello jingsusu,

those services should be there, please check /etc/systemd/ on your target for confirmation.

I have checked that l4t-rootfs-validation-config.service exists, but /usr/sbin/user_rootfs_validation.sh is missing. So where does user_rootfs_validation.sh come from?

hello jingsusu,

it’s a user-provided rootfs validation script,
as you can see in /opt/nvidia/l4t-rootfs-validation-config/l4t-rootfs-validation-config.sh
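
For illustration, a user-provided validation script could look like the sketch below. Only the exit-code contract (0 means success, non-zero means failure) comes from l4t-rootfs-validation-config.sh; the check itself and the function name `required_files_present` are assumptions chosen as an example.

```shell
#!/bin/sh
# Hypothetical body for /usr/sbin/user_rootfs_validation.sh.
# The policy (required files must exist and be non-empty) is an example only.

# Return 0 if every path passed as an argument is an existing, non-empty
# file; otherwise report the first offending path on stderr and return 1.
required_files_present () {
        for f in "$@"; do
                if [ ! -s "$f" ]; then
                        echo "rootfs validation: missing or empty: $f" >&2
                        return 1
                fi
        done
        return 0
}
```

Installed as the script, its last line could be `required_files_present /boot/Image /boot/extlinux/extlinux.conf`, so that the function's status becomes the script's exit status. Since a failing validation makes the service reboot the device, an overly strict check could itself push a slot toward unbootable.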

by default, fail-over depends on the watchdog timer to warm-reset the system, after which the retry count (ROOTFS_RETRY_COUNT_MAX) is decreased; once 3 retries have been used up, the system switches to the other slot.

let me share part of UART logs for reference..

[    8.635484] Kernel panic - not syncing:
[    8.635489] Attempted to kill init! exitcode=0x00007f00
[    8.635494] CPU: 3 PID: 1 Comm: chroot Not tainted 5.15.163-tegra #1

# truncate kernel panic logs...

...
[0000.062] I> MB1 (version: 1.4.0.5-t234-54845784-83dcb6d1)
[0000.067] I> t234-A01-1-Silicon (0x12347) Prod
[0000.072] I> Boot-mode : Coldboot
[0000.075] I> Entry timestamp: 0x00000000
[0000.078] I> last_boot_error: 0x0
[0000.082] I> BR-BCT: preprod_dev_sign: 0
[0000.085] I> rst_source: 0x2, rst_level: 0x1

The line I> rst_source: 0x2, rst_level: 0x1 means that the system was reset by the WDT.
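
A captured UART log can be scanned for this line automatically. A minimal sketch: `reset_cause_from_log` is a hypothetical helper, and the only mapping it knows is the one stated in this thread (rst_source 0x2 means WDT reset). Any other value is reported without interpretation.

```shell
#!/bin/sh
# Hypothetical helper: reads a UART boot log on stdin and classifies the
# last "rst_source:" value it sees. Only 0x2 (WDT) is interpreted.
reset_cause_from_log () {
        awk '/rst_source:/ {
                for (i = 1; i < NF; i++)
                        if ($i == "rst_source:") { src = $(i + 1); sub(/,/, "", src) }
        }
        END {
                if (src == "") print "unknown (no rst_source line)"
                else if (src == "0x2") print "WDT reset (rst_source " src ")"
                else print "other (rst_source " src ")"
        }'
}
```

For example, `reset_cause_from_log < uart.log`, where uart.log is whatever file the serial console session was saved to.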

hello JerryChang,
So, is this watchdog timer implemented in software or hardware? Can we find the relevant code in jetson_linux_r36.3.0?

We really want to understand:

  1. How the watchdog determines that a rootfs slot is unusable
  2. Under what conditions it decreases the retry count (ROOTFS_RETRY_COUNT_MAX)

hello jingsusu,

it’s a software approach. for instance, once a kernel panic has been reported for a short while, the WDT (watchdog timer) steps in to trigger a system reset, and the retry count (ROOTFS_RETRY_COUNT_MAX) is then decreased for the next boot-up cycle; after 3 retries, the system switches to the other rootfs slot.
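
The retry behaviour described here can be modelled in a few lines. This is a simplified model under stated assumptions, not the UEFI implementation: it assumes the count starts at the maximum, drops by one per WDT-triggered reset, and that reaching 0 marks the slot unbootable and switches to the other slot. `simulate_failover` is a hypothetical name, and the model ignores the reset of the count after a successful boot.

```shell
#!/bin/sh
# Hypothetical model of rootfs fail-over for Slot A (not the real code).
# $1: maximum retry count, $2: number of consecutive failed boots.
simulate_failover () {
        max=$1
        fails=$2
        retry=$max
        i=0
        # Each failed boot (WDT reset) consumes one retry until none are left.
        while [ "$i" -lt "$fails" ] && [ "$retry" -gt 0 ]; do
                retry=$((retry - 1))
                i=$((i + 1))
        done
        if [ "$retry" -eq 0 ]; then
                echo "slot A: retry_count=0 status=unbootable, boot slot B"
        else
                echo "slot A: retry_count=$retry status=normal, boot slot A"
        fi
}
```

`simulate_failover 3 3` ends with retry_count=0 and status unbootable, which matches the dump-slots-info output for slot 0 at the top of this thread.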

hello JerryChang,

Where is this software approach implemented? Is the source code located in a specific software DEB package or in jetson_linux_r36.3.0?

you may also see the UEFI sources, specifically L4TConfiguration.dts, for the RootfsRetryCountMax property.
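
Once the UEFI sources are unpacked locally, the property can be located with a search like the sketch below. The directory argument is whatever path the sources were extracted to (not specified in this thread), and `find_retry_max_property` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: search every L4TConfiguration.dts under directory $1
# for the RootfsRetryCountMax property and print file:line:text matches.
find_retry_max_property () {
        find "$1" -type f -name 'L4TConfiguration.dts' \
                -exec grep -Hn 'RootfsRetryCountMax' {} +
}
```

Usage would be `find_retry_max_property <path-to-uefi-sources>`. It only shows where the property is defined; how the bootloader consumes it still has to be traced in the surrounding source.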