Cboot rollback bug with RT PREEMPT Kernel on TX2

Hello,

I am encountering an issue with cboot and a real-time kernel in Jetpack 32.5.2 on a Jetson TX2. It appears that cboot intermittently reverts to the other rootfs partition when I’m using a Linux kernel compiled with the RT PREEMPT patches.

Here is the current setup: I have a Jetson TX2 running a Yocto distribution based on meta-tegra in the Dunfell 32.5.2 branch. This distribution handles software updates with Mender.

Here’s the testing scenario: I’ve conducted a cycle of 100 Mender updates on the device. Each update triggers an A/B rootfs partition switch. However, at some point during a reboot, the device rolls back to the previous partition. I’ve monitored the logs on the serial port to check if the error might be related to Mender, cboot, or the Linux kernel. The issue is that the rollback error occurs before any output appears on the serial port. This suggests that the error occurs either before or during cboot execution since cboot typically logs information on the serial port during the boot process. There are no error indications in the logs for the Linux kernel or the Mender update stages.
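As a minimal sketch of how a rollback could be made visible from the running system, the root device on the kernel command line can be recorded before and after each update and compared. The helper name `root_from_cmdline` is hypothetical, and the sketch assumes the kernel command line carries a `root=` argument.

```shell
#!/bin/sh
# Hypothetical helper: print the value of root= from a kernel command line.
# Reads /proc/cmdline by default; a string can be passed instead for testing.
root_from_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do
        case "$arg" in
            root=*)
                printf '%s\n' "${arg#root=}"
                return 0
                ;;
        esac
    done
    # No root= argument found.
    return 1
}
```

On the device, calling `root_from_cmdline` after each reboot and comparing the result with the partition Mender was expected to switch to would show whether a rollback happened. If `nvbootctrl` is installed on the image, `nvbootctrl get-current-slot` should report the active boot slot as well.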

To replicate the error, you can use the meta-tegra demo distribution and their test script for Mender updates. Make sure you have a Jetson TX2 development kit connected to the local network. Here are the steps to build and flash the OS with a RT Kernel:

clone the demo distro (OE4T/tegra-demo-distro on GitHub) and check out the dunfell-l4t-r32.5.0 branch

git clone https://github.com/OE4T/tegra-demo-distro
cd tegra-demo-distro
git checkout dunfell-l4t-r32.5.0
git submodule update --init

apply RT patches to the demo distro

source setup-env --machine jetson-tx2-devkit --distro tegrademo-mender build
devtool modify linux-tegra
cd workspace/sources/linux-tegra/
./scripts/rt-patch.sh apply-patches

build the image

bitbake demo-image-base

put the Jetson in recovery mode, plug in the USB cable, use lsusb to check that the NVIDIA device is present, then flash the image:

cd tmp/deploy/images/jetson-tx2-devkit/
mkdir flash
tar -C flash -xvf demo-image-base-jetson-tx2-devkit.tegraflash.tar.gz
cd flash
sudo ./doflash.sh

connect the Jetson to the network; it will get an IP address via DHCP. Find it with nmap, or use the serial connection, which gives you a shell. Log in as root (no password) and check the Mender and kernel versions (look for the RT tag):

root@j140-tx2-d02:~# mender --version
2.6.1 runtime: go1.14.15
root@j140-tx2-d02:~# uname -a
Linux j140-tx2-d02 4.9.201-rt134-l4t-r32.5+g618f59196be6 #1 SMP PREEMPT RT Fri Sep 8 09:18:27 UTC 2023 aarch64 GNU/Linux
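The RT check above can also be scripted. This is a small sketch that looks for the PREEMPT RT tag in the kernel version string; the helper name `is_rt_kernel` is hypothetical and not part of meta-tegra or Mender.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a `uname -v` style string reports an RT kernel.
# Uses the running kernel's version string when no argument is given.
is_rt_kernel() {
    version="${1:-$(uname -v)}"
    case "$version" in
        *"PREEMPT RT"*|*PREEMPT_RT*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_rt_kernel && echo "RT kernel" || echo "non-RT kernel"` could be run on the device before starting the stress test, to confirm the patched kernel is the one that booted.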

on your laptop, setup a web server for the stress test

cd tmp/deploy/images/jetson-tx2-devkit/
# or cd .. from last step
python3 -m http.server 8080

in a new terminal, run the stress test to trigger the unwanted rollback

cd tegra-demo-distro/layers/meta-mender-tegra/scripts/test
python3 -m pip install -r requirements.txt
./mender_tegra_test.py --test mender_torture --device <Jetson_IP> --mender_install http://<Laptop_IP>:8080/demo-image-base-jetson-tx2-devkit.mender 2>&1 | tee -a logfile.log

At some point, the script should crash due to the rollback occurring. You can monitor the process on the serial connection. Could you please investigate this error to understand the behavior on the TX2 with a real-time kernel?

If you have any further questions or need additional information, please let me know.

Perceval

Hi,

32.5.2 is quite an old version, and we do not suggest using it anymore.
Yocto is also not officially maintained by NVIDIA, so there might be something different in behavior between a Yocto build and an official L4T build.
So I’d suggest checking if the situation also happens on the latest 32.7.4 L4T build.

Hi DaveYYY,

Thank you for your answer.

Actually, we cannot update our devices as they are already in production. A JetPack update would imply a change in the partition layout that would force us to reflash all the devices by hand. It is impossible to do that because of the volume and the way the devices are mounted in production.

That’s why we need to investigate this issue on this specific version.
How can we move forward on this?

Hi,

Please at least make sure it also happens on our official L4T build (32.5.2).
We don’t support issues happening on Yocto build.

If that means switching to the official L4T build is also not feasible, then there is little we can do.

Hi,

I’m trying to conduct some tests with the latest Yocto build version that supports the TX2 and with pure L4T 32.5.2.

One of the problems I have is that I cannot get any logs during the rollback phase. Is there any output or method to get more logs from cboot or earlier boot process steps?

Because when I get logs on the serial interface it is already too late and nothing in there is really helpful.
Could you help me get some logs?

Hi,

Did you remove quiet from /boot/extlinux/extlinux.conf?
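A minimal sketch of that change, assuming the image actually boots through /boot/extlinux/extlinux.conf and that `quiet` appears as a standalone token on the APPEND line. The helper name `remove_quiet` is hypothetical. Note that `quiet` is a kernel parameter, so removing it only makes the kernel's console output more verbose.

```shell
#!/bin/sh
# Hypothetical helper: drop the standalone `quiet` token from APPEND lines.
# Keeps a backup of the original file next to it as <file>.bak.
remove_quiet() {
    conf="${1:-/boot/extlinux/extlinux.conf}"
    cp "$conf" "$conf.bak" || return 1
    # Only touch APPEND lines; keep the separator that followed `quiet`.
    sed -E '/^[[:space:]]*APPEND/ s/[[:space:]]+quiet([[:space:]]|$)/\1/' \
        "$conf.bak" > "$conf"
}
```

Running `remove_quiet` as root on the device and rebooting should then show the full kernel log on the serial console.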