The latest kernel from JetPack 4.6.3 has issues when the preempt-rt patches are applied.
The kernel locks up and reboots.
This can be seen with the JetPack Ubuntu distro as well as with a Yocto distro.
Here are the steps to reproduce the issue with the Ubuntu distro:
docker load -i ./sdkmanager-1.9.2.10899-Ubuntu_18.04_docker.tar.gz
docker run \
-it \
--privileged \
-v /dev/bus/usb:/dev/bus/usb/ \
-v /dev:/dev \
-v /media/$USER:/media/nvidia:slave \
-v /opt/sdkmanager/:/opt/sdkmanager/ \
-v /home/[username]/nvidia_sdk:/home/nvidia/nvidia/nvidia_sdk \
--name JetPack_TX2_Devkit \
--network host \
--entrypoint /bin/bash \
sdkmanager
docker exec -it JetPack_TX2_Devkit /bin/bash
# Download packages
/opt/nvidia/sdkmanager/sdkmanager \
--cli downloadonly \
--logintype devzone \
--product Jetson \
--host \
--targetos Linux \
--version ${2} \
--target JETSON_TX2_TARGETS \
--select 'Jetson OS' \
--flash skip \
--downloadfolder /opt/sdkmanager/ \
--license accept \
--staylogin true \
--datacollection disable \
--exitonfinish
# Flash the stock image and verify the device is working
/opt/nvidia/sdkmanager/sdkmanager \
--cli install \
--offline \
--product Jetson \
--host \
--targetos Linux \
--version ${3} \
--target JETSON_TX2_TARGETS \
--select 'Jetson OS' \
--flash all \
--downloadfolder /opt/sdkmanager/ \
--license accept \
--datacollection disable \
--exitonfinish
# Apply the preempt-rt patches, recompile the kernel and re-flash the device
sudo apt install -y build-essential git-core bc
cd ~/nvidia/nvidia_sdk/JetPack_4.6.3_Linux_JETSON_TX2_TARGETS/Linux_for_Tegra/
./source_sync.sh
# use tegra-l4t-r32.7.3 tag name when queried
cd sources/kernel/kernel-4.9/scripts/
./rt-patch.sh apply-patches
cd ..
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git add *
git commit -m "Applied RT-PREEMPT patches"
mkdir build
export RELEASE_PACKAGEP=/home/nvidia/nvidia/nvidia_sdk/JetPack_4.6.3_Linux_JETSON_TX2_TARGETS/Linux_for_Tegra
export TEGRA_KERNEL_OUT=$RELEASE_PACKAGEP/sources/kernel/kernel-4.9/build
export CROSS_COMPILE=/usr/bin/aarch64-linux-gnu-
export LOCALVERSION=-tegra-rt
make ARCH=arm64 O=$TEGRA_KERNEL_OUT tegra_defconfig
make ARCH=arm64 O=$TEGRA_KERNEL_OUT -j4
cp $TEGRA_KERNEL_OUT/arch/arm64/boot/Image $RELEASE_PACKAGEP/kernel/Image
cp -r $TEGRA_KERNEL_OUT/arch/arm64/boot/dts/ $RELEASE_PACKAGEP/kernel/dtb/
sudo make ARCH=arm64 O=$TEGRA_KERNEL_OUT modules_install INSTALL_MOD_PATH=$RELEASE_PACKAGEP/rootfs/
cd $RELEASE_PACKAGEP/rootfs/
sudo tar --owner root --group root -cjf kernel_supplements.tbz2 lib/modules
sudo mv kernel_supplements.tbz2 $RELEASE_PACKAGEP/kernel/
cd $RELEASE_PACKAGEP
sudo ./apply_binaries.sh
sudo ./flash.sh jetson-tx2-devkit mmcblk0p1
Now the device reboots in an infinite loop before I can make it to the login.
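Before flashing, one quick sanity check is to confirm that the RT option actually ended up in the generated kernel config. This is only a minimal sketch: it assumes the patch set exposes CONFIG_PREEMPT_RT_FULL (the symbol used by the upstream 4.9-rt series) and that the config is at $TEGRA_KERNEL_OUT/.config as in the steps above. `has_kconfig` is a hypothetical helper name, not part of L4T.

```shell
#!/bin/sh
# has_kconfig FILE SYMBOL
# Succeeds (exit 0) if SYMBOL is set to "y" in the kernel config FILE.
# Hypothetical helper, for illustration only.
has_kconfig() {
    grep -q "^$2=y\$" "$1"
}

# Example, run on the build host after `make ... tegra_defconfig`:
#   if has_kconfig "$TEGRA_KERNEL_OUT/.config" CONFIG_PREEMPT_RT_FULL; then
#       echo "RT option is enabled"
#   else
#       echo "RT option is NOT enabled"
#   fi
```

If the check fails, the patches were probably not applied to the tree that was actually built, which would make the rest of the comparison meaningless.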
With the Yocto (meta-tegra) based images that I use for production devices, which are based on this latest kernel, it is a little less obvious, probably because of the different packages installed. But there too the kernel locks up and the device reboots:
the most obvious → type dmesg: immediate lock and reboot
run smartctl: a lock happens somewhere and I lose access to the SATA drive - no reboot but I cannot recover from the error
stepping through the code using gdbserver – random lock and reboot
There is no problem with 4.6.2; this is a new issue.
The issue has already been reported in other topics:
Yes this is correct. The patch just made dmesg not print anything, to prevent the reboot but broke the dmesg functionality. It didn’t help.
I found 2 other ways to get locks, using smartctl and gdbserver. There are probably others, since with the Ubuntu distro the device keeps rebooting before I get a chance to log in.
I just tested with running smartctl on SATA drives, but it did not seem to cause anything abnormal.
Is it something that’s 100% guaranteed to happen?
Can you share the exact commands you use in this use case?
I think it just disables part of dmesg logs, but not all of them? Are you saying you get nothing from dmesg with the patch?
Can you apply the patch and run longer to find if there are other ways to trigger this bug?
For the SATA drive lock, I run the command "/usr/sbin/smart-log --test"; after about a minute, I lose access to the disk. It happens 100% of the time, and that’s using our meta-tegra based distro. I cannot log in to the Ubuntu distro since the device keeps rebooting.
It disables all of them. But I think dmesg is the easiest command to use to find the root cause because the lock happens systematically. Looking at the syslog_print_all source, some locking is happening there. I don’t know how to debug this.
I’m not able to run the Ubuntu distro because of the reboots. With the meta-tegra distro I was only able to find the previously mentioned issues and spent days on it.
I tried the suggested fix with the Ubuntu distro and I get output from dmesg. I don’t with Yocto, and I don’t know why.
I also tried the R32.7.4 kernel with the Yocto distribution and now get different locks during boot; systemd gets blocked and never completes.
This image has the minimum required packages to boot with systemd to a shell.
Here are the steps to build this image:
cd ~
git clone -b dunfell https://github.com/OE4T/tegra-demo-distro.git
cd tegra-demo-distro
git submodule update --init
. ./setup-env --machine jetson-tx2-devkit
devtool modify linux-tegra
cd workspace/sources/linux-tegra/scripts
./rt-patch.sh apply-patches
cd ~/tegra-demo-distro/build
bitbake demo-image-base
cd tmp/deploy/images/jetson-tx2-devkit/
mkdir image
cd image
tar -xvzf ../demo-image-base-jetson-tx2-devkit.tegraflash.tar.gz
# Put the devkit in flashing mode
./doflash.sh
Just to clarify: do you still hit the kernel locks you mentioned previously with our official BSP?
The situation may be different if it only happens on Yocto build.
I figured out why we see dmesg prints in the Ubuntu distribution and not in our Yocto based image. The Ubuntu image uses dmesg from util-linux, while our image uses the busybox implementation.
util-linux reads from /dev/kmsg while busybox uses syslog. So syslog is broken in the latest releases.
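The two read paths can be compared on a single image. The following is only a sketch: it assumes util-linux dmesg is installed, whose `-S`/`--syslog` option forces the syslog(2) interface instead of /dev/kmsg. `klog_lines` and `klog_verdict` are hypothetical helper names. Note that on the unpatched kernel, running dmesg was itself reported above to trigger the lock, so this is only meaningful on a build where dmesg returns.

```shell
#!/bin/sh
# klog_lines CMD [ARGS...]
# Prints how many lines CMD writes to stdout; errors count as zero lines.
klog_lines() {
    "$@" 2>/dev/null | wc -l | tr -d ' '
}

# klog_verdict KMSG_COUNT SYSLOG_COUNT
# Summarises which kernel-log read path returned data.
klog_verdict() {
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        echo "only /dev/kmsg returns data"
    elif [ "$1" -eq 0 ] && [ "$2" -gt 0 ]; then
        echo "only syslog(2) returns data"
    elif [ "$1" -eq 0 ]; then
        echo "neither path returns data"
    else
        echo "both paths return data"
    fi
}

# Example on the target (util-linux dmesg assumed):
#   kmsg=$(klog_lines dmesg)       # default path: /dev/kmsg
#   sysl=$(klog_lines dmesg -S)    # forced path: syslog(2)
#   klog_verdict "$kmsg" "$sysl"
```

A result of "only /dev/kmsg returns data" would be consistent with the observation that the busybox dmesg prints nothing while the util-linux one does.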
Regarding the kernel locks: I don’t see them (dmesg and smartctl) in the latest Ubuntu image, but I cannot compile/run our application due to complex dependencies. I’ll get to that as a last resort.
The yocto build hangs at boot and systemd cannot complete the boot sequence. nvphsd_setup.sh and nvpmodel are stuck. Here are the traces: frozen_boot.txt (81.8 KB)
I have made a yocto build using the stock kernel. The only difference is that the several NVidia source repositories are merged into one to be able to build from the yocto environment. See GitHub - OE4T/linux-tegra-4.9 at l4t-r32.7.4-base
if the official BSP works fine, then please just use it.
Yocto is not officially maintained by NVIDIA, and we do not support issues related to custom Yocto build.
Using the NVidia distro, I reverted commit b3fb2b5173662, enabled more lock debugging flags in the kernel config (.config, 163.3 KB), rebuilt and flashed the device.
I am back at the boot loop but I can see more information.
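For anyone who wants to try the same thing, lock debugging can be enabled with a config fragment. This is only a sketch: the exact options used above are in the attached .config and are not reproduced here, so the list below is an assumption made of common mainline lock-debugging symbols. `write_lock_debug_fragment` is a hypothetical helper name.

```shell
#!/bin/sh
# write_lock_debug_fragment FILE
# Writes a Kconfig fragment with common lock-debugging options to FILE.
# The option list is an assumption, not the poster's actual config.
write_lock_debug_fragment() {
    cat > "$1" <<'EOF'
CONFIG_DEBUG_KERNEL=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DETECT_HUNG_TASK=y
EOF
}

# Example (paths follow the build steps earlier in the thread):
#   write_lock_debug_fragment /tmp/lock-debug.config
#   cd $RELEASE_PACKAGEP/sources/kernel/kernel-4.9
#   ARCH=arm64 scripts/kconfig/merge_config.sh -O "$TEGRA_KERNEL_OUT" \
#       "$TEGRA_KERNEL_OUT/.config" /tmp/lock-debug.config
#   make ARCH=arm64 O=$TEGRA_KERNEL_OUT -j4
```

merge_config.sh warns about any requested value that does not end up in the final config, which is worth checking since some of these options may have dependencies that differ once the RT patches are applied.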
The patch mentioned as the solution does not apply cleanly and produces a few build errors:
../init/do_mounts_rd.c: In function 'rd_load_image':
../init/do_mounts_rd.c:272:3: error: implicit declaration of function 'sys_write'; did you mean 'sys_writev'? [-Werror=implicit-function-declaration]
sys_write(out_fd, buf, BLOCK_SIZE);
^~~~~~~~~
sys_writev
CC arch/arm64/kernel/return_address.o
LD firmware/built-in.o
../init/initramfs.c: In function 'xwrite':
../init/initramfs.c:30:16: error: implicit declaration of function 'sys_write'; did you mean 'sys_writev'? [-Werror=implicit-function-declaration]
ssize_t rv = sys_write(fd, p, count);
^~~~~~~~~
sys_writev
CC arch/arm64/kernel/cpuinfo.o
AS arch/arm64/lib/bitops.o
CC sound/core/sound.o
CC arch/arm64/kernel/cpu_errata.o
CC arch/arm64/kernel/cpufeature.o
CC virt/lib/irqbypass.o
../include/uapi/asm-generic/unistd.h:206:23: error: 'sys_write' undeclared here (not in a function); did you mean 'sys_writev'?
__SYSCALL(__NR_write, sys_write)
^
../arch/arm64/kernel/sys.c:56:35: note: in definition of macro '__SYSCALL'
#define __SYSCALL(nr, sym) [nr] = sym,
Stumbled across this thread because I am having issues upgrading to JetPack 4.6.4 from 4.6 with the RT patches applied on my TX2. Boot hangs about 70% of the time. It worked fine with 4.6, so I know what I’m doing in terms of using the script to apply these patches. Staying at 4.6 is not an option since I need to support the newer TX2 modules with the hardware mod.
This is a very fresh error on my end and I will be diving into it over the next few days, but I wanted to mention my problem here so that Nvidia knows that it is not just a single user having RT patch issues with the latest/greatest version of JetPack (and underlying kernel).
After testing, I am seeing a number of issues with R32.7.4 with the RT patches applied.
If nvpmodel runs at boot, I get a boot hang. If this service fails (which seems to happen more often than not), I am able to boot just fine. I am unable to get the system to hang after a successful boot via re-running nvpmodel. If I disable the service, I am able to boot with 100% success. If I try to change what nvpmodel mode is applied by default (including keeping all cores active), boot still hangs if the service runs.
rt-tests fails in spectacular fashion if I do get a good boot. Namely, sigwaittest -t 4 -f shows latency spikes on the order of several milliseconds (I would expect my worst case performance to be closer to the average, which is a few microseconds). Also, rt-migrate-test -c -p 60 fails repeatedly showing that lower priority tasks are scheduled ahead of high priority tasks. Note that all these tests were run with the Denver cores disabled.
If I increase kernel debug flags in my kernel config, I am able to duplicate @damien.lefevre’s results, and also see a few additional kernel BUGs. I attempted to apply the fix I found in another post (R32.7.1 / 4.9.253-rt168 : BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:987), which did not resolve any of my issues (including kernel BUGs being reported due to the DEBUG flags being set). In fact, the only thing that seems to resolve my issues is not applying the RT patches, which are unfortunately a requirement for my use case.
@DaveYYY When can we expect the RT patchset to be supported in L4T (again) for the TX2?