JetPack 4.6.3 preempt-rt patkernel: reboot loop

Hello,

The latest kernel from JetPack 4.6.3 has issues when the preempt-rt patches are applied.

The kernel locks up and reboots.

This can been seen with the JetPack Ubuntu distro as well as yocto distro.

Here are the steps to reproduced the issue with the Ubuntu distro:

docker load -i ./sdkmanager-1.9.2.10899-Ubuntu_18.04_docker.tar.gz
docker run \
    -it \
    --privileged \
    -v /dev/bus/usb:/dev/bus/usb/ \
    -v /dev:/dev \
    -v /media/$USER:/media/nvidia:slave \
    -v /opt/sdkmanager/:/opt/sdkmanager/ \
    -v /home/[username]/nvidia_sdk:/home/nvidia/nvidia/nvidia_sdk \
    --name JetPack_TX2_Devkit \
    --network host \
    --entrypoint /bin/bash \
    sdkmanager

docker exec -it JetPack_TX2_Devkit /bin/bash

# Download packages
/opt/nvidia/sdkmanager/sdkmanager \
    --cli downloadonly  \
    --logintype devzone \
    --product Jetson \
    --host \
    --targetos Linux \
    --version ${2} \
    --target JETSON_TX2_TARGETS \
    --select 'Jetson OS' \
    --flash skip \
    --downloadfolder /opt/sdkmanager/ \
    --license accept \
    --staylogin true \
    --datacollection disable \
    --exitonfinish

# Flash the stock image and verify the device is working
/opt/nvidia/sdkmanager/sdkmanager \
    --cli install  \
    --offline \
    --product Jetson \
    --host \
    --targetos Linux \
    --version ${3} \
    --target JETSON_TX2TARGETS \
    --select 'Jetson OS' \
    --flash all \
    --downloadfolder /opt/sdkmanager/ \
    --license accept \
    --datacollection disable \
    --exitonfinish

# Apply the preempt-rt patches, recompile the kernel and re-flash the device
sudo apt install -y build-essential git-core bc
cd ~/nvidia/nvidia_sdk/JetPack_4.6.3_Linux_JETSON_TX2_TARGETS/Linux_for_Tegra/
./source_sync.sh
# use tegra-l4t-r32.7.3 tag name when queried 

cd sources/kernel/kernel-4.9/scripts/
./rt-patch.sh apply-patches
cd ..
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git add *
git commit -m  "Applied RT-PREEMPT patches"
mkdir build
export RELEASE_PACKAGEP=/home/nvidia/nvidia/nvidia_sdk/JetPack_4.6.3_Linux_JETSON_TX2_TARGETS/Linux_for_Tegra
export TEGRA_KERNEL_OUT=$RELEASE_PACKAGEP/sources/kernel/kernel-4.9/build
export CROSS_COMPILE=/usr/bin/aarch64-linux-gnu-
export LOCALVERSION=-tegra-rt

make ARCH=arm64 O=$TEGRA_KERNEL_OUT tegra_defconfig
make ARCH=arm64 O=$TEGRA_KERNEL_OUT -j4

cp $TEGRA_KERNEL_OUT/arch/arm64/boot/Image $RELEASE_PACKAGEP/kernel/Image
cp -r $TEGRA_KERNEL_OUT/arch/arm64/boot/dts/ $RELEASE_PACKAGEP/kernel/dtb/
sudo make ARCH=arm64 O=$TEGRA_KERNEL_OUT modules_install INSTALL_MOD_PATH=$RELEASE_PACKAGEP/rootfs/

cd $RELEASE_PACKAGEP/rootfs/
sudo tar --owner root --group root -cjf kernel_supplements.tbz2 lib/modules
sudo mv kernel_supplements.tbz2 $RELEASE_PACKAGEP/kernel/

sudo ./apply_binaries.sh
sudo ./flash.sh jetson-tx2-devkit mmcblk0p1

Now the device reboots in an infinite loop before I can make it to the login.

With a Yocto (meta-tegra) based images that I use for production devices and based on this latest kerner , it is a little less obvious, probably because of the different packages installed. But there the kernel locks and the device reboots too:

  • the most obvious → type dmesg: immediate lock and reboot
  • run smartctl: a lock happens somewhere and I lose access to the SATA drive - no reboot but I cannot recover from the error
  • stepping through the code using gdbserver – random lock and reboot

There is no problem with 4.6.2, this is a new issue.

The issue has already been reported in other topics:

Oh and I forgot to mention that I get the same issue with both TX2 and Xavier AGX devkits.

Hi,

thanks for reporting.

I want to make sure that the patch shared below just prevents the device from rebooting, but kernel will still gets locked in other use cases, right?

Have you experienced other scenarios that also cause kernel to lock?

Hi,

Yes this is correct. The patch just made dmesg not print anything, to prevent the reboot but broke the dmesg functionality. It didn’t help.

I found 2 other ways to get locks using smartctl and gdbserver. There probably are others since trying to use the Ubuntu distro, the device keeps rebooting before I get a chance to login.

Hi,

I just tested with running smartctl on SATA drives, but it did not seem to cause anything abnormal.
Is it something that’s 100% guaranteed to happen?
Can you share what exact commands do you use in the use case?

I think it just disables part of dmesg logs, but not all of them? Are you saying you get nothing from dmesg with the patch?

Can you apply the patch and run longer to find if there are other ways to trigger this bug?

Hi,

For the SATA drive lock, I run this command "/usr/sbin/smart-log --test" after about a minute, I lose access to the disk. It happens 100% of the time and that’s using our meta-tegra based distro. I cannot log in the Ubuntu distro since the device keeps rebooting.

It disables all of them. But I think that’s the easiest command to use to find the root cause because it happen systematically. Looking at the syslog_print_all source some locking is happening there. I don’t know how to debug this.

I’m not able to run the Ubuntu distro because of the reboots. With the meta-tegra distro I was only able to find the previously mentioned issues and spent days on it.

BTW according to this post Device reboots after using the RT-Kernel patch [JP4.6.3] the pre-built rt kernel deb package still uses an older kernel version. So it does need to be re-built to perform this test.

Hi,

May I know how did you install smartctl? smart-log does not seem to be available with smartmontools, which I just installed.

Have you really applied the patch and verify on your device?
Like, I still have logs pop up when I plug in the SATA drive:

[18185.113085] tegra-xusb 3530000.xhci: exiting ELPG
[18185.116685] tegra-xusb 3530000.xhci: Firmware timestamp: 2020-07-06 13:39:28 UTC, Version: 55.18 release
[18185.242181] tegra-xusb 3530000.xhci: exiting ELPG done
[18185.242423] usb usb1: usb_suspend_both: status 0
[18185.498464] usb 2-1: new SuperSpeed USB device number 2 using tegra-xusb
[18185.518972] usb 2-1: New USB device found, idVendor=174c, idProduct=55aa
[18185.518977] usb 2-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[18185.518980] usb 2-1: Product: Ugreen Storage Device
[18185.518982] usb 2-1: Manufacturer: Ugreen
[18185.518984] usb 2-1: SerialNumber: 123458CC01B1
[18185.526404] scsi host2: uas
[18185.537662] scsi 2:0:0:0: Direct-Access ADATA SX 900 0 PQ: 0 ANSI: 6
[18185.598911] sd 2:0:0:0: [sda] 1000215216 512-byte logical blocks: (512 GB/477 GiB)
[18185.607281] sd 2:0:0:0: [sda] Write Protect is off
[18185.612077] sd 2:0:0:0: [sda] Mode Sense: 43 00 00 00
[18185.612422] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn’t support DPO or FUA
[18185.621923] sd 2:0:0:0: [sda] Optimal transfer size 33553920 bytes
[18185.625762] sd 2:0:0:0: [sda] Attached SCSI disk
[18185.913432] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)

Same as the previous question. Have you applied the patch and tested on your device?

Ah sorry about that, it’s an internal wrapper. The real command is smartctl --test=short /dev/sda

I did on our yocto distro. The output was empty.

I’ll be able to recompile for the Ubuntu image and devkit next Monday.

1 Like

Hi @DaveYYY

I tried the suggested fix with the ubuntu distro and I get an output for dmesg. I don’t with yocto, I don’t know why.

I also gave a shot at the R32.7.4 kernel with the yocto distribution and get now different locks during the boot and systemd gets blocked and never completes

Here is the console output:
systemd_blocked.txt (103.4 KB)

This image has the minimum required packages to boot with systemd to a shell.

Here are the steps to build this image:

cd ~
git clone -b dunfell https://github.com/OE4T/tegra-demo-distro.git
cd tegra-demo-distro
git submodule update --init
. ./setup-env --machine jetson-tx2-devkit
devtool modify linux-tegra
cd workspace/sources/linux-tegra/scripts
./rt-patch.sh apply-patches
cd ~/tegra-demo-distro/build
bitbake demo-image-base
cd tmp/deploy/images/jetson-tx2-devkit/
mkdir image 
cd image
tar -xvzf ../demo-image-base-jetson-tx2-devkit.tegraflash.tar.gz
# Put the devkit in flashing mode
./doflash.sh

Hi,

just want to clarify that, do you still hit kernel locks you mentioned previously with our official BSP?
The situation may be different if it only happens on Yocto build.

Hi,

I’ll be able to test the official BSP after 5 weeks now. But since the yocto build uses the R32.7.4 sources, I’d expect the same behavior.

I’m not sure if I mentioned it to you but up to R32.7.2 there were no issues.

Hi @DaveYYY

I’m back at this now.

I figured out why we see dmesg prints in the Ubuntu distribution and I don’t in our yocto based image. The Ubuntu image uses dmesg from util-linux and our image the busybox implementation.

util-linux reads from /dev/kmsg while busybox uses syslog. So syslog is broken in the latest releases.

Here is the strace output for both:
util-linux-dmesg-traces.txt (6.1 KB)
busybox-dmesg-traces.txt (3.1 KB)

Regarding the kernel locks. I don’t see them (dmesg and smartctl) in the latest Ubuntu image but I cannot compile/run our application due to complex dependencies. I’ll get to that as last resort.

The yocto build hangs at boot and systemd cannot complete the boot sequence. nvphsd_setup.sh and nvpmodel are stuck. Here are the traces:
frozen_boot.txt (81.8 KB)

I’ll study the oe4t additional patches next.

Hi @DaveYYY,

I have made a yocto build using the stock kernel. The only difference is that the several NVidia source repositories are merged into one to be able to build from the yocto environment. See GitHub - OE4T/linux-tegra-4.9 at l4t-r32.7.4-base

0c8e1518755e6106dd0aaac665bf9766390ebbca : Build system updates for merged nvidia tree.

629740abcf108b76c90c369cea6db3e4e5005a4c : nvidia: tegra: host: remove xxd reference from Makefile

1b233ec4b1865efc9a1a28b8e2a691c0bcad85c3 : scripts: update ‘the-space’ definition

328575cfe51785bbe3b1e37218ba74f317a32e01 : kbuild: change kernel version line order in Makefile for Yocto-devtoo…

eee9b9dcf587e3ea7432f9033e36a0ad4800aaa4 : ARM64: update NVIDIA dtb/dtbo generation targets

7d8da3b07712fef0ea584912e5ab2724cda68191 : ARM64: dts: pass base name of file to cpp

2fbb2ed088884cfbb83f298fd3920f3c11b3890f : bcmdhd: fix strncpy bounds

84679eae7eb6f87a9fc4b988e182c3e88f9ccbd0 : nvidia: net/wireless/rtl88XX: fix maybe-uninitialized compilation error

eb8934582bf7224e6348aa7a8f394a5ed5cb5bbe : nvidia: net/wireless/rtl8812: fix tautological comparison

Here systemd also hangs at boot:
systemd_boot_hang_stock.md (13.6 KB)

What other next steps would you suggest to fix this issue?

Hi,

if the official BSP works fine, then please just use it.
Yocto is not officially maintained by NVIDIA, and we do not support issues related to custom Yocto build.

@DaveYYY this can be argued, but this commit b3fb2b5173662 is just a band aid hiding a much bigger problem

b3fb2b5173662 kernel/printk/printk.c (Your Name                2023-08-22 08:24:07 +0000 1578) 
b3fb2b5173662 kernel/printk/printk.c (Your Name                2023-08-22 08:24:07 +0000 1579)  /* WAR: Bug 3991287 */
b3fb2b5173662 kernel/printk/printk.c (Your Name                2023-08-22 08:24:07 +0000 1580)  /*      error = syslog_print_all(buf, len, clear); */

So reverting it and looking at why the kernel is locking up probably would reveal the real issue. This is not a fix.

We didn’t have problem with 4.6.2 but now we do. We cannot stay at 4.6.2 with TX2 because of new hynix chips.

With different hardware configuration and software packages, different customers will hit the locks differently.

Then we need a method that can consistently re-produce the kernel locks.
I didn’t see it with the SATA drive you mentioned earlier.

Hi @DaveYYY.

The most obvious way to replicate the deadlock is by reverting the commit b3fb2b5173662 I mentioned previously. You’ll end up in a boot loop.

Do we have a QEMU environment or any way to hook up gdb?

Hi @DaveYYY

Using NVidia distro, I reverted the commit b3fb2b5173662 enabled more lock debugging flags in the kernel config .config (163.3 KB), rebuilt and flashed the device.

I am back at the boot loop but I can see more information.

There’s are 3 lock errors detected, see kernel_lock_console.txt (69.6 KB).

Ex:

[    5.254801] BUG: sleeping function called from invalid context at ../kernel/locking/rtmutex.c:987
[    5.254804] in_atomic(): 1, irqs_disabled(): 128, pid: 5, name: kworker/u12:0
[    5.254805] INFO: lockdep is turned off.
[    5.254808] irq event stamp: 0
[    5.254810] hardirqs last  enabled at (0): [<          (null)>]           (null)
[    5.254820] hardirqs last disabled at (0): [<        (ptrval)>] copy_process.isra.7.part.8+0x2a8/0x18d0
[    5.254823] softirqs last  enabled at (0): [<        (ptrval)>] copy_process.isra.7.part.8+0x2a8/0x18d0
[    5.254825] softirqs last disabled at (0): [<          (null)>]           (null)
[    5.254831] Preemption disabled at:
[    5.254832] [<        (ptrval)>] tegra_i2c_xfer_msg+0x1d8/0xc38
[    5.254838] CPU: 0 PID: 5 Comm: kworker/u12:0 Tainted: G        W       4.9.337-rt197-d-tegra-rt #4
[    5.254840] Hardware name: quill (DT)
[    5.254852] Workqueue: events_unbound async_run_entry_fn
[    5.254854] Call trace:
[    5.254858] [<        (ptrval)>] dump_backtrace+0x0/0x1a0
[    5.254861] [<        (ptrval)>] show_stack+0x24/0x30
[    5.254865] [<        (ptrval)>] dump_stack+0xa4/0xd4
[    5.254870] [<        (ptrval)>] ___might_sleep+0x15c/0x230
[    5.254873] [<        (ptrval)>] rt_spin_lock+0x30/0x70
[    5.254878] [<        (ptrval)>] devres_add+0x2c/0x70
[    5.254881] [<        (ptrval)>] devm_kmalloc+0x5c/0x88
[    5.254886] [<        (ptrval)>] tegra_dma_sg_req_get+0xa4/0xc8
[    5.254889] [<        (ptrval)>] tegra_dma_prep_slave_sg+0x190/0x498
[    5.254892] [<        (ptrval)>] tegra_i2c_xfer_msg+0x238/0xc38
[    5.254894] [<        (ptrval)>] tegra_i2c_xfer+0x5e4/0x7a8
[    5.254897] [<        (ptrval)>] __i2c_transfer+0x128/0x9e0
[    5.254900] [<        (ptrval)>] i2c_transfer+0x98/0x108
[    5.254905] [<        (ptrval)>] tegra_hdmi_ddc_i2c_xfer+0x70/0x158
[    5.254909] [<        (ptrval)>] tegra_edid_read_block+0xc0/0x260
[    5.254913] [<        (ptrval)>] tegra_edid_get_monspecs+0x6c/0xbd8
[    5.254917] [<        (ptrval)>] tegra_hdmi_controller_enable+0x2e0/0xf80
[    5.254920] [<        (ptrval)>] tegra_dc_hdmi_enable+0x48/0xb8
[    5.254924] [<        (ptrval)>] tegra_nvdisp_head_enable+0x4a0/0x13b8
[    5.254927] [<        (ptrval)>] _tegra_dc_enable+0xf0/0x110
[    5.254930] [<        (ptrval)>] tegra_dc_probe+0x1204/0x1ac0
[    5.254933] [<        (ptrval)>] platform_drv_probe+0x60/0xc0
[    5.254936] [<        (ptrval)>] driver_probe_device+0x298/0x448
[    5.254938] [<        (ptrval)>] __driver_attach+0x110/0x138
[    5.254941] [<        (ptrval)>] bus_for_each_dev+0x5c/0xa8
[    5.254944] [<        (ptrval)>] driver_attach+0x30/0x40
[    5.254946] [<        (ptrval)>] driver_attach_async+0x20/0x60
[    5.254949] [<        (ptrval)>] async_run_entry_fn+0x48/0x150
[    5.254953] [<        (ptrval)>] process_one_work+0x288/0x7d8
[    5.254956] [<        (ptrval)>] worker_thread+0x50/0x4d0
[    5.254958] [<        (ptrval)>] kthread+0xf4/0xf8
[    5.254961] [<        (ptrval)>] ret_from_fork+0x10/0x50

I found a similar bug report R32.7.1 / 4.9.253-rt168 : BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:987

The patch mentioned as solution does not apply and has few errors

../init/do_mounts_rd.c: In function 'rd_load_image':
../init/do_mounts_rd.c:272:3: error: implicit declaration of function 'sys_write'; did you mean 'sys_writev'? [-Werror=implicit-function-declaration]
   sys_write(out_fd, buf, BLOCK_SIZE);
   ^~~~~~~~~
   sys_writev
  CC      arch/arm64/kernel/return_address.o
  LD      firmware/built-in.o
../init/initramfs.c: In function 'xwrite':
../init/initramfs.c:30:16: error: implicit declaration of function 'sys_write'; did you mean 'sys_writev'? [-Werror=implicit-function-declaration]
   ssize_t rv = sys_write(fd, p, count);
                ^~~~~~~~~
                sys_writev
  CC      arch/arm64/kernel/cpuinfo.o
  AS      arch/arm64/lib/bitops.o
  CC      sound/core/sound.o
  CC      arch/arm64/kernel/cpu_errata.o
  CC      arch/arm64/kernel/cpufeature.o
  CC      virt/lib/irqbypass.o
../include/uapi/asm-generic/unistd.h:206:23: error: 'sys_write' undeclared here (not in a function); did you mean 'sys_writev'?
 __SYSCALL(__NR_write, sys_write)
                       ^
../arch/arm64/kernel/sys.c:56:35: note: in definition of macro '__SYSCALL'
 #define __SYSCALL(nr, sym) [nr] = sym,

Can you reproduce the same on your side?

1 Like

Stumbled across this thread because I am having issues upgrading to JetPack 4.6.4 from 4.6 with the RT patches applied on my TX2. Boot hangs about 70% of the time, worked fine with 4.6, so I know what I’m doing in terms of using the script to apply these patches. Staying at 4.6 is not an option since I need to support the newer TX2 modules with the hardware mod.

This is a very fresh error on my end and I will be diving into it over the next few days, but I wanted to mention my problem here so that Nvidia knows that it is not just a single user having RT patch issues with the latest/greatest version of JetPack (and underlying kernel).

After testing, I am seeing a number of issues with R32.7.4 with the RT patches applied.

  1. If nvpmodel runs at boot, I get a boot hang. If this service fails (which seems to happen more often than not), I am able to boot just fine. I am unable to get the system to hang after a successful boot via re-running nvpmodel. If I disable the service, I am able to boot with 100% success. If I try to change what nvpmodel mode is applied by default (including keeping all cores active), boot still hangs if the service runs.
  2. rt-tests fails in spectacular fashion if I do get a good boot. Namely, sigwaittest -t 4 -f shows latency spikes on the order of several milliseconds (I would expect my worst case performance to be closer to the average, which is a few microseconds). Also, rt-migrate-test -c -p 60 fails repeatedly showing that lower priority tasks are scheduled ahead of high priority tasks. Note that all these tests were run with the Denver cores disabled.
  3. If I increase kernel debug flags in my kernel config, I am able to duplicate @damien.lefevre results, and also see a few additional kernel BUGs. I attempted to apply the fix I found from another post (R32.7.1 / 4.9.253-rt168 : BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:987), which did not resolve any of my issues (including kernel BUGs being reported due to the DEBUG flags being set). In fact, the only thing that seems to resolve my issues is not applying the RT patches, which is unfortunately a requirement for my use case.

@DaveYYY When can we expect the RT patchset to be supported in L4T (again) for the TX2?