RT patches leading to warning and ultimately deadlocked system

cquast · October 27, 2020, 12:39pm

If the Jetson TX2 devkit is stress tested using stress-ng on all 6 cores, while scheduling additional threads using pthread, eventually a lot of warnings are printed:

[ 1839.855290] WARNING: CPU: 0 PID: 1 at tx2_nvidia_root/sources/kernel/kernel-4.9/kernel/cpu.c:353 unpin_current_cpu+0x70/0x78
[ 1839.855295] Modules linked in: nvgpu bcmdhd cfg80211 mttcan can_dev bluedroid_pm spidev
[ 1839.855296]
[ 1839.855298] CPU: 0 PID: 1 Comm: systemd Tainted: G W 4.9.140-rt93+ #1
[ 1839.855299] Hardware name: quill (DT)
[ 1839.855300] task: ffffffc1ec330000 task.stack: ffffffc1ec338000
[ 1839.855301] PC is at unpin_current_cpu+0x70/0x78
[ 1839.855303] LR is at unpin_current_cpu+0x24/0x78
[ 1839.855304] pc : lr : pstate: 60400045
[ 1839.855305] sp : ffffffc1ec33b8f0
[ 1839.855307] x29: ffffffc1ec33b8f0 x28: 0000000000000010
[ 1839.855309] x27: ffffff800a161378 x26: 0000000000000013
[ 1839.855310] x25: 0000000000000021 x24: 0000000000000000
[ 1839.855312] x23: ffffff8009e79688 x22: 0000000000000013
[ 1839.855314] x21: ffffff800a161d20 x20: ffffff800a161378
[ 1839.855315] x19: ffffff800983aa90 x18: 0000000000000010
[ 1839.855317] x17: 0000000000000000 x16: ffffffc1ec330000
[ 1839.855318] x15: ffffffffffffffff x14: ffffff808a161d22
[ 1839.855320] x13: ffffff800a161d30 x12: 0000000000000007
[ 1839.855321] x11: 0000000005f5e0ff x10: ffffff800a162000
[ 1839.855323] x9 : 00000000ffffffd0 x8 : ffffff80083d0840
[ 1839.855324] x7 : 0000000000003ff0 x6 : ffffff8008f88ab0
[ 1839.855326] x5 : 00000041ed7f3000 x4 : 0000000000000001
[ 1839.855327] x3 : ffffffc1ec330000 x2 : 00000041ed7f3000
[ 1839.855329] x1 : ffffffc1f702da90 x0 : 0000000000000000
[ 1839.855329]
[ 1839.855330] —[ end trace 000000000000043a ]—
[ 1839.855331] Call trace:
[ 1839.855333] unpin_current_cpu+0x70/0x78
[ 1839.855335] migrate_enable+0x194/0x270
[ 1839.855337] call_force_console_drivers+0xac/0xd0
[ 1839.855339] console_unlock.part.11+0x338/0x460
[ 1839.855340] console_device+0x58/0xd8
[ 1839.855342] tty_open+0x2b4/0x5a0
[ 1839.855344] chrdev_open+0x94/0x198
[ 1839.855346] do_dentry_open+0x1b8/0x318
[ 1839.855348] vfs_open+0x58/0x88
[ 1839.855350] do_last+0x454/0xe60
[ 1839.855351] path_openat+0x90/0x378
[ 1839.855353] do_filp_open+0x70/0xe8
[ 1839.855354] do_sys_open+0x174/0x258
[ 1839.855356] SyS_openat+0x3c/0x50
[ 1839.855358] el0_svc_naked+0x34/0x38
[ 1839.860123] ------------[ cut here ]------------

It seems to be different call graphs that lead to the warning in unpin_current_cpu:

[ 25.233262] Call trace:
[ 25.233264] unpin_current_cpu+0x70/0x78
[ 25.233266] migrate_enable+0x194/0x270
[ 25.233268] rt_read_unlock+0x34/0x40
[ 25.233272] led_trigger_event+0x6c/0x80
[ 25.233274] mmc_start_request+0x104/0x228
[ 25.233276] mmc_wait_for_req+0x7c/0xf8
[ 25.233278] mmc_io_rw_extended+0x2a4/0x2d8
[ 25.233279] sdio_io_rw_ext_helper+0xbc/0x218
[ 25.233281] sdio_readsb+0x44/0x58
[ 25.233394] sdioh_buffer_tofrom_bus+0xbc/0xe8 [bcmdhd]
[ 25.233503] sdioh_request_buffer+0x264/0x2d8 [bcmdhd]
[ 25.233617] bcmsdh_recv_buf+0x94/0x140 [bcmdhd]
[ 25.233725] dhdsdio_readframes+0x4ac/0x1fe0 [bcmdhd]
[ 25.233833] dhdsdio_dpc+0x998/0x1348 [bcmdhd]
[ 25.233942] dhd_bus_dpc+0x2c/0x68 [bcmdhd]
[ 25.234051] dhd_dpc_thread+0x118/0x238 [bcmdhd]
[ 25.234053] kthread+0xec/0xf0
[ 25.234055] ret_from_fork+0x10/0x30
[ 25.234084] ------------[ cut here ]------------
[ 25.234087] WARNING: CPU: 0 PID: 4071 at tx2_nvidia_root/sources/kernel/kernel-4.9/kernel/cpu.c:353 unpin_current_cpu+0x70/0x78
[ 25.234092] Modules linked in: nvgpu bcmdhd mttcan can_dev bluedroid_pm cfg80211 spidev uvcvideo
[ 25.234092]
[ 25.234094] CPU: 0 PID: 4071 Comm: dhd_dpc Tainted: G W 4.9.140-rt93+ #27
[ 25.234094] Hardware name: quill (DT)
[ 25.234095] task: ffffffc1d73d2d00 task.stack: ffffffc1d8b80000
[ 25.234097] PC is at unpin_current_cpu+0x70/0x78
[ 25.234099] LR is at unpin_current_cpu+0x24/0x78
[ 25.234100] pc : lr : pstate: 60400045
[ 25.234100] sp : ffffffc1d8b83700
[ 25.234102] x29: ffffffc1d8b83700 x28: 0000000000000001
[ 25.234104] x27: 0000000000000000 x26: 0000000000008000
[ 25.234105] x25: 0000000000000000 x24: 000000000000ffff
[ 25.234107] x23: 0000000000000080 x22: 0000000000000001
[ 25.234109] x21: ffffffc1e8bf1b80 x20: ffffffc1d8b83970
[ 25.234110] x19: ffffff800984aa90 x18: 0000000000000000
[ 25.234112] x17: 0000000000000000 x16: ffffffc1d73d2d00
[ 25.234113] x15: 0000000000000000 x14: 0000000000000070
[ 25.234115] x13: 000000000000001f x12: 0000000000000000
[ 25.234116] x11: ffffff8008f899d0 x10: 0000000000000b30
[ 25.234118] x9 : ffffffc1d8b839d0 x8 : ffffffc1d73d3890
[ 25.234119] x7 : 0000000000003ff0 x6 : ffffff80080a1e60
[ 25.234121] x5 : 00000041ed7e3000 x4 : 0000000000000001
[ 25.234122] x3 : ffffffc1d73d2d00 x2 : 00000041ed7e3000
[ 25.234124] x1 : ffffffc1f702da90 x0 : 0000000000000000
[ 25.234124]
[ 25.234125] —[ end trace 0000000000000960 ]—

The rt patches where taken from the 4.9 kernel directory in rt-patches and applied running the following from the vanialla nvidia kernel:

for i in rt-patches/*.patch; do patch -p1 < $i; done

Do you have any idea how to fix or debug that issue?

Best regards

sumitg · October 28, 2020, 11:21am

Please try after enabling the below debug macro’s in kernel defconfig ‘tegra_defconfig’ to get more debug info.
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_RT_MUTEXES=y

Also, need the below details to try to reproduce locally:

Which release used?
steps to reproduce the issue?
reproduction rate and duration of the test?
‘stress-ng’ command used and parameter passed?
any local changes applied?

cquast · November 1, 2020, 8:16pm

Kernel used: tegra-l4t-r32.4.3

I enabled debugging, and the warnings stay the same.

[ 181.171322] WARNING: CPU: 1 PID: 4300 at kernel/cpu.c:353 unpin_current_cpu+0x70/0x78
[ 181.171325] Modules linked in: nvgpu bcmdhd mttcan cfg80211 can_dev bluedroid_pm spidev
[ 181.171326]
[ 181.171328] CPU: 1 PID: 4300 Comm: dmesg_handler.s Tainted: G W 4.9.140-rt93-gfb4356cad841 #5
[ 181.171328] Hardware name: quill (DT)
[ 181.171329] task: ffffffc1d8db1e80 task.stack: ffffffc1dbd54000
[ 181.171331] PC is at unpin_current_cpu+0x70/0x78
[ 181.171332] LR is at unpin_current_cpu+0x24/0x78
[ 181.171333] pc : [] lr : [] pstate: 60400145
[ 181.171334] sp : ffffffc1dbd57de0
[ 181.171335] x29: ffffffc1dbd57de0 x28: ffffffc1d8db1e80
[ 181.171336] x27: ffffff8008f82000 x26: 0000000000000039
[ 181.171338] x25: 0000000000000123 x24: 0000000000000015
[ 181.171339] x23: 0000000020000000 x22: ffffffc1e67d8380
[ 181.171341] x21: ffffffc1db25c740 x20: ffffffc1db25c680
[ 181.171342] x19: ffffff800984ab10 x18: 000000000000000f
[ 181.171344] x17: 0000007f81159e48 x16: ffffff800825a7e8
[ 181.171345] x15: 0000000000000050 x14: 0000000000000000
[ 181.171346] x13: 0000007ff2fff700 x12: 0000000000000000
[ 181.171348] x11: 0000005568314000 x10: 000000556830b000
[ 181.171349] x9 : 00000000000014fd x8 : ffffffc1dbd57e50
[ 181.171351] x7 : 0000000000003ff0 x6 : 0000000000003fff
[ 181.171352] x5 : 00000041ed7f7000 x4 : 0000000000000001
[ 181.171354] x3 : ffffffc1d8db1e80 x2 : 00000041ed7f7000
[ 181.171355] x1 : ffffffc1f7041b10 x0 : 0000000000000000
[ 181.171355]
[ 181.171356] —[ end trace 000000000000018a ]—
[ 181.171356] Call trace:
[ 181.171359] [] unpin_current_cpu+0x70/0x78
[ 181.171360] [] migrate_enable+0x1b4/0x2b8
[ 181.171362] [] rt_spin_unlock+0x24/0x30
[ 181.171366] [] __close_fd+0xb0/0xe8
[ 181.171368] [] SyS_close+0x28/0x68
[ 181.171369] [] __sys_trace_return+0x0/0x4

To reproduce, you actually don’t need to use stress-ng. I managed to reduce the behavior to a small C program using pthreads:

Use tegra-l4t-r32.4.3 kernel version
Apply rt patches
Use following config which uses RT: Ubuntu Pastebin
Compile program and copy to host: Ubuntu Pastebin
Launch program in infinite loop: while [ 1 ] ; do ./pthread_minimal; done

For me, the program crashes in a matter of minutes, but I also had cases with a slightly different setup where it didn’t crash for a few hours.

Any other ideas what I could try?

sumitg · November 2, 2020, 4:18pm

Please try applying the RT patches using script ‘kernel-4.9/scripts/rt-patch.sh’ instead of just applying *.patch files and share if the issue still comes. The script also does required changes to the kernel defconfig on top of applying patches.

Usages:

./scripts/rt-patch.sh apply-patches : Apply RT patches
./scripts/rt-patch.sh revert-patches : Revert RT patches

cquast · November 3, 2020, 10:41am

I still have the same behavior. I tried with the newest SDK. Here are the steps I performed:

Installed the SDK (I followed the steps inhttps://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2Fquick_start.html%23wwpID0E0BE0HA)
Compiled kernel (Welcome — Jetson Linux<br/>Developer Guide 34.1 documentation)/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2Fkernel_custom.html%23wwpID0E03D0HA](Welcome — Jetson Linux<br/>Developer Guide 34.1 documentation))
Used rt-patch.sh script to apply rt changes
After boot, compiled the pthread_minimal.c shared in the previous post, ran i=0; while [ 1 ]; do echo "Running test $i" >> foo.log; ./pthread_minimal; i=$((i+1)); done

==> On run 46, I got same prints like always, with a log of WARNINGS about unpin_current_cpu, and the system becoming unresponsive at some point and rebooting.

cquast · November 16, 2020, 11:08am

Do you have any more ideas what to try?

sumitg · November 20, 2020, 10:19am

Could you try the below change to comment the function call after applying RT patches and share if the error still is coming.

File: kernel/printk/printk.c
@@ -1638,6 +1664,7 @@ static void call_force_console_drivers(const char *force_text,
if (con->flags & CON_FORCE_LEVEL)
con->write(con, force_text, force_len);
}
+ //migrate_enable();
}

cquast · November 23, 2020, 9:11am

That makes affinity not work at all. All threads are scheduled on core 0 with this setup, instead of being distributed.

Do you have any other ideas?

sfalsig · November 25, 2020, 11:00am

I’m working with cquast on this issue, and we’re still not able to run the RT-patched kernel while specifying the affinity of some of our threads. This would be really useful for us to be able to specify which threads need to run on the Denver cores.

Is there anything else we can try?

I can see that the current rt-patch folder in the kernel-4.9 repo seems to be based on the 4.9.115-rt93 patch, but that the kernel itself is at 4.9.140. Is there maybe something that got changed between .115 and .140 that isn’t reflected in the current provided rt patch?

sumitg · April 21, 2021, 1:41pm

Sorry to reply late.
If you are still facing this issue, could you please try reverting below patch and share if the problem still comes.
kernel/kernel-4.9$ git log --oneline kernel/sched/core.c | grep “sched/core: Fix rules for running on online”
e363fcd BACKPORT: sched/core: Fix rules for running on online && !active CPUs

AnishAney · April 21, 2021, 7:17pm

Yes, this works. I had to revert this patch to get it working, at least for my similar testing.

cquast · April 22, 2021, 2:48pm

I can’t check anymore in our setup, but if we ever have this problem again, I will try it out. Thanks for the answer, even if it’s a bite late!