NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mmc_cmdq_d/0:1506]

maxlapshin · May 2, 2020, 9:14am

[79237.308350] INFO: rcu_sched self-detected stall on CPU
[79237.313494] 0-…: (1 GPs behind) idle=513/140000000000001/0 softirq=788211/788211 fqs=2607
[79237.322000] (t=5250 jiffies g=961921 c=961920 q=4232)
[79237.327224] Task dump for CPU 0:
[79237.327226] mmc_cmdq_d/0 S 0 1506 2 0x00000002
[79237.327231] Call trace:
[79237.327240] [] dump_backtrace+0x0/0x180
[79237.327243] [] show_stack+0x24/0x30
[79237.327248] [] sched_show_task+0xe8/0x130
[79237.327251] [] dump_cpu_task+0x48/0x58
[79237.327256] [] rcu_dump_cpu_stacks+0xb8/0xec
[79237.327260] [] rcu_check_callbacks+0x65c/0x8a0
[79237.327263] [] update_process_times+0x34/0x60
[79237.327266] [] tick_sched_handle.isra.5+0x38/0x70
[79237.327268] [] tick_sched_timer+0x4c/0x90
[79237.327271] [] __hrtimer_run_queues+0xdc/0x2d0
[79237.327273] [] hrtimer_interrupt+0xa0/0x1d0
[79237.327277] [] tegra186_timer_isr+0x34/0x48
[79237.327280] [] __handle_irq_event_percpu+0x68/0x228
[79237.327282] [] handle_irq_event_percpu+0x28/0x60
[79237.327284] [] handle_irq_event+0x54/0x88
[79237.327286] [] handle_fasteoi_irq+0xc0/0x1b8
[79237.327288] [] generic_handle_irq+0x38/0x50
[79237.327290] [] __handle_domain_irq+0x68/0xc0
[79237.327293] [] gic_handle_irq+0x5c/0xb0
[79237.327295] [] el1_irq+0xe8/0x178
[79237.327298] [] mmc_cmdq_thread+0x210/0x218
[79237.327302] [] kthread+0xec/0xf0
[79237.327304] [] ret_from_fork+0x10/0x40
[79261.336695] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mmc_cmdq_d/0:1506]

[79261.344515] CPU: 0 PID: 1506 Comm: mmc_cmdq_d/0 Not tainted 4.9.140+ #1
[79261.344517] Hardware name: lightning (DT)
[79261.344519] task: ffffffc0ef858dc0 task.stack: ffffffc0ee2a0000
[79261.344526] PC is at mmc_cmdq_thread+0xd0/0x218
[79261.344528] LR is at mmc_cmdq_thread+0xb8/0x218
[79261.344531] pc : [<ffffff8008a22670>] lr : [<ffffff8008a22658>] pstate: 40400045
[79261.344532] sp : ffffffc0ee2a3dc0
[79261.344533] x29: ffffffc0ee2a3dc0 x28: ffffffc0ccde3380 
[79261.344537] x27: ffffff8009022e90 x26: ffffff8009951450 
[79261.344541] x25: ffffffc0efa79030 x24: ffffffc0efa79018 
[79261.344544] x23: 0000000000000001 x22: ffffffc0efb51580 
[79261.344547] x21: ffffffc0efb51000 x20: ffffffc0efb51568 
[79261.344550] x19: ffffffc0ee4dadf0 x18: 0000000000000000 
[79261.344553] x17: 0000000000000000 x16: 0000000000000000 
[79261.344556] x15: 0000000000000000 x14: 0000a8b2466b97bf 
[79261.344559] x13: 0000000000000375 x12: 071c71c71c71c71c 
[79261.344562] x11: 0000000000000003 x10: 0000000000000000 
[79261.344565] x9 : 0000000000000001 x8 : 0000000000000000 
[79261.344567] x7 : ffffffc0f1001b00 x6 : 000002400f2dab89 
[79261.344570] x5 : 0000000000000000 x4 : 0000000000000001 
[79261.344573] x3 : 00000040ed457000 x2 : ffffffc0ccde3380 
[79261.344576] x1 : 00000000000053f3 x0 : ffffffc0efa79000 

[79289.337098] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mmc_cmdq_d/0:1506]

What can it mean? Jetson TX2 4GB.

maxlapshin · May 2, 2020, 9:21am

It happens when I start writing big files on internal MMC.

WayneWWW · May 4, 2020, 6:21am

Could you share the command you are using to reproduce this issue?

It looks like “writing big files” cause the cpu hang so the watchdog timer triggers the reboot. But this should not happen so we want to check if this is a bug or something else.

maxlapshin · May 4, 2020, 12:31pm

I’ve uploaded our kernel config
config-32.2.3.log (119.3 KB)

I’m just making scp via builtin USB interface. Jetson TX2 installed on devboard is plugged via nvidia USB cable from recovery port to USB3 on my laptop.

Ethernet is established over USB and I run scp for 2 gigabyte file.

This error happens each time.

linuxdev · May 4, 2020, 3:46pm

I’m wondering if perhaps it is just slowing down from all the data. Or if perhaps you ran out of space. If you start by making sure you have plenty of extra space (“df -H /”, or whatever location copy goes to), and “sudo nvpmodel -m 0” followed by “sudo jetson_clocks”), then you’ll know you have space and all cores are running max…any time delay causing watchdog would not be from those issues.

If performance mode does not help, and if you have enough disk space, then try the test which follows…

For reference in the next section, 2GB is exactly 2147483648 bytes. Also equals 1024*1024*2048, and 1024 squared is 1048576.

2GB happens to be the old signed 32-bit limit as well. If you create a test file just under 2GB on your host PC like this, does it also fail from scp?
dd if=/dev/null of=my_test_under2gb.bin bs=1048576 count=2047

This would create a file which is 1MB less than 2GB. If this transfers without failure, then create another file which is just over 2GB to see if crossing 2GB causes failure:
dd if=/dev/null of=my_test_over2gb.bin bs=1048576 count=2049

maxlapshin · May 4, 2020, 4:23pm

Will test if, thanks.

Also it seems not to happen on vanilla nvidia kernel.

Can you please explain: nvpmodel -m 0 vs -m 2 or -m 3

What is the difference? Which gives most power consumptive and powerful mode for 4GB jetson?

linuxdev · May 4, 2020, 4:54pm

The “-m 0” enables all cores throughout their full clock range. Running this, followed by using “jetson_clocks” to maximize speed within that range will give the best performance possible. I’m not sure what “-m 2” or “-m 3” states do, but those will limit either active core count or clock speeds. Even if you want to limit power consumption you should at least test with “-m 0” and maxed clocks.

maxlapshin · May 5, 2020, 6:27am

Looks that problem was solved when I have changed kernel settings from TREE_RCU to PREEMPT_RCU.

nvpmodel was already -m 2, jetson_clock called