Minimal rootfs with busybox - error messages

sevm89 · March 6, 2020, 7:26am

Hi NVidia Team

For testing, we run Jetson AGX Xavier modules on our own carrier board with a minimal BusyBox rootfs. We built it following the instructions on this page:
https://elinux.org/Jetson/Busybox_RootFS

We now have some systems that sometimes show the following error message in dmesg:

[   39.080659] INFO: rcu_preempt self-detected stall on CPU
[   39.080835]  0-...: (1 GPs behind) idle=4df/140000000000002/0 softirq=358/358 fqs=2001
[   39.080993]   (t=5250 jiffies g=-106 c=-107 q=242)
[   39.081088] Task dump for CPU 0:
[   39.081164] swapper/0       R  running task        0     0      0 0x00000002
[   39.081300] Call trace:
[   39.081363] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[   39.081458] [<ffffff800808c37c>] show_stack+0x24/0x30
[   39.081557] [<ffffff80080ed4d0>] sched_show_task+0xf8/0x148
[   39.081657] [<ffffff80080f01b0>] dump_cpu_task+0x48/0x58
[   39.081758] [<ffffff80081c282c>] rcu_dump_cpu_stacks+0xb8/0xec
[   39.081860] [<ffffff80081331a8>] rcu_check_callbacks+0x728/0xa48
[   39.081962] [<ffffff8008139a04>] update_process_times+0x34/0x60
[   39.082121] [<ffffff800814af68>] tick_sched_handle.isra.5+0x38/0x70
[   39.082587] [<ffffff800814afec>] tick_sched_timer+0x4c/0x90
[   39.083038] [<ffffff800813a738>] __hrtimer_run_queues+0xd8/0x360
[   39.083457] [<ffffff800813b088>] hrtimer_interrupt+0xa8/0x1e0
[   39.083903] [<ffffff8008be9e00>] arch_timer_handler_phys+0x38/0x58
[   39.088652] INFO: rcu_sched detected stalls on CPUs/tasks:
[   39.088661]  0-...: (2 GPs behind) idle=4df/140000000000002/0 softirq=356/358 fqs=2260
[   39.088669]  (detected by 1, t=5252 jiffies, g=-291, c=-292, q=18)
[   39.088670] Task dump for CPU 0:
[   39.088677] swapper/0       R  running task        0     0      0 0x00000002
[   39.088678] Call trace:
[   39.088699] [<ffffff80080863bc>] __switch_to+0x9c/0xc0
[   39.088702] [<0000000000000002>] 0x2
[   39.130985] [<ffffff8008127c68>] handle_percpu_devid_irq+0x90/0x2b0
[   39.137266] [<ffffff800812224c>] generic_handle_irq+0x34/0x50
[   39.142866] [<ffffff8008122930>] __handle_domain_irq+0x68/0xc0
[   39.148904] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   39.154416] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   39.159317] [<ffffff80081383f8>] call_timer_fn+0x38/0x1e0
[   39.164829] [<ffffff8008138714>] expire_timers+0x144/0x188
[   39.170251] [<ffffff800813889c>] run_timer_softirq+0x144/0x178
[   39.176215] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[   39.181545] [<ffffff80080bb298>] irq_exit+0xd0/0x118
[   39.186526] [<ffffff8008122934>] __handle_domain_irq+0x6c/0xc0
[   39.192391] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   39.197725] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   39.202725] [<ffffff800814b320>] tick_nohz_idle_exit+0xe8/0x118
[   39.208754] [<ffffff8008112cd0>] cpu_startup_entry+0xe8/0x200
[   39.214533] [<ffffff8008f45d8c>] rest_init+0x84/0x90
[   39.219346] [<ffffff80095e0b64>] start_kernel+0x370/0x384
[   39.224776] [<ffffff80095e0204>] __primary_switched+0x80/0x94
[   39.230454] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   39.235916]  0-...: (1 GPs behind) idle=4df/140000000000002/0 softirq=358/358 fqs=2002
[   39.244102]  (detected by 7, t=5252 jiffies, g=-106, c=-107, q=242)
[   39.250576] Task dump for CPU 0:
[   39.253990] swapper/0       R  running task        0     0      0 0x00000002
[   39.260903] Call trace:
[   39.263534] [<ffffff80080863bc>] __switch_to+0x9c/0xc0
[   39.268865] [<0000000000000002>] 0x2
[   42.145760] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   42.166095] usb 1-4.1: new high-speed USB device number 5 using tegra-xusb
[   42.187059] usb 1-4.1: config 1 has an invalid interface number: 8 but max is 4
[   42.187273] usb 1-4.1: config 1 has an invalid interface number: 10 but max is 4
[   42.187430] usb 1-4.1: config 1 has no interface number 1
[   42.187534] usb 1-4.1: config 1 has no interface number 4
[   42.188597] usb 1-4.1: New USB device found, idVendor=1199, idProduct=9071
[   42.188863] usb 1-4.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   42.189016] usb 1-4.1: Product: Sierra Wireless MC7455 Qualcomm® Snapdragon™ X7 LTE-A
[   42.189180] usb 1-4.1: Manufacturer: Sierra Wireless, Incorporated
[   42.189345] usb 1-4.1: SerialNumber: LQ94110315031028
[   42.439491] mmc1: Enabling vmmc regulator
[   42.642570] mmc1: Disabling vmmc regulator
[   43.058357] mmc1: Enabling vmmc regulator
[   43.266395] mmc1: Disabling vmmc regulator
[   45.315458] igb 0004:01:00.0 eth1: igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
[   45.316110] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   45.473147] mmc1: Enabling vmmc regulator
[   51.696562] mmc1: Enabling vmmc regulator
[   51.898650] mmc1: Disabling vmmc regulator
[   55.446304] mmc1: Enabling vmmc regulator
[   55.654330] mmc1: Disabling vmmc regulator
[   57.292723] mmc1: Enabling vmmc regulator
[   57.498670] mmc1: Disabling vmmc regulator
[   57.906422] mmc1: Enabling vmmc regulator
[   58.110297] mmc1: Disabling vmmc regulator
[   60.154323] mmc1: Enabling vmmc regulator
[   60.358306] mmc1: Disabling vmmc regulator
[   64.734335] mmc1: Enabling vmmc regulator
[   65.358345] mmc1: Disabling vmmc regulator
[   66.190512] mmc1: Enabling vmmc regulator
[   66.394690] mmc1: Disabling vmmc regulator
[   69.322394] mmc1: Enabling vmmc regulator
[   69.526405] mmc1: Disabling vmmc regulator
[   78.102333] mmc1: Enabling vmmc regulator
[   78.306389] mmc1: Disabling vmmc regulator
[   80.754440] mmc1: Enabling vmmc regulator
[   80.958389] mmc1: Disabling vmmc regulator
[   85.242473] mmc1: Enabling vmmc regulator
[   85.446336] mmc1: Disabling vmmc regulator
[   87.282478] mmc1: Enabling vmmc regulator
[   87.486520] mmc1: Disabling vmmc regulator
[   98.510555] mmc1: Enabling vmmc regulator
[   98.714462] mmc1: Disabling vmmc regulator
[  104.426351] mmc1: Enabling vmmc regulator
[  104.630411] mmc1: Disabling vmmc regulator
[  104.834572] mmc1: Enabling vmmc regulator
[  105.038331] mmc1: Disabling vmmc regulator
[  108.726747] mmc1: Enabling vmmc regulator
[  108.930351] mmc1: Disabling vmmc regulator
[  110.568492] mmc1: Enabling vmmc regulator
[  110.770725] mmc1: Disabling vmmc regulator
[  113.020346] mmc1: Enabling vmmc regulator
[  113.222566] mmc1: Disabling vmmc regulator
[  150.367067] mmc1: Enabling vmmc regulator
[  150.570801] mmc1: Disabling vmmc regulator
[  165.258476] mmc1: Enabling vmmc regulator
[  165.462926] mmc1: Disabling vmmc regulator

Is there a problem with our rootfs, or what else could lead to these messages?
Thank you for your help.

linuxdev · March 6, 2020, 9:12pm

You will probably need to provide a serial console log. Even if serial console is not enabled in the Linux stage (hopefully it is), the log will still show setup up to the switch to the Linux kernel.

sevm89 · July 28, 2020, 2:03pm

We have a new production run and are doing temperature cycling tests with our custom carrier board and the Jetson AGX Xavier module. We now see strange behavior: CPU stalls at negative temperatures (around -25°C). The dmesg output for these errors is below:

[ 84.552730] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 84.552927] Modules linked in:
[ 84.552997]
[ 84.553042] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.140-tegra #1
[ 84.553151] Hardware name: Jetson-AGX (DT)
[ 84.553237] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 84.553350] PC is at __do_softirq+0xb8/0x3b0
[ 84.553428] LR is at __do_softirq+0x74/0x3b0
[ 84.553514] pc : [] lr : [] pstate: 40400045
[ 84.553641] sp : ffffffc7ffd6df10
[ 84.553716] x29: ffffffc7ffd6df10 x28: ffffff8009e513c0
[ 84.553835] x27: ffffff8009e46000 x26: ffffffc7ffd6e050
[ 84.553946] x25: ffffff8009805018 x24: ffffffc7dc810800
[ 84.554054] x23: ffffff8009e43d30 x22: 0000000000000000
[ 84.554159] x21: 0000000000000000 x20: 0000000000000040
[ 84.554608] x19: ffffff8009e513c0 x18: 0000000000000014
[ 84.555047] x17: 000000000000000e x16: 0000000000000007
[ 84.555490] x15: 0000000000000001 x14: 0000000000000019
[ 84.555934] x13: 0000000000000033 x12: 000000000000004c
[ 84.560897] x11: 0000000000000068 x10: 0000000000000040
[ 84.566501] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 84.572277] x7 : ffffffc7dc400028 x6 : 000000000cbc8e3b
[ 84.577799] x5 : 00ffffffffffffff x4 : 0000000000000015
[ 84.582874] x3 : 0000000000000001 x2 : 00000047f6565000
[ 84.588462] x1 : ffffff800a11d840 x0 : 0000000000000000
[ 84.593797]

[ 178.234739] random: crng init done
[ 348.417616] mmc1: Enabling vmmc regulator
[ 506.799707] mmc1: Disabling vmmc regulator
[ 507.245842] mmc1: Enabling vmmc regulator
[ 537.905194] mmc1: Disabling vmmc regulator
[ 538.769976] mmc1: Enabling vmmc regulator
[ 599.934460] INFO: rcu_preempt self-detected stall on CPU
[ 599.934645] 0-...: (1 GPs behind) idle=8b5/140000000000001/0 softirq=2979/2979 fqs=2624
[ 599.934809] (t=5250 jiffies g=405 c=404 q=3)
[ 599.934902] Task dump for CPU 0:
[ 599.934970] kworker/0:1 R running task 0 733 2 0x00000002
[ 599.935146] Workqueue: events igb_watchdog_task
[ 599.935244] Call trace:
[ 599.935303] [] dump_backtrace+0x0/0x198
[ 599.935401] [] show_stack+0x24/0x30
[ 599.935515] [] sched_show_task+0xf8/0x148
[ 599.935644] [] dump_cpu_task+0x48/0x58
[ 599.935753] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 599.935870] [] rcu_check_callbacks+0x728/0xa48
[ 599.935980] [] update_process_times+0x34/0x60
[ 599.936268] [] tick_sched_handle.isra.5+0x38/0x70
[ 599.936759] [] tick_sched_timer+0x4c/0x90
[ 599.937197] [] __hrtimer_run_queues+0xd8/0x360
[ 599.937652] [] hrtimer_interrupt+0xa8/0x1e0
[ 599.942101] [] arch_timer_handler_phys+0x38/0x58
[ 599.948045] [] handle_percpu_devid_irq+0x90/0x2b0
[ 599.954256] [] generic_handle_irq+0x34/0x50
[ 599.959856] [] __handle_domain_irq+0x68/0xc0
[ 599.965717] [] gic_handle_irq+0x5c/0xb0
[ 599.970795] [] el1_irq+0xe8/0x18c
[ 599.975609] [] igb_rd32+0x30/0xc0
[ 599.980602] [] igb_update_stats+0x6dc/0x8a8
[ 599.986542] [] igb_watchdog_task+0xfc/0x750
[ 599.992333] [] process_one_work+0x1e4/0x4b0
[ 599.997675] [] worker_thread+0x50/0x4c8
[ 600.003178] [] kthread+0xec/0xf0
[ 600.008419] [] ret_from_fork+0x10/0x40
[ 624.582461] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:733]
[ 624.582666] Modules linked in:
[ 624.582745]
[ 624.582802] CPU: 0 PID: 733 Comm: kworker/0:1 Not tainted 4.9.140-tegra #1
[ 624.582926] Hardware name: Jetson-AGX (DT)
[ 624.583025] Workqueue: events igb_watchdog_task
[ 624.583127] task: ffffffc7db15e200 task.stack: ffffffc7da56c000
[ 624.583239] PC is at igb_rd32+0x24/0xc0
[ 624.583315] LR is at igb_update_stats+0x784/0x8a8
[ 624.583405] pc : [] lr : [] pstate: 00c00045
[ 624.583534] sp : ffffffc7da56fc90
[ 624.583604] x29: ffffffc7da56fc90 x28: 0000000000000000
[ 624.583725] x27: 0000000000000000 x26: 00000000000000ea
[ 624.583853] x25: 0000000000003b09 x24: 0000000000000000
[ 624.583962] x23: ffffffc7b65d0de8 x22: ffffffc7b65d0900
[ 624.584345] x21: 00000000000040bc x20: ffffffc7b65d0eb0
[ 624.584776] x19: ffffffc7b65d0900 x18: 000000000000ba7e
[ 624.585212] x17: 000000000000000e x16: 0000000000000000
[ 624.585653] x15: 000000000000a24a x14: 000000000000ba7e
[ 624.590144] x13: 000000000000f6f8 x12: 0000000000000000
[ 624.595753] x11: 000000000000aba8 x10: 0000000000000a20
[ 624.601525] x9 : ffffffc7da56fd10 x8 : ffffffc7db15ec80
[ 624.607293] x7 : 0000000000000000 x6 : 0000000000000000
[ 624.612804] x5 : 0000000000000000 x4 : 0000000000000000
[ 624.618142] x3 : 0000000000000000 x2 : 0000000000000000
[ 624.623480] x1 : 00000000000040bc x0 : ffffff8011600000
[ 624.628586]

We have systems with the exact same configuration and minimal filesystem that go through this temperature cycling without any errors, but roughly 30-40% of all devices show CPU stalls, and only at negative temperatures. Do you have any idea what could stall the CPU? Any help with debugging this is appreciated.
Thank you.

linuxdev · July 28, 2020, 9:59pm

I doubt I can help, but wanted to point out this:
[ 624.583025] Workqueue: events igb_watchdog_task

“IGB” should be the Intel network driver. Whatever you have attached to the board at the time of failure would be important, and if the failure is always from the IGB driver, then it implies the Intel gigabit device probably does not work at this temperature. Is it always IGB failing, or does the failure change? I assume IGB is something you’ve added since the default module does not use an Intel gigabit chip (correct me if I am wrong).
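
To answer the “is it always IGB” question across many units, something like the following could tally where the CPU was executing in each soft-lockup report. This is only a rough sketch in Python, not an NVIDIA tool; the function name `tally_lockup_sites` is made up for illustration, and it assumes the dmesg output has been saved as plain text in the format shown above.

```python
import re
from collections import Counter

# Matches the "PC is at <symbol>+0x.../0x..." line of a soft-lockup report.
PC_RE = re.compile(r"PC is at ([\w.]+)\+0x")
# Matches "Workqueue: events <function>", printed when a kworker was running.
WQ_RE = re.compile(r"Workqueue: \S+ ([\w.]+)")


def tally_lockup_sites(dmesg_text):
    """Count the symbols and workqueue functions named in the reports."""
    pcs = Counter(PC_RE.findall(dmesg_text))
    workqueues = Counter(WQ_RE.findall(dmesg_text))
    return pcs, workqueues


# Lines copied from the log posted above.
sample = """\
[ 84.553350] PC is at __do_softirq+0xb8/0x3b0
[ 599.935146] Workqueue: events igb_watchdog_task
[ 624.583025] Workqueue: events igb_watchdog_task
[ 624.583239] PC is at igb_rd32+0x24/0xc0
"""

pcs, workqueues = tally_lockup_sites(sample)
for symbol, count in pcs.most_common():
    print(f"{count:3d}  PC in {symbol}")
for func, count in workqueues.most_common():
    print(f"{count:3d}  workqueue {func}")
```

If one symbol or workqueue dominates across the failing units, that points at a specific driver; an even spread would suggest the drivers are victims of the stall rather than its cause.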

sevm89 · July 29, 2020, 7:12am

Hi linuxdev

Thank you for your answer.
We have one I210 attached over PCIe, and a second system is connected through this interface. Our test is just a simple check of whether we can reach the A3 system via Ethernet from the second system.
The failure is not always in igb; we also saw messages involving sdhci and others, such as:

[ 56.557212] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 56.557399] Modules linked in:
[ 56.557480]
[ 56.557532] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.140-tegra #1
[ 56.557643] Hardware name: Jetson-AGX (DT)
[ 56.557735] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 56.557858] PC is at led_set_brightness+0x54/0x90
[ 56.557941] LR is at led_trigger_event+0x54/0x80
[ 56.558035] pc : [] lr : [] pstate: 20400045
[ 56.558165] sp : ffffffc7ffd6ddf0
[ 56.558240] x29: ffffffc7ffd6ddf0 x28: 0000000000000007
[ 56.558358] x27: ffffff8009e450f0 x26: ffffff8009e450c0
[ 56.558470] x25: 0000000000000000 x24: ffffff800a11d000
[ 56.558584] x23: 0000000064000008 x22: 0000000000000000
[ 56.558708] x21: ffffffc7d82f9b98 x20: 0000000000000000
[ 56.559147] x19: ffffffc7da4e88b0 x18: 0000000000002cce
[ 56.559587] x17: 0000000000000002 x16: 0000000000000000
[ 56.560025] x15: 0000000000000000 x14: 0000000000000000
[ 56.560455] x13: 00000000000007bd x12: 071c71c71c71c71c
[ 56.565848] x11: 000000000000000b x10: 0000000000000040
[ 56.571545] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 56.577314] x7 : ffffffc7dc400028 x6 : 000000000caae94e
[ 56.582828] x5 : 00ffffffffffffff x4 : 0000000000000001
[ 56.588414] x3 : ffffff8008bb74f8 x2 : 0000000000000054
[ 56.593750] x1 : 0000000000000000 x0 : 0000000000000000
[ 56.599087]

[ 288.576221] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 288.576404] Modules linked in:
[ 288.576479]
[ 288.576530] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.140-tegra #1
[ 288.576645] Hardware name: Jetson-AGX (DT)
[ 288.576722] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 288.576831] PC is at _raw_spin_unlock_irqrestore+0x30/0x60
[ 288.576930] LR is at sdhci_tasklet_finish+0x108/0x190
[ 288.577025] pc : [] lr : [] pstate: 40400045
[ 288.577150] sp : ffffffc7ffd6de70
[ 288.577212] x29: ffffffc7ffd6de70 x28: 0000000000000007
[ 288.577323] x27: ffffff8009e450f0 x26: ffffff8009e450c0
[ 288.577436] x25: 0000000000000000 x24: ffffff800a11d000
[ 288.577543] x23: 0000000064000008 x22: 0000000000000040
[ 288.577802] x21: ffffffc7c81faa38 x20: ffffffc7c81faa38
[ 288.578245] x19: 0000000000000040 x18: 000000000000ba7e
[ 288.578686] x17: 000000000000000e x16: 0000000000000000
[ 288.579126] x15: 0000000000000000 x14: 0000000000000000
[ 288.580641] x13: 00000000000006f1 x12: 071c71c71c71c71c
[ 288.586414] x11: 000000000000000b x10: 0000000000000040
[ 288.592017] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 288.597789] x7 : ffffffc7dc400028 x6 : 000000000cbf59f8
[ 288.603302] x5 : 00ffffffffffffff x4 : 0000000000000001
[ 288.608639] x3 : ffffff8008bb74f8 x2 : 0000000000000054
[ 288.613977] x1 : 0000000000000040 x0 : 0000000000000001
[ 288.619313]
[ 340.796221] INFO: rcu_preempt self-detected stall on CPU
[ 340.796395] 0-...: (2 GPs behind) idle=aa3/140000000000001/0 softirq=912/912 fqs=2611
[ 340.796555] (t=5250 jiffies g=144 c=143 q=21)
[ 340.796647] Task dump for CPU 0:
[ 340.796712] ksoftirqd/0 R running task 0 3 2 0x00000002
[ 340.796847] Call trace:
[ 340.796910] [] dump_backtrace+0x0/0x198
[ 340.797020] [] show_stack+0x24/0x30
[ 340.797115] [] sched_show_task+0xf8/0x148
[ 340.797227] [] dump_cpu_task+0x48/0x58
[ 340.797319] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 340.797422] [] rcu_check_callbacks+0x728/0xa48
[ 340.797523] [] update_process_times+0x34/0x60
[ 340.797642] [] tick_sched_handle.isra.5+0x38/0x70
[ 340.798120] [] tick_sched_timer+0x4c/0x90
[ 340.798556] [] __hrtimer_run_queues+0xd8/0x360
[ 340.798997] [] hrtimer_interrupt+0xa8/0x1e0
[ 340.799475] [] arch_timer_handler_phys+0x38/0x58
[ 340.805142] [] handle_percpu_devid_irq+0x90/0x2b0
[ 340.811092] [] generic_handle_irq+0x34/0x50
[ 340.816693] [] __handle_domain_irq+0x68/0xc0
[ 340.822813] [] gic_handle_irq+0x5c/0xb0
[ 340.828149] [] el1_irq+0xe8/0x18c
[ 340.832963] [] load_balance+0x110/0xa20
[ 340.838471] [] rebalance_domains+0x1a4/0x2c8
[ 340.844423] [] run_rebalance_domains+0x154/0x218
[ 340.850723] [] __do_softirq+0x13c/0x3b0
[ 340.856065] [] run_ksoftirqd+0x48/0x58
[ 340.861577] [] smpboot_thread_fn+0x160/0x248
[ 340.867611] [] kthread+0xec/0xf0
[ 340.872511] [] ret_from_fork+0x10/0x40
[ 340.878027] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 340.883569] 0-...: (2 GPs behind) idle=aa3/140000000000000/0 softirq=912/912 fqs=2611
[ 340.891758] (detected by 4, t=5274 jiffies, g=144, c=143, q=21)
[ 340.897716] Task dump for CPU 0:
[ 340.901297] ksoftirqd/0 R running task 0 3 2 0x00000002
[ 340.908300] Call trace:
[ 340.911016] [] __switch_to+0x9c/0xc0
[ 340.916260] [] 0xffffffc700000001
[ 394.584212] INFO: rcu_preempt self-detected stall on CPU
[ 394.584369] 0-...: (1 GPs behind) idle=0f5/2/0 softirq=926/927 fqs=2625
[ 394.584488] (t=5250 jiffies g=152 c=151 q=4)
[ 394.584575] Task dump for CPU 0:
[ 394.584660] swapper/0 R running task 0 0 0 0x00000002
[ 394.584787] Call trace:
[ 394.584865] [] dump_backtrace+0x0/0x198
[ 394.584964] [] show_stack+0x24/0x30
[ 394.585073] [] sched_show_task+0xf8/0x148
[ 394.585171] [] dump_cpu_task+0x48/0x58
[ 394.585268] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 394.585393] [] rcu_check_callbacks+0x728/0xa48
[ 394.585492] [] update_process_times+0x34/0x60
[ 394.585605] [] tick_sched_handle.isra.5+0x38/0x70
[ 394.585996] [] tick_sched_timer+0x4c/0x90
[ 394.586450] [] __hrtimer_run_queues+0xd8/0x360
[ 394.586885] [] hrtimer_interrupt+0xa8/0x1e0
[ 394.587333] [] arch_timer_handler_phys+0x38/0x58
[ 394.591816] [] handle_percpu_devid_irq+0x90/0x2b0
[ 394.598026] [] generic_handle_irq+0x34/0x50
[ 394.603625] [] __handle_domain_irq+0x68/0xc0
[ 394.609486] [] gic_handle_irq+0x5c/0xb0
[ 394.614823] [] el1_irq+0xe8/0x18c
[ 394.619641] [] sdhci_led_control+0x90/0x108
[ 394.625419] [] led_set_brightness_nopm+0x30/0x58
[ 394.631625] [] led_set_brightness+0x5c/0x90
[ 394.637398] [] led_trigger_event+0x54/0x80
[ 394.643002] [] mmc_request_done+0x3ac/0x3f0
[ 394.648773] [] sdhci_tasklet_finish+0x114/0x190
[ 394.655078] [] tasklet_action+0x70/0x108
[ 394.660241] [] __do_softirq+0x13c/0x3b0
[ 394.666011] [] irq_exit+0xd0/0x118
[ 394.670999] [] __handle_domain_irq+0x6c/0xc0
[ 394.676860] [] gic_handle_irq+0x5c/0xb0
[ 394.682286] [] el1_irq+0xe8/0x18c
[ 394.687191] [] cpuidle_enter_state+0xb8/0x380
[ 394.693227] [] cpuidle_enter+0x34/0x48
[ 394.698652] [] call_cpuidle+0x44/0x70
[ 394.703898] [] cpu_startup_entry+0x1b0/0x200
[ 394.709855] [] rest_init+0x84/0x90
[ 394.714584] [] start_kernel+0x370/0x384
[ 394.720093] [] __primary_switched+0x80/0x94
[ 394.725953] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 394.731736] 0-...: (1 GPs behind) idle=0f5/2/0 softirq=926/927 fqs=2625
[ 394.738460] (detected by 4, t=5288 jiffies, g=152, c=151, q=4)
[ 394.744585] Task dump for CPU 0:
[ 394.747998] swapper/0 R running task 0 0 0 0x00000002
[ 394.755173] Call trace:
[ 394.757803] [] __switch_to+0x9c/0xc0
[ 394.762965] [] cpuidle_enter_state+0xa0/0x380
[ 394.768825] [] cpuidle_enter+0x34/0x48
[ 394.774336] [] call_cpuidle+0x44/0x70
[ 394.779412] [] cpu_startup_entry+0x1b0/0x200
[ 394.785450] [] rest_init+0x84/0x90
[ 394.790349] [] start_kernel+0x370/0x384
[ 394.795864] [] __primary_switched+0x80/0x94
[ 420.576217] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 420.576376] Modules linked in:
[ 420.576465]
[ 420.576507] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.140-tegra #1
[ 420.576632] Hardware name: Jetson-AGX (DT)
[ 420.576708] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 420.576824] PC is at _raw_spin_unlock_irqrestore+0x30/0x60
[ 420.576920] LR is at sdhci_led_control+0x90/0x108
[ 420.577023] pc : [] lr : [] pstate: 00400045
[ 420.577147] sp : ffffffc7ffd6dd70
[ 420.577206] x29: ffffffc7ffd6dd70 x28: 0000000000000007
[ 420.577318] x27: ffffff8009e450f0 x26: ffffff8009e450c0
[ 420.577424] x25: 0000000000000000 x24: ffffff800a11d000
[ 420.577534] x23: 0000000000000000 x22: 0000000000000040
[ 420.577846] x21: ffffffc7c81fa7c0 x20: ffffffc7c81faa38
[ 420.578282] x19: 0000000000000040 x18: 0000000000000000
[ 420.578713] x17: 0000000000000002 x16: 0000000000000003
[ 420.579160] x15: 0000000000000000 x14: 00040000004ffbf7
[ 420.581314] x13: 000000000000001b x12: 071c71c71c71c71c
[ 420.587089] x11: 000000000000000b x10: 0000000000000040
[ 420.592868] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 420.598639] x7 : ffffffc7dc400028 x6 : ffffffc7dc21aa40
[ 420.604153] x5 : 0000000000000004 x4 : 0000000000000000
[ 420.609489] x3 : ffffff8008bb74f8 x2 : 0000000000000028
[ 420.614827] x1 : 0000000000000040 x0 : 0000000000000001
[ 420.620165]
[ 457.644223] INFO: rcu_preempt self-detected stall on CPU
[ 457.644381] 0-...: (1 GPs behind) idle=0f5/2/0 softirq=926/927 fqs=10480
[ 457.644499] (t=21015 jiffies g=152 c=151 q=4)
[ 457.644588] Task dump for CPU 0:
[ 457.644650] swapper/0 R running task 0 0 0 0x00000002
[ 457.644796] Call trace:
[ 457.644854] [] dump_backtrace+0x0/0x198
[ 457.644956] [] show_stack+0x24/0x30
[ 457.645051] [] sched_show_task+0xf8/0x148
[ 457.645149] [] dump_cpu_task+0x48/0x58
[ 457.645245] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 457.645363] [] rcu_check_callbacks+0x728/0xa48
[ 457.645462] [] update_process_times+0x34/0x60
[ 457.645564] [] tick_sched_handle.isra.5+0x38/0x70
[ 457.646054] [] tick_sched_timer+0x4c/0x90
[ 457.646473] [] __hrtimer_run_queues+0xd8/0x360
[ 457.646913] [] hrtimer_interrupt+0xa8/0x1e0
[ 457.647361] [] arch_timer_handler_phys+0x38/0x58
[ 457.652001] [] handle_percpu_devid_irq+0x90/0x2b0
[ 457.658211] [] generic_handle_irq+0x34/0x50
[ 457.663814] [] __handle_domain_irq+0x68/0xc0
[ 457.669674] [] gic_handle_irq+0x5c/0xb0
[ 457.675009] [] el1_irq+0xe8/0x18c
[ 457.679829] [] sdhci_led_control+0x90/0x108
[ 457.685601] [] led_set_brightness_nopm+0x30/0x58
[ 457.691811] [] led_set_brightness+0x5c/0x90
[ 457.697585] [] led_trigger_event+0x54/0x80
[ 457.703188] [] mmc_request_done+0x3ac/0x3f0
[ 457.708702] [] sdhci_tasklet_finish+0x114/0x190
[ 457.715264] [] tasklet_action+0x70/0x108
[ 457.720685] [] __do_softirq+0x13c/0x3b0
[ 457.726197] [] irq_exit+0xd0/0x118
[ 457.731185] [] __handle_domain_irq+0x6c/0xc0
[ 457.737051] [] gic_handle_irq+0x5c/0xb0
[ 457.742473] [] el1_irq+0xe8/0x18c
[ 457.747377] [] cpuidle_enter_state+0xb8/0x380
[ 457.753410] [] cpuidle_enter+0x34/0x48
[ 457.758837] [] call_cpuidle+0x44/0x70
[ 457.764085] [] cpu_startup_entry+0x1b0/0x200
[ 457.770042] [] rest_init+0x84/0x90
[ 457.775028] [] start_kernel+0x370/0x384
[ 457.780537] [] __primary_switched+0x80/0x94
[ 484.576223] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[ 484.576385] Modules linked in:
[ 484.576468]
[ 484.576508] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.140-tegra #1
[ 484.576634] Hardware name: Jetson-AGX (DT)
[ 484.576708] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 484.576818] PC is at _raw_spin_unlock_irqrestore+0x30/0x60
[ 484.576914] LR is at sdhci_led_control+0x90/0x108
[ 484.576994] pc : [] lr : [] pstate: 00400045
[ 484.577116] sp : ffffffc7ffd6dd70
[ 484.577178] x29: ffffffc7ffd6dd70 x28: 0000000000000007
[ 484.577288] x27: ffffff8009e450f0 x26: ffffff8009e450c0
[ 484.577388] x25: 0000000000000000 x24: ffffff800a11d000
[ 484.577489] x23: 0000000000000000 x22: 0000000000000040
[ 484.577857] x21: ffffffc7c81fa7c0 x20: ffffffc7c81faa38
[ 484.578300] x19: 0000000000000040 x18: 0000000000000000
[ 484.578736] x17: 0000000000000002 x16: 0000000000000003
[ 484.579173] x15: 0000000000000000 x14: 00040000004ffbf7
[ 484.581321] x13: 000000000000001b x12: 071c71c71c71c71c
[ 484.587098] x11: 000000000000000b x10: 0000000000000040
[ 484.592622] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 484.598648] x7 : ffffffc7dc400028 x6 : ffffffc7dc21aa40
[ 484.604160] x5 : 0000000000000004 x4 : 0000000000000000
[ 484.609495] x3 : ffffff8008bb74f8 x2 : 0000000000000028
[ 484.614833] x1 : 0000000000000040 x0 : 0000000000000001
[ 484.620171]
[ 520.576229] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 520.576392] Modules linked in:
[ 520.576470]
[ 520.576511] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.140-tegra #1
[ 520.576641] Hardware name: Jetson-AGX (DT)
[ 520.576715] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 520.576827] PC is at __do_softirq+0xb8/0x3b0
[ 520.576901] LR is at __do_softirq+0x74/0x3b0
[ 520.576976] pc : [] lr : [] pstate: 40400045
[ 520.577096] sp : ffffffc7ffd6df10
[ 520.577160] x29: ffffffc7ffd6df10 x28: ffffff8009e513c0
[ 520.577263] x27: ffffff8009e46000 x26: ffffffc7ffd6e050
[ 520.577368] x25: ffffff8009805018 x24: ffffffc7dc810800
[ 520.577477] x23: ffffff8009e43d30 x22: 0000000000000000
[ 520.577771] x21: 0000000000000000 x20: 0000000000000040
[ 520.578201] x19: ffffff8009e513c0 x18: 0000000000000006
[ 520.578645] x17: 000000000000000e x16: 0000000000000007
[ 520.579072] x15: 0000000000000001 x14: 0000000000000019
[ 520.580105] x13: 0000000000000033 x12: 000000000000004c
[ 520.585614] x11: 0000000000000068 x10: 0000000000000040
[ 520.591217] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 520.596989] x7 : ffffffc7dc400028 x6 : 000000000cbf59f8
[ 520.602502] x5 : 00ffffffffffffff x4 : 0000000000000015
[ 520.607839] x3 : 0000000000000001 x2 : 00000047f6565000
[ 520.613177] x1 : ffffff800a11d840 x0 : 0000000000000000
[ 520.618514]
[ 604.576227] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[ 604.576392] Modules linked in:
[ 604.576477]
[ 604.576521] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.140-tegra #1
[ 604.576653] Hardware name: Jetson-AGX (DT)
[ 604.576726] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 604.576838] PC is at __do_softirq+0xb8/0x3b0
[ 604.576916] LR is at __do_softirq+0x74/0x3b0
[ 604.576992] pc : [] lr : [] pstate: 40400045
[ 604.577112] sp : ffffffc7ffd6df10
[ 604.577172] x29: ffffffc7ffd6df10 x28: ffffff8009e513c0
[ 604.577288] x27: ffffff8009e46000 x26: ffffffc7ffd6e050
[ 604.577391] x25: ffffff8009805018 x24: ffffffc7dc810800
[ 604.577492] x23: ffffff8009e43d30 x22: 0000000000000000
[ 604.577760] x21: 0000000000000000 x20: 0000000000000040
[ 604.578202] x19: ffffff8009e513c0 x18: 0000000000000014
[ 604.578630] x17: 000000000000000e x16: 0000000000000007
[ 604.579067] x15: 0000000000000001 x14: 0000000000000019
[ 604.579847] x13: 0000000000000033 x12: 000000000000004c
[ 604.585616] x11: 0000000000000068 x10: 0000000000000040
[ 604.591215] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 604.596988] x7 : ffffffc7dc400028 x6 : 000000000cbf59f8
[ 604.602501] x5 : 00ffffffffffffff x4 : 0000000000000015
[ 604.607838] x3 : 0000000000000001 x2 : 00000047f6565000
[ 604.613176] x1 : ffffff800a11d840 x0 : 0000000000000000
[ 604.618513]
[ 639.478375] bpmp: mrq 27 took 3996000 us
[ 639.485644] TCP: request_sock_TCP: Possible SYN flooding on port 26. Dropping request. Check SNMP counters.
[ 685.860213] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 685.860226] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 685.860241] 0-...: (1 ticks this GP) idle=9fd/2/0 softirq=946/946 fqs=2626
[ 685.860248] (detected by 4, t=5252 jiffies, g=163, c=162, q=10)
[ 685.860250] Task dump for CPU 0:
[ 685.860258] swapper/0 R running task 0 0 0 0x00000002
[ 685.860261] Call trace:
[ 685.860288] [] __switch_to+0x9c/0xc0
[ 685.860295] [] cpuidle_enter_state+0xa0/0x380
[ 685.860298] [] cpuidle_enter+0x34/0x48
[ 685.860303] [] call_cpuidle+0x44/0x70
[ 685.860307] [] cpu_startup_entry+0x1b0/0x200
[ 685.860313] [] rest_init+0x84/0x90
[ 685.860317] [] start_kernel+0x370/0x384
[ 685.860322] [] __primary_switched+0x80/0x94
[ 685.862459] 0-...: (1 ticks this GP) idle=9fd/2/0 softirq=946/946 fqs=2620
[ 685.862965] (detected by 7, t=5252 jiffies, g=-278, c=-279, q=1)
[ 685.863447] Task dump for CPU 0:
[ 685.866614] swapper/0 R running task 0 0 0 0x00000002
[ 685.873527] Call trace:
[ 685.876431] [] __switch_to+0x9c/0xc0
[ 685.881322] [] cpuidle_enter_state+0xa0/0x380
[ 685.887526] [] cpuidle_enter+0x34/0x48
[ 685.892606] [] call_cpuidle+0x44/0x70
[ 685.898115] [] cpu_startup_entry+0x1b0/0x200
[ 685.903718] [] rest_init+0x84/0x90
[ 685.908964] [] start_kernel+0x370/0x384
[ 685.914135] [] __primary_switched+0x80/0x94
[ 692.576210] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[ 692.576366] Modules linked in:
[ 692.576448]
[ 692.576487] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.140-tegra #1
[ 692.576611] Hardware name: Jetson-AGX (DT)
[ 692.576682] task: ffffff8009e513c0 task.stack: ffffff8009e40000
[ 692.576790] PC is at ip_finish_output2+0x230/0x410
[ 692.576873] LR is at ip_finish_output+0x128/0x200
[ 692.576949] pc : [] lr : [] pstate: 60400045
[ 692.577077] sp : ffffffc7ffd6d680
[ 692.577136] x29: ffffffc7ffd6d680 x28: ffffffc7db043000
[ 692.577245] x27: ffffffc7b48fdcf0 x26: ffffffc7db042910
[ 692.577346] x25: ffffff8009e46000 x24: ffffffc7b48e40e0
[ 692.577447] x23: ffffffc7b48e40d0 x22: 000000000000000e
[ 692.577809] x21: ffffffc7db0428e8 x20: 0000000000000000
[ 692.578231] x19: ffffffc7b48e4000 x18: 0000000000000016
[ 692.578671] x17: 0000000000000002 x16: 0000000000000003
[ 692.579117] x15: ffffffc7db062e28 x14: 0100d32d4c5c0a08
[ 692.580913] x13: 0000000000006b78 x12: 00000000000005a8
[ 692.586426] x11: ffffff80080a1ea8 x10: 000000006e9293c9
[ 692.592202] x9 : 00000000916dff47 x8 : 0000000000000000
[ 692.597976] x7 : 0000000000834987 x6 : 0000000000000003
[ 692.603489] x5 : 0000000082cb8f6f x4 : ffffffc7d9d54980
[ 692.608826] x3 : 00084fea0110a000 x2 : ba1a0110a0000000
[ 692.614164] x1 : 00000000000000c2 x0 : 0000000000000000
[ 692.619499]
[ 722.592214] INFO: rcu_preempt self-detected stall on CPU
[ 722.592364] 0-...: (1 GPs behind) idle=aa9/2/0 softirq=946/947 fqs=2618
[ 722.592488] (t=5250 jiffies g=164 c=163 q=28)
[ 722.592574] Task dump for CPU 0:
[ 722.592660] swapper/0 R running task 0 0 0 0x00000002
[ 722.592787] Call trace:
[ 722.592848] [] dump_backtrace+0x0/0x198
[ 722.592941] [] show_stack+0x24/0x30
[ 722.593036] [] sched_show_task+0xf8/0x148
[ 722.593134] [] dump_cpu_task+0x48/0x58
[ 722.593233] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 722.593336] [] rcu_check_callbacks+0x728/0xa48
[ 722.593459] [] update_process_times+0x34/0x60
[ 722.593563] [] tick_sched_handle.isra.5+0x38/0x70
[ 722.594010] [] tick_sched_timer+0x4c/0x90
[ 722.594446] [] __hrtimer_run_queues+0xd8/0x360
[ 722.594904] [] hrtimer_interrupt+0xa8/0x1e0
[ 722.595335] [] arch_timer_handler_phys+0x38/0x58
[ 722.599909] [] handle_percpu_devid_irq+0x90/0x2b0
[ 722.606114] [] generic_handle_irq+0x34/0x50
[ 722.611711] [] __handle_domain_irq+0x68/0xc0
[ 722.617574] [] gic_handle_irq+0x5c/0xb0
[ 722.622911] [] el1_irq+0xe8/0x18c
[ 722.627728] [] irq_exit+0xd0/0x118
[ 722.632710] [] __handle_domain_irq+0x6c/0xc0
[ 722.638661] [] gic_handle_irq+0x5c/0xb0
[ 722.644174] [] el1_irq+0xe8/0x18c
[ 722.648993] [] cpuidle_enter_state+0xb8/0x380
[ 722.655023] [] cpuidle_enter+0x34/0x48
[ 722.660539] [] call_cpuidle+0x44/0x70
[ 722.665786] [] cpu_startup_entry+0x1b0/0x200
[ 722.671743] [] rest_init+0x84/0x90
[ 722.676729] [] start_kernel+0x370/0x384
[ 722.682240] [] __primary_switched+0x80/0x94
[ 722.687844] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 722.687855] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 722.687885] 0-...: (1 GPs behind) idle=aa9/2/0 softirq=946/947 fqs=2618
[ 722.687892] (detected by 3, t=5273 jiffies, g=-277, c=-278, q=4)
[ 722.687894] Task dump for CPU 0:
[ 722.687900] swapper/0 R running task 0 0 0 0x00000002
[ 722.687903] Call trace:
[ 722.687911] [] __switch_to+0x9c/0xc0
[ 722.687917] [] cpuidle_enter_state+0xa0/0x380
[ 722.687921] [] cpuidle_enter+0x34/0x48
[ 722.687924] [] call_cpuidle+0x44/0x70
[ 722.687941] [] cpu_startup_entry+0x1b0/0x200
[ 722.687945] [] rest_init+0x84/0x90
[ 722.687950] [] start_kernel+0x370/0x384
[ 722.687966] [] __primary_switched+0x80/0x94
[ 722.768614] 0-...: (1 GPs behind) idle=aa9/2/0 softirq=946/947 fqs=2628
[ 722.775165] (detected by 5, t=5296 jiffies, g=164, c=163, q=28)
[ 722.781372] Task dump for CPU 0:
[ 722.784527] swapper/0 R running task 0 0 0 0x00000002
[ 722.791702] Call trace:
[ 722.794589] [] __switch_to+0x9c/0xc0
[ 722.799752] [] cpuidle_enter_state+0xa0/0x380
[ 722.805700] [] cpuidle_enter+0x34/0x48
[ 722.811038] [] call_cpuidle+0x44/0x70
[ 722.816287] [] cpu_startup_entry+0x1b0/0x200
[ 722.822151] [] rest_init+0x84/0x90
[ 722.826878] [] start_kernel+0x370/0x384
[ 722.832561] [] __primary_switched+0x80/0x94

Could the CPU stall be causing these igb errors, because the device's workqueue cannot be serviced? We have many other systems (including x86) with working IGB I210 devices, on which we have never seen a problem at low temperatures.
Any suggestions on what we could try?

linuxdev · July 29, 2020, 7:03pm

It is possible that other errors cause something else to fail to respond, but from the error messages it is hard to say what the original cause is.

This one does look interesting:

Does the Jetson work for a while before the failures start? If you have some time, then perhaps on a serial console, or through some other method which preserves the last output, run “watch -n 1 ifconfig ...that ethernet device name...”

For example, if the device name of your ethernet is “eth1”, then:
watch -n 1 ifconfig eth1
…then post the last result prior to failure.
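
If the ifconfig output scrolls away or is hard to capture, the same counters can be read straight from sysfs. Below is a minimal sketch, assuming Python 3 is available on the target (likely on a stock JetPack rootfs, probably not on a busybox-only one) and that the interface is called "eth1" as in the example above. The function names are made up for illustration; only the /sys/class/net/<iface>/statistics directory is standard Linux.

```python
import sys
import time
from pathlib import Path


def read_counters(stats_dir):
    """Return {counter_name: value} for every numeric file in a statistics directory."""
    counters = {}
    for entry in sorted(Path(stats_dir).iterdir()):
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return counters


def changed(prev, cur):
    """Return only the counters whose value differs from the previous sample."""
    return {name: value for name, value in cur.items() if prev.get(name) != value}


def log_counters(iface="eth1", interval=1.0, samples=None, out=sys.stdout):
    """Print changed counters every `interval` seconds (forever if samples is None)."""
    stats_dir = f"/sys/class/net/{iface}/statistics"  # assumption: interface name
    prev = {}
    taken = 0
    while samples is None or taken < samples:
        cur = read_counters(stats_dir)
        # flush=True so the last sample is visible on the console if the system hangs
        print(f"{time.time():.1f} {changed(prev, cur)}", file=out, flush=True)
        prev = cur
        taken += 1
        time.sleep(interval)
```

Calling log_counters("eth1") on the serial console prints one line per second containing only the counters that moved, so the last lines before a failure show which error or drop counters were increasing.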

There are certain network misconfiguration issues which could cause an excess of work, and that might find a weakness resulting in something similar to what you are seeing even if it isn’t an outright bug or hardware failure. The mention of SYN flooding implies this is possibly an ethernet data driven issue. It would be nice to see what kind of network statistics show up prior to failure if normal operation works for a short time prior to failure.

WayneWWW · August 5, 2020, 4:34am

Hi,

I don’t think you need to follow the method from that page, since it is for the TK1. That page has probably not been verified for about 4 years.

What is your requirement for “minimal rootfs”? I would still suggest using the original rootfs.

sevm89 · August 6, 2020, 8:43am

Hi WayneWWW

We have now run the same test with JetPack 4.2.2, and here too we see errors at -25°C. The following appears in dmesg:

[ 34.805127] INFO: rcu_preempt self-detected stall on CPU
[ 34.805349] 0-...: (1 GPs behind) idle=d61/140000000000002/0 softirq=1990/1990 fqs=2242
[ 34.805502] (t=5250 jiffies g=287 c=286 q=7045)
[ 34.805602] Task dump for CPU 0:
[ 34.805611] kworker/0:1H R running task 0 2380 2 0x00000002
[ 34.805656] Call trace:
[ 34.805678] [] dump_backtrace+0x0/0x198
[ 34.805700] [] show_stack+0x24/0x30
[ 34.805710] [] sched_show_task+0xf8/0x148
[ 34.805716] [] dump_cpu_task+0x48/0x58
[ 34.805725] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 34.805733] [] rcu_check_callbacks+0x728/0xa48
[ 34.805740] [] update_process_times+0x34/0x60
[ 34.805761] [] tick_sched_handle.isra.5+0x38/0x70
[ 34.805766] [] tick_sched_timer+0x4c/0x90
[ 34.805784] [] __hrtimer_run_queues+0xd8/0x360
[ 34.805789] [] hrtimer_interrupt+0xa8/0x1e0
[ 34.805809] [] arch_timer_handler_phys+0x38/0x58
[ 34.805816] [] handle_percpu_devid_irq+0x90/0x2b0
[ 34.805821] [] generic_handle_irq+0x34/0x50
[ 34.805826] [] __handle_domain_irq+0x68/0xc0
[ 34.805831] [] gic_handle_irq+0x5c/0xb0
[ 34.805835] [] el1_irq+0xe8/0x18c
[ 34.805843] [] bio_endio+0x98/0xa8
[ 34.805849] [] blk_update_request+0xac/0x3d0
[ 34.805853] [] blk_update_bidi_request+0x38/0xa0
[ 34.805858] [] blk_end_bidi_request+0x40/0x98
[ 34.805862] [] blk_end_request+0x38/0x48
[ 34.805870] [] mmc_blk_cmdq_complete_rq+0x13c/0x1c8
[ 34.805874] [] mmc_cmdq_softirq_done+0x2c/0x38
[ 34.805880] [] blk_done_softirq+0x88/0xa0
[ 34.805884] [] __do_softirq+0x13c/0x3b0
[ 34.805892] [] irq_exit+0xd0/0x118
[ 34.805896] [] __handle_domain_irq+0x6c/0xc0
[ 34.805899] [] gic_handle_irq+0x5c/0xb0
[ 34.805904] [] el1_irq+0xe8/0x18c
[ 34.805912] [] _raw_spin_unlock_irq+0x28/0x58
[ 34.805917] [] finish_task_switch+0x7c/0x1a8
[ 34.805924] [] __schedule+0x274/0x780
[ 34.805928] [] schedule+0x40/0xa8
[ 34.805935] [] worker_thread+0xd0/0x4c8
[ 34.805941] [] kthread+0xec/0xf0
[ 34.805946] [] ret_from_fork+0x10/0x40
[ 34.813124] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 34.813363] 0-...: (1 GPs behind) idle=d61/140000000000002/0 softirq=1990/1990 fqs=2263
[ 34.813524] (detected by 5, t=5252 jiffies, g=-59, c=-60, q=172)
[ 34.813661] Task dump for CPU 0:
[ 34.813670] kworker/0:1H R running task 0 2380 2 0x00000002
[ 34.813719] Call trace:
[ 34.813751] [] __switch_to+0x9c/0xc0
[ 34.813770] [] bp_hardening_data+0x0/0x10

Other error messages:

[ 5.764532] mmc1: Enabling vmmc regulator
[ 6.584898] mmc1: Disabling vmmc regulator
[ 6.788669] mmc1: Enabling vmmc regulator
[ 6.996811] mmc1: Disabling vmmc regulator
[ 29.694728] INFO: rcu_preempt self-detected stall on CPU
[ 29.694922] 0-...: (1 GPs behind) idle=465/140000000000002/0 softirq=300/300 fqs=2214
[ 29.695064] (t=5250 jiffies g=-140 c=-141 q=69)
[ 29.695163] Task dump for CPU 0:
[ 29.695233] watchdog/0 R running task 0 12 2 0x00000002
[ 29.695389] Call trace:
[ 29.695458] [] dump_backtrace+0x0/0x198
[ 29.695562] [] show_stack+0x24/0x30
[ 29.695663] [] sched_show_task+0xf8/0x148
[ 29.695766] [] dump_cpu_task+0x48/0x58
[ 29.695870] [] rcu_dump_cpu_stacks+0xb8/0xec
[ 29.695980] [] rcu_check_callbacks+0x728/0xa48
[ 29.696089] [] update_process_times+0x34/0x60
[ 29.696198] [] tick_sched_handle.isra.5+0x38/0x70
[ 29.696656] [] tick_sched_timer+0x4c/0x90
[ 29.697089] [] __hrtimer_run_queues+0xd8/0x360
[ 29.697546] [] hrtimer_interrupt+0xa8/0x1e0
[ 29.697976] [] arch_timer_handler_phys+0x38/0x58
[ 29.703832] [] handle_percpu_devid_irq+0x90/0x2b0
[ 29.709785] [] generic_handle_irq+0x34/0x50
[ 29.715384] [] __handle_domain_irq+0x68/0xc0
[ 29.721503] [] gic_handle_irq+0x5c/0xb0
[ 29.726582] [] el1_irq+0xe8/0x18c
[ 29.731664] [] irq_exit+0xd0/0x118
[ 29.736653] [] __handle_domain_irq+0x6c/0xc0
[ 29.742589] [] gic_handle_irq+0x5c/0xb0
[ 29.748101] [] el1_irq+0xe8/0x18c
[ 29.752919] [] _raw_spin_unlock_irq+0x28/0x58
[ 29.758952] [] finish_task_switch+0x7c/0x1a8
[ 29.764992] [] __schedule+0x274/0x780
[ 29.770152] [] schedule+0x40/0xa8
[ 29.775143] [] smpboot_thread_fn+0x238/0x248
[ 29.781176] [] kthread+0xec/0xf0
[ 29.786164] [] ret_from_fork+0x10/0x40
[ 29.791504] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 29.797199] 0-...: (1 GPs behind) idle=465/140000000000002/0 softirq=300/300 fqs=2215
[ 29.805239] (detected by 4, t=5277 jiffies, g=-140, c=-141, q=69)
[ 29.811626] Task dump for CPU 0:
[ 29.814692] watchdog/0 R running task 0 12 2 0x00000002
[ 29.821955] Call trace:
[ 29.824844] [] __switch_to+0x9c/0xc0
[ 29.829752] [] bp_hardening_data+0x0/0x10
[ 29.879121] tegra-se-nvhost 15810000.se: tegra_se_probe: complete
[ 29.922843] tegra-se-nvhost 15820000.se: initialized
[ 29.923956] tegra-se-nvhost 15820000.se: tegra_se_probe: complete
[ 29.966833] tegra-se-nvhost 15830000.se: initialized
[ 29.967401] tegra-se-nvhost 15830000.se: tegra_se_probe: complete
[ 30.014820] tegra-se-nvhost 15840000.se: initialized
[ 30.015668] tegra-se-nvhost 15840000.se: tegra_se_probe: complete
[ 30.016071] hidraw: raw HID events driver (C) Jiri Kosina
[ 30.017191] usbcore: registered new interface driver usbhid
[ 30.017316] usbhid: USB HID core driver
[ 30.122978] tegra186-cam-rtcpu bc00000.rtcpu: deferring, 14800000.isp is not probed
[ 30.124188] tegra_aon c1a0000.aon: tegra aon driver probe OK
[ 30.124742] tegra186-aondbg aondbg: aondbg driver probe() OK
[ 30.125298] denver_knobs_init:MTS_VERSION:45309758
[ 30.125616] tegra19x_actmon d230000.actmon: in actmon_register()...
[ 30.135369] tegra19x_actmon d230000.actmon: initialization Completed for the device mc_all
[ 30.135882] t19x_cache tegra-cache: probed
[ 30.155492] misc nvmap: cvsram :dma coherent mem declare 0x0000000050000000,4194304
[ 30.155646] misc nvmap: created heap cvsram base 0x0000000050000000 size (4096KiB)
[ 30.164188] nvpmodel: initialized successfully
[ 56.570744] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [watchdog/0:12]
[ 56.570931] Modules linked in:

[ 56.571065] CPU: 0 PID: 12 Comm: watchdog/0 Not tainted 4.9.140-tegra #1
[ 56.571181] Hardware name: Jetson-AGX (DT)
[ 56.571260] task: ffffffc7dc041c00 task.stack: ffffffc7dc04c000
[ 56.571374] PC is at __do_softirq+0xb8/0x3b0
[ 56.571476] LR is at __do_softirq+0x74/0x3b0
[ 56.571559] pc : [] lr : [] pstate: 40400045
[ 56.571688] sp : ffffffc7ffd6df10
[ 56.571751] x29: ffffffc7ffd6df10 x28: ffffffc7dc041c00
[ 56.571873] x27: 0000000000000000 x26: ffffffc7ffd6e050
[ 56.571988] x25: ffffff8009805018 x24: ffffffc7dc810800
[ 56.572094] x23: ffffffc7dc04fbb0 x22: 0000000000000000
[ 56.572236] x21: 0000000000000000 x20: 00000000000002c2
[ 56.572678] x19: ffffffc7dc041c00 x18: 0000000000000000
[ 56.573108] x17: 000000000000000e x16: 0000000000000007
[ 56.573555] x15: 0000000000000000 x14: 0000000001647005
[ 56.573991] x13: 0000000000000000 x12: 00000000000003a2
[ 56.579286] x11: ffffff8008f699d0 x10: 0000000000000040
[ 56.584886] x9 : ffffff8009e64440 x8 : ffffffc7dc400000
[ 56.590661] x7 : ffffffc7dc400028 x6 : 000000000cc48d97
[ 56.596174] x5 : 00ffffffffffffff x4 : 0000000000000015
[ 56.601511] x3 : 0000000000000002 x2 : 00000047f6565000
[ 56.606848] x1 : ffffff800a11d840 x0 : 0000000000000000

[ 77.990033] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 71s!
[ 78.007527] Showing busy workqueues and worker pools:
[ 78.008766] workqueue events: flags=0x0
[ 78.010970] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 78.011145] in-flight: 1179:request_firmware_work_func
[ 78.011258] pending: push_to_pool
[ 78.012394] workqueue events_freezable: flags=0x4
[ 78.015768] pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=2/256
[ 78.015928] in-flight: 751:mmc_rescan mmc_rescan
[ 78.016298] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 78.016457] in-flight: 985:mmc_rescan
[ 78.022738] workqueue usb_hub_wq: flags=0x4
[ 78.026830] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 78.026980] in-flight: 4:hub_event
[ 78.029457] workqueue vmstat: flags=0xc
[ 78.031273] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 78.031454] pending: vmstat_update
[ 78.038141] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=72s workers=4 idle: 1715 811
[ 78.038527] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 18 1727
[ 78.039059] pool 14: cpus=7 node=0 flags=0x0 nice=0 hung=46s workers=3 idle: 1761 54

So this seems to be a general problem and not a problem of the minimal rootfs. When we swapped the Jetson AGX Xavier Module of a system that showed this behavior with one from a system that did not, the error followed the module. This makes us believe that the problem lies not in our carrier board but in the Jetson AGX Xavier Module itself. Any suggestions on what we could try? What could lead to these stalls at negative temperatures?
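
To compare runs at different temperatures, it may help to reduce each dmesg capture to a list of RCU stall reports. The following is a rough sketch, not a validated tool. The HZ value of 250 is an assumption about this kernel's CONFIG_HZ; if it holds, t=5250 jiffies is 21 seconds, which would line up with the usual default RCU stall timeout. The function name and result fields are made up for illustration.

```python
import re

HZ = 250  # assumption: CONFIG_HZ of the running kernel, check the kernel config

# "INFO: rcu_preempt self-detected stall on CPU" / "INFO: rcu_sched detected stalls on ..."
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+INFO: (rcu_\w+) (self-detected stall|detected stalls)"
)
# "(t=5250 jiffies g=164 ...)" / "(detected by 3, t=5273 jiffies, g=-277, ...)"
JIFFIES_RE = re.compile(r"\((?:detected by (\d+), )?t=(\d+) jiffies")


def summarize_stalls(dmesg_text, hz=HZ):
    """Return one dict per RCU stall report found in a dmesg dump."""
    stalls = []
    pending = []  # reports still waiting for their "t=... jiffies" line
    for line in dmesg_text.splitlines():
        m = STALL_RE.search(line)
        if m:
            report = {
                "time": float(m.group(1)),
                "flavor": m.group(2),
                "self_detected": m.group(3).startswith("self"),
                "detected_by": None,
                "stalled_s": None,
            }
            stalls.append(report)
            pending.append(report)
            continue
        m = JIFFIES_RE.search(line)
        if m and pending:
            # Reports from different CPUs can interleave; attach the detail
            # line to the most recent report that does not have one yet.
            report = pending.pop()
            if m.group(1) is not None:
                report["detected_by"] = int(m.group(1))
            report["stalled_s"] = int(m.group(2)) / hz
    return stalls
```

Feeding it the saved output of dmesg from each test run gives a short list (timestamp, RCU flavor, detecting CPU, stall duration) that is easier to compare between modules than the full call traces.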

sevm89 · August 10, 2020, 1:37pm

Any news on this?
We also ran the tests with “jetson_clocks” running, with the same outcome.

WayneWWW · August 13, 2020, 6:21am

Hi,

Could you check it by putting a pure JetPack release + NVIDIA devkit directly into a -25 degree environment?

linuxdev · August 13, 2020, 4:16pm

FYI, ifconfig information would still be useful since this shows various classifications of network errors. If the network itself has issues, then it would make sense to see the error regardless of most changes.

sevm89 · August 14, 2020, 8:37am

We have tested the behavior with the Jetson AGX Xavier DevKit and a module that showed the errors. We did not see any CPU stalls.
We have now adapted our device tree and disabled sdhci@3400000, because we came across the following thread:

That really seems to improve the behavior at negative temperatures. We need more testing to see if it totally solves it. Could you explain why enabling the SD card leads to this issue? As we have an SD card holder on our design, it is not really an option for us to permanently disable the interface. Could another specific configuration of the sdhci lead to this?
Thank you.
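
When repeating such tests, it can be worth confirming on the running system that the device-tree change actually took effect. A small sketch, assuming the live tree is exposed under /proc/device-tree and that the node sits at the top level as sdhci@3400000 (the exact path on a given release is an assumption); node_status is a made-up helper name.

```python
from pathlib import Path


def node_status(node, dt_root="/proc/device-tree"):
    """Return the 'status' property of a device-tree node ('okay' if absent)."""
    node_dir = Path(dt_root) / node
    if not node_dir.is_dir():
        raise FileNotFoundError(f"no such device-tree node: {node}")
    status_file = node_dir / "status"
    if not status_file.exists():
        return "okay"  # a node without a status property counts as enabled
    # Property values are NUL-terminated strings, e.g. b"disabled\x00"
    return status_file.read_bytes().rstrip(b"\x00").decode("ascii")
```

For example, node_status("sdhci@3400000") should return "disabled" after booting with the modified device tree.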

WayneWWW · August 14, 2020, 8:50am

Hi sevm89,

That was not a real debugging step or solution. I just saw that his sdmmc was spewing lots of errors, so I asked him to disable it first.

Could you explain why enabling the SD card leads to this issue?

And I cannot explain it to you either. If the devkit + module work fine in a -25 degree environment, it means this issue only happens on your board.

In that case, please paste your carrier board schematic here for us to review. Also, if you confirm this issue is 100% resolved after disabling sdmmc1, you could share only the sdmmc part of the schematic.
