Hi,
Sometimes Tegra A stops responding a few minutes after we launch all of our applications. We suspect that the issue is related to the video capture pipeline, because it still occurs after we narrowed the set of applications running on Tegra A down to just the camera grabber. The grabber captures 9 GMSL cameras.
The time between the launch and the crash varies widely: it can be half an hour or just two minutes.
After the camera grabber stops sending ROS messages, new SSH connections to Tegra A fail and ping can’t reach it. Existing SSH sessions to Tegra A may keep working normally until you launch an application inside them. Tegra B keeps functioning normally.
Other ROS nodes (running on Tegra A) may still function after the camera grabber hangs.
tegrastats doesn’t show any signs of overheating.
After power cycling, the system returns to normal.
During the latest tests, we connected and captured only one camera as a workaround. No crashes occurred during those tests.
Just before a crash, the camera grabber may print NvCapture error messages complaining about the I/O chip to standard output.
I continuously read the system log over SSH by running the following command:
tail -f /var/log/syslog
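To make the hang-related kernel messages easier to spot in that stream, the lines can be filtered with a small script. This is only a minimal sketch, assuming Python 3 is available on the machine that reads the log. The function names `classify` and `summarize` are hypothetical, and the two patterns are based solely on the kernel messages quoted in this post, so other kernels may word them differently.

```python
import re

# Patterns derived only from the kernel messages quoted in this post (assumption:
# other kernel versions may format these messages differently).
RCU_STALL = re.compile(r"INFO: (rcu_\w+) detected stalls on CPUs/tasks")
WQ_LOCKUP = re.compile(r"BUG: workqueue lockup - pool cpus=(\S+) .* stuck for (\d+)s!")


def classify(line):
    """Return a short summary for a hang-related kernel line, or None otherwise."""
    match = RCU_STALL.search(line)
    if match:
        return f"RCU stall reported by {match.group(1)}"
    match = WQ_LOCKUP.search(line)
    if match:
        return f"workqueue lockup on cpus={match.group(1)}, stuck for {match.group(2)}s"
    return None


def summarize(lines):
    """Collect the summaries of all hang-related lines, in their original order."""
    summaries = []
    for line in lines:
        summary = classify(line)
        if summary is not None:
            summaries.append(summary)
    return summaries


# Three sample lines copied from the log in this post.
SAMPLE = [
    "Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384077] INFO: rcu_preempt detected stalls on CPUs/tasks:",
    "Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958574] pending: vmstat_update",
    "Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.943673] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 41s!",
]

for summary in summarize(SAMPLE):
    print(summary)
```

To use it on a live log, the same `classify` function could be applied to each line read from `sys.stdin`, with the output of the `tail` command piped into the script.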
The last lines printed just after a crash are different every time. Here is the end of one of the logs, recorded during one of the attempts:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384077] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384081] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384094] 4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=2268
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384101] 4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=2229
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384104]
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384108]
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384108] (detected by 0, t=5252 jiffies, g=129355, c=129354, q=1088)
Sep 2 16:39:19 pegasus-tegra-a kernel: (detected by 6, t=5252 jiffies, g=248875, c=248874, q=4)
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384139] Task dump for CPU 4:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384142] Task dump for CPU 4:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384150] republish R
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384153] republish R
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384155] running task running task 0 5190 5175 0x0000000a
Sep 2 16:39:19 pegasus-tegra-a kernel: 0 5190 5175 0x0000000a
Sep 2 16:39:19 pegasus-tegra-a kernel: Call trace:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384170] Call trace:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384189] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384199] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384202] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4620.384205] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.943673] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 41s!
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958518] Showing busy workqueues and worker pools:
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958530] workqueue writeback: flags=0x4e
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958532] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/256
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958542] in-flight: 5572:wb_workfn
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958564] workqueue vmstat: flags=0xc
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958566] pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958574] pending: vmstat_update
Sep 2 16:39:19 pegasus-tegra-a kernel: [ 4641.958636] pool 16: cpus=0-7 flags=0x4 nice=0 hung=0s workers=4 idle: 5624 5457 5497
Sep 2 16:39:49 pegasus-tegra-a kernel: [ 4672.663089] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 72s!
Sep 2 16:39:49 pegasus-tegra-a kernel: [ 4672.666833] Showing busy workqueues and worker pools:
Sep 2 16:39:49 pegasus-tegra-a kernel: [ 4672.666880] workqueue vmstat: flags=0xc
Sep 2 16:39:49 pegasus-tegra-a kernel: [ 4672.666882] pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep 2 16:39:49 pegasus-tegra-a kernel: [ 4672.666893] pending: vmstat_update
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402883] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402899] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402900] 4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=9089
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402943]
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402944] 4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=9174
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402948] (detected by 3, t=21007 jiffies, g=129355, c=129354, q=4036)
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402951]
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402951] Task dump for CPU 4:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402957] (detected by 5, t=21007 jiffies, g=248875, c=248874, q=4)
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402964] republish R
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402964] Task dump for CPU 4:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402972] running task <6>[ 4683.402977] republish R
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402978] 0 5190 5175 0x0000000a
Sep 2 16:40:20 pegasus-tegra-a kernel: running task Call trace:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.402986] 0 5190 5175 0x0000000a
Sep 2 16:40:20 pegasus-tegra-a kernel: [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.403000] Call trace:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.403003] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.403013] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4683.403016] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4703.382818] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 103s!
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4703.396666] Showing busy workqueues and worker pools:
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4703.396688] workqueue vmstat: flags=0xc
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4703.396690] pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep 2 16:40:20 pegasus-tegra-a kernel: [ 4703.396699] pending: vmstat_update
Sep 2 16:40:20 pegasus-tegra-a rsyslogd-2007: action 'action 9' suspended, next retry is Mon Sep 2 16:41:50 2019 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Sep 2 16:40:51 pegasus-tegra-a kernel: [ 4734.102233] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 134s!
Sep 2 16:40:51 pegasus-tegra-a kernel: [ 4734.105874] Showing busy workqueues and worker pools:
Sep 2 16:40:51 pegasus-tegra-a kernel: [ 4734.105907] workqueue vmstat: flags=0xc
Sep 2 16:40:51 pegasus-tegra-a kernel: [ 4734.105909] pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep 2 16:40:51 pegasus-tegra-a kernel: [ 4734.105919] pending: vmstat_update
Sep 2 16:41:22 pegasus-tegra-a kernel: [ 4746.422002] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 2 16:41:22 pegasus-tegra-a kernel: [ 4746.422005] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep 2 16:41:22 pegasus-tegra-a kernel: [ 4746.422022] 4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=15751