Tegra A crashes when using multiple cameras

Hi,

Sometimes Tegra A stops responding within several minutes after we launch all of our applications. We assume the issue is related to the video capture pipeline, since we narrowed the set of applications running on Tegra A down to just the camera grabber. The grabber captures 9 GMSL cameras.

The time between launch and crash varies: it can be half an hour or just two minutes.

After the camera grabber stops sending ROS messages, no new SSH connections to Tegra A are possible and ping can’t reach it. Existing SSH sessions to Tegra A may keep working until you launch an application inside them. Tegra B keeps functioning normally.

Other ROS nodes (running on Tegra A) may still function after the camera grabber hangs.

tegrastats doesn’t show any signs of overheating.

After power-cycling the unit, the system gets back to normal.

During the latest tests, we connected and captured only one camera as a workaround. No crashes were observed during those tests.

Just before a crash, the camera grabber may print NvCapture error messages to standard output complaining about the I/O chip.

I continuously read the system log over SSH by running the following command:

tail -f /var/log/syslog

The last log lines printed before each crash are always different. Here is the end of one such log, recorded during one of the attempts:

Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384077] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384081] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384094] 	4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=2268 
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384101] 	4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=2229 
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384104] 	
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384108] 	
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384108] (detected by 0, t=5252 jiffies, g=129355, c=129354, q=1088)
Sep  2 16:39:19 pegasus-tegra-a kernel: (detected by 6, t=5252 jiffies, g=248875, c=248874, q=4)
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384139] Task dump for CPU 4:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384142] Task dump for CPU 4:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384150] republish       R
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384153] republish       R
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384155]   running task      running task        0  5190   5175 0x0000000a
Sep  2 16:39:19 pegasus-tegra-a kernel:    0  5190   5175 0x0000000a
Sep  2 16:39:19 pegasus-tegra-a kernel: Call trace:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384170] Call trace:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384189] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384199] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384202] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4620.384205] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.943673] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 41s!
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958518] Showing busy workqueues and worker pools:
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958530] workqueue writeback: flags=0x4e
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958532]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/256
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958542]     in-flight: 5572:wb_workfn
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958564] workqueue vmstat: flags=0xc
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958566]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958574]     pending: vmstat_update
Sep  2 16:39:19 pegasus-tegra-a kernel: [ 4641.958636] pool 16: cpus=0-7 flags=0x4 nice=0 hung=0s workers=4 idle: 5624 5457 5497
Sep  2 16:39:49 pegasus-tegra-a kernel: [ 4672.663089] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 72s!
Sep  2 16:39:49 pegasus-tegra-a kernel: [ 4672.666833] Showing busy workqueues and worker pools:
Sep  2 16:39:49 pegasus-tegra-a kernel: [ 4672.666880] workqueue vmstat: flags=0xc
Sep  2 16:39:49 pegasus-tegra-a kernel: [ 4672.666882]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep  2 16:39:49 pegasus-tegra-a kernel: [ 4672.666893]     pending: vmstat_update
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402883] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402899] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402900] 	4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=9089 
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402943] 	
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402944] 	4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=9174 
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402948] (detected by 3, t=21007 jiffies, g=129355, c=129354, q=4036)
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402951] 	
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402951] Task dump for CPU 4:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402957] (detected by 5, t=21007 jiffies, g=248875, c=248874, q=4)
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402964] republish       R
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402964] Task dump for CPU 4:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402972]   running task    <6>[ 4683.402977] republish       R
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402978]     0  5190   5175 0x0000000a
Sep  2 16:40:20 pegasus-tegra-a kernel:  running task    Call trace:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.402986]     0  5190   5175 0x0000000a
Sep  2 16:40:20 pegasus-tegra-a kernel: [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.403000] Call trace:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.403003] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.403013] [<ffffff8008085858>] __switch_to+0x8c/0xac
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4683.403016] [<ffffffc5f0caba00>] 0xffffffc5f0caba00
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4703.382818] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 103s!
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4703.396666] Showing busy workqueues and worker pools:
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4703.396688] workqueue vmstat: flags=0xc
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4703.396690]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep  2 16:40:20 pegasus-tegra-a kernel: [ 4703.396699]     pending: vmstat_update
Sep  2 16:40:20 pegasus-tegra-a rsyslogd-2007: action 'action 9' suspended, next retry is Mon Sep  2 16:41:50 2019 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Sep  2 16:40:51 pegasus-tegra-a kernel: [ 4734.102233] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 134s!
Sep  2 16:40:51 pegasus-tegra-a kernel: [ 4734.105874] Showing busy workqueues and worker pools:
Sep  2 16:40:51 pegasus-tegra-a kernel: [ 4734.105907] workqueue vmstat: flags=0xc
Sep  2 16:40:51 pegasus-tegra-a kernel: [ 4734.105909]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256
Sep  2 16:40:51 pegasus-tegra-a kernel: [ 4734.105919]     pending: vmstat_update
Sep  2 16:41:22 pegasus-tegra-a kernel: [ 4746.422002] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  2 16:41:22 pegasus-tegra-a kernel: [ 4746.422005] INFO: rcu_preempt detected stalls on CPUs/tasks:
Sep  2 16:41:22 pegasus-tegra-a kernel: [ 4746.422022] 	4-...: (0 ticks this GP) idle=bc3/140000000000000/0 softirq=0/0 fqs=15751
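
One caveat with this approach: in the log above, the writeback worker (wb_workfn) is stuck in-flight and rsyslogd suspends “action 9”, so the very last kernel messages may never reach /var/log/syslog at all. A possibly more reliable alternative is to follow the kernel ring buffer directly; a minimal sketch, assuming the stock util-linux dmesg (which supports --follow):

# Follow the kernel ring buffer, bypassing the file writeback that
# appears to stall during the hang; --ctime prints wall-clock timestamps
dmesg --follow --ctime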

Dear itriA30220,
Could you share details about the SW stack you are using (like DW, NVMEDIA) and your use case?
Do you see any issue with sample_camera_multiple_gmsl in DW when using 9 cameras?
Do you notice this issue when more than 1 camera is connected?

Hi itriA30220,

Have you managed to get the issue resolved?
Are there any results you can share?

Hi guys,

We are using DriveWorks 1.5 for capture and NVMEDIA for video compression. The platform was flashed at the end of May with the latest version of SDK Manager (presumably 0.9.9).
The use case is surround video capture with 9 GMSL cameras (1080p, 30 FPS) on a vehicle.
So far, the only tests we have done were continuous capture with either 9 cameras or 1 camera. Each session was shorter than half an hour. With 1 camera, no issues were observed.
We speculate that the issue is caused by overheating of the camera capture chip. Is there any way to verify this (for example, by logging the thermal sensors, as sketched below)?
Right now we are working on reproducing the issue in the office.
We will run sample_camera_multiple_gmsl and let you know the outcome later.
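
For reference, below is roughly how we could log the temperatures via the standard Linux thermal sysfs; a sketch, with the assumption that one of the reported zone types corresponds to (or at least sits near) the camera capture chip, which would need to be checked against the Tegra BSP:

# Log every thermal zone once per second with a timestamp;
# temperatures are reported in millidegrees Celsius
while true; do
  for z in /sys/class/thermal/thermal_zone*; do
    printf '%s %s %s\n' "$(date +%T)" "$(cat "$z/type")" "$(cat "$z/temp")"
  done
  sleep 1
done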

Sorry for the late reply. I was on a business trip.

Best regards,
Andrey

Hi itriA30220,

DRIVE Software 10.0 is now available for download. New features include sensor plugin support for RADAR, GNSS, and CAN, new image processing modules, and improved DNNs for calculating time-to-collision, detecting multiple intersections per frame, and more. Please migrate to this new version.

Hi, kayccc.

We have tested sample_camera_multiple_gmsl with 9 cameras on DriveWorks 1.5 and 2.0. We also tested our camera grabber module and another team’s software module. None of the tests were able to reproduce the abnormal behavior.

Our platform was taken for RMA due to another issue: https://devtalk.nvidia.com/default/topic/1065771/general/can-t-discover-dgpu/. We can’t proceed with further tests on this platform.

Best regards,
Andrey

Hi Andrey,

Have you received a new or repaired unit yet to re-test this?

Hi, LukeNV.

No, we haven’t sent the old unit in yet. We are still using it for non-camera-related testing.

Best regards,
Andrey