Argus camera causes Linux system to lock up and restart abnormally

yxz1295324 · June 3, 2021, 7:03am

Hi
I have hit a very strange issue while developing an Argus-related application!
I set the process priority of the Argus application to 80 and bind it to cpu2. I also use multiple daemon processes to repeatedly restart multiple Argus-related applications: each daemon starts its Argus application again whenever that app exits.
We don’t connect any camera to the TX2. After several minutes, the Linux system locks up completely and restarts abnormally!

But if we don’t set the process priority for the Argus application, the Linux system works well and doesn’t restart.
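
For reference, the setup described above (priority 80, bound to cpu2, relaunched by a daemon whenever the app exits) can be sketched as follows. This is a minimal sketch under assumptions, not the code in the attached archive: it assumes the priority is a SCHED_FIFO real-time priority applied with the util-linux tools `taskset` and `chrt`, and the binary name `./argus_camera_agent` and both helper names are hypothetical.

```python
import subprocess
import time


def build_launch_command(app_argv, cpu=2, rt_priority=80):
    """Return an argv that runs app_argv pinned to `cpu` under SCHED_FIFO."""
    # Assumption: "priority 80" means a real-time (SCHED_FIFO) priority.
    if not 1 <= rt_priority <= 99:
        raise ValueError("SCHED_FIFO priorities range from 1 to 99")
    # taskset -c <cpu> pins the process; chrt -f <prio> selects SCHED_FIFO.
    return ["taskset", "-c", str(cpu), "chrt", "-f", str(rt_priority), *app_argv]


def supervise(argv, launches, delay_s=2.0):
    """Run argv `launches` times, restarting after each exit.

    Returns the list of exit codes, one per launch.
    """
    exit_codes = []
    for _ in range(launches):
        exit_codes.append(subprocess.run(argv).returncode)
        time.sleep(delay_s)  # short pause before the next restart
    return exit_codes
```

A call such as `supervise(build_launch_command(["./argus_camera_agent"]), launches=10)` would need root (or CAP_SYS_NICE) to apply the real-time policy. The loop is bounded here so that it is easy to stop; the daemons in this thread restart the app indefinitely.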
The Argus application:
argus_camera_agent_stress.tar.gz (637.3 KB)
The environment is listed below:

Hardware: TX2, with no camera connected
File system: JetPack 4.3
Has anyone seen or run into the same issue? Thank you very much!
B.R.

JerryChang · June 3, 2021, 7:16am

may I know what your actual use case is?
why would you like to bind the Argus application to cpu2 if you don’t even have a camera connected?

yxz1295324 · June 3, 2021, 7:36am

Hi, JerryChang
This is our test case; our original Argus app is tied to a lot of other business code.
Earlier in our testing, we saw the TX2 restart abnormally when the data link between the camera and the TX2 was disconnected for some unknown reason.
So we use this test code to reproduce the scenario.

JerryChang · June 3, 2021, 7:43am

hello yxz1295324,

what’s the scenario that interrupts your stream?
is it something like physically disconnecting the camera device while streaming?

yxz1295324 · June 3, 2021, 7:54am

Hi JerryChang,
Actually, in our case, sometimes our Argus apps can’t get an image, so we assume the data link is disconnected and restart our Argus apps.
So we simply don’t connect any camera to our system to simulate that scenario!

JerryChang · June 3, 2021, 9:08am

hello yxz1295324,

there’s a change for Argus Error Resiliency,
it adds error handling to the camera stack.
please refer to Jetson/L4T/r32.3.x patches - eLinux.org to apply the camera patches.

the Argus Error Resiliency patches allow for a user-space crash: they can handle error events and shut down the app gracefully.
you need an implementation on the application side to handle it.
please also refer to this topic for more details: Topic 170086.
thanks

yxz1295324 · June 3, 2021, 1:58pm

Hi, JerryChang
Thank you very much!
We will try your method.
B.R.

yxz1295324 · June 4, 2021, 8:24am

Hi, JerryChang
I have applied your method on our TX2 board.
Unfortunately, it doesn’t fix the issue.
B.R.

JerryChang · June 4, 2021, 8:29am

hello yxz1295324,

please set up a serial console and collect the detailed logs for reference,
thanks

yxz1295324 · June 4, 2021, 8:58am

Hi, JerryChang
the serial console outputs the logs below:
[ 2.354485] cgroup: cgroup2: unknown option "nsdelegate"
[ 3.899698] using random self ethernet address
[ 3.904284] using random host ethernet address
[ 4.166627] random: crng init done
[ 4.170054] random: 7 urandom warning(s) missed due to ratelimiting
[ 4.707366] using random self ethernet address
[ 4.713851] using random host ethernet address

Ubuntu 18.04.3 LTS localhost ttyS0

localhost login: [ 25.126961] nvgpu: 17000000.gp10b __nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gk20a_fifo_is_preempt_pending+0x68/0x128 [nvgpu]
[ 25.140394] nvgpu: 17000000.gp10b gk20a_fifo_is_preempt_pending:2975 [ERR] preempt timeout: id: 16 id_type: 1
[ 25.151009] nvgpu: 17000000.gp10b gk20a_fifo_preempt_tsg:3099 [ERR] preempt timed out for tsgid: 16, ctxsw timeout will trigger recovery if needed
[ 25.165416] nvgpu: 17000000.gp10b gr_gk20a_ctx_zcull_setup:898 [ERR] failed to preempt channel/TSG
[ 93.266581] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 51s!
[ 93.274541] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 47s!
[ 124.102595] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 82s!
[ 124.110568] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 78s!
[ 154.782596] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 113s!
[ 154.790652] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 109s!
[ 185.550590] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 144s!
[ 185.558654] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 139s!
[ 216.154593] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 174s!
[ 216.162652] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 170s!
[ 242.766673] INFO: task kworker/0:3:4889 blocked for more than 120 seconds.
[ 242.773713] Not tainted 4.9.140 #2
[ 242.777699] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.786263] Kernel panic - not syncing: hung_task: blocked tasks
[ 242.792271] CPU: 1 PID: 675 Comm: khungtaskd Not tainted 4.9.140 #2
[ 242.798529] Hardware name: quill (DT)
[ 242.802187] Call trace:
[ 242.804654] [] dump_backtrace+0x0/0x198
[ 242.810061] [] show_stack+0x24/0x30
[ 242.815119] [] dump_stack+0x98/0xc0
[ 242.820168] [] panic+0x11c/0x298
[ 242.824965] [] watchdog+0x300/0x3b8
[ 242.830015] [] kthread+0xec/0xf0
[ 242.834804] [] ret_from_fork+0x10/0x40
[ 242.840116] SMP: stopping secondary CPUs
[ 242.844045] Kernel Offset: disabled
[ 242.847531] Memory Limit: none
[ 242.850583] trusty-log panic notifier - trusty version Built: 18:08:29 Apr 8 2020
[ 242.868319] Rebooting in 5 seconds…
[0001.078] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-050b21a4)
[0001.086] I> bit @ 0xd480000
[0001.089] I> Boot-device: eMMC
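
The `BUG: workqueue lockup` lines in the log above can be summarized with a short script. This is a minimal sketch, assuming the serial log has been saved as text; the function names are illustrative. Every lockup line in this log reports `pool cpus=2`, which is the CPU the Argus application is bound to.

```python
import re

# Matches lines such as:
# [ 93.266581] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 51s!
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: workqueue lockup - pool cpus=(?P<cpus>\S+) "
    r".*?nice=(?P<nice>-?\d+) stuck for (?P<secs>\d+)s!"
)


def parse_lockups(log_text):
    """Extract timestamp, CPU set, nice level and stall time from each lockup line."""
    return [
        {
            "time": float(m["ts"]),
            "cpus": m["cpus"],
            "nice": int(m["nice"]),
            "stuck_s": int(m["secs"]),
        }
        for m in LOCKUP_RE.finditer(log_text)
    ]


def worst_stalls(lockups):
    """Return the longest reported stall per (cpus, nice) worker pool."""
    worst = {}
    for entry in lockups:
        key = (entry["cpus"], entry["nice"])
        worst[key] = max(worst.get(key, 0), entry["stuck_s"])
    return worst
```

Calling `worst_stalls(parse_lockups(log_text))` on a saved log gives one entry per worker pool, which makes it easy to see which CPU is starved and for how long.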

Also, the test code below can be used to reproduce the issue.
argus_camera_agent_stress.tar.gz

Just uncompress the file and execute the command below:
bash run.sh

B.R.

JerryChang · June 7, 2021, 3:18am

hello yxz1295324,

I have a question regarding your test app, argus_camera_agent.cpp.
it seems you have #if 0 to disable buffer handling in the camera stack; what’s the actual scenario for running this? would you like to allocate buffers but not actually use them?

BTW,
about your test scripts, which continuously launch the application:

    ##---------- camera related -----------------------------------------------
    nohup bash ./argus_daemon.sh > /dev/null &
    sleep 2
    nohup bash ./argus_daemon.sh > /dev/null &
    sleep 2
...

I do not see a code snippet that unregisters the camera device (i.e. deallocates the handler),
so the 2nd process should not work because the camera device is occupied by your 1st process.

yxz1295324 · June 9, 2021, 3:10am

Hi, JerryChang
I’m sorry for the late reply!
Actually, whether buffer handling in the camera stack is enabled or disabled, the issue is the same and can be reproduced. So we simply use #if 0 to disable buffer handling in order to reproduce the issue faster.
B.R.

JerryChang · June 9, 2021, 8:49am

hello yxz1295324,

nevertheless, it’s not reasonable to run a camera app without any camera connected.

here’s another proposal for your verification:
instead of running the camera app without camera devices, please use a software-simulated method to force-stop the sensor stream to test your use case. you can force-stop the video stream via a sysnode as follows.
i.e. # echo 0 > /sys/kernel/debug/camera-video0/streaming
please test this with the error handling applied, as in the changes mentioned in comment #7.
thanks
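
The force-stop step above could be scripted from a test harness along these lines. This is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug, the script runs as root, and the camera is exposed as `camera-video0` as in the command above; the function name is illustrative.

```python
from pathlib import Path

# Node taken from the command in this thread; the index may differ per setup.
STREAMING_NODE = Path("/sys/kernel/debug/camera-video0/streaming")


def force_stop_stream(node=STREAMING_NODE):
    """Write 0 to the streaming node, equivalent to `echo 0 > node`."""
    node = Path(node)
    if not node.exists():
        # Typically means debugfs is not mounted or no such video device exists.
        raise FileNotFoundError(f"streaming node not found: {node}")
    node.write_text("0\n")
```

Taking the node as a parameter keeps the helper usable with another video index and lets it be exercised against an ordinary file during development.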