Argus camera may deadlock when started with SCHED_FIFO policy

We use the Argus library to develop an image application with an ar0144 camera on a TX2. It worked well before.

But recently we changed the scheduling policy (SCHED_OTHER -> SCHED_FIFO) and raised the priority of our application (top shows PR -81). Since then, the application sometimes fails to get images and the CPU load sits at ~100%; this happens in roughly 1 in 20 runs.
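
Roughly, the change amounts to the following (a minimal sketch, not the exact code in our argus_camera_agent.cpp; a real-time priority of 80 is what top reports as PR -81):

    #include <sched.h>
    #include <cstdio>
    #include <cstring>

    // Minimal sketch: promote the whole process from SCHED_OTHER to SCHED_FIFO.
    static bool set_fifo_priority(int rt_priority)         // e.g. 80, shown as PR -81 by top
    {
        struct sched_param sp;
        std::memset(&sp, 0, sizeof(sp));
        sp.sched_priority = rt_priority;                    // valid range for SCHED_FIFO is 1..99

        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  // pid 0 = calling process
            std::perror("sched_setscheduler(SCHED_FIFO)");
            return false;
        }
        return true;
    }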

We did some debugging: three threads are busy occupying the CPU, and several other threads are blocked in pthread_mutex_lock().

Is this related to the scheduling policy?
Is there a known solution?

Our system:

  • NVIDIA Jetson TX2
    • Jetpack 4.3 [L4T 32.3.1]
    • NV Power Mode: MAXN - Type: 0



How to reproduce it?

Hello ShaneCCC,
Here is the demo code and tooling:
argus_camera_agent.zip (1000.9 KB)
Unzip it; the reproduction steps are in readme.md.

We use stress_argus.sh to stress argus_camera_agent.

When argus_camera_agent is started together with cache_fresh.sh, argus_camera_agent locks up with high CPU load.

But:

  1. if cache_fresh.sh is not started in stress_argus.sh, it works well; or
  2. if the process priority setting in argus_camera_agent.cpp (line 372) is commented out, it works well (a sketch of what that setting does follows this list).
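
For clarity, here is a hypothetical sketch of what the priority setting around line 372 of argus_camera_agent.cpp amounts to, with the toggle from item 2 expressed as an environment variable (the guard and names are illustrative only, not the real code):

    #include <sched.h>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical sketch, not the real line 372: the SCHED_FIFO request that
    // item 2 above comments out, guarded so it can be switched off for an A/B test.
    static void maybe_set_rt_priority()
    {
        if (std::getenv("AGENT_NO_RT") != nullptr)    // illustrative toggle only
            return;                                   // same effect as commenting out line 372

        struct sched_param sp = {};
        sp.sched_priority = 80;                       // shown as PR -81 by top
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            std::perror("sched_setscheduler");
    }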

Hello ShaneCCC,
Is there an update on this question?

We are trying to reproduce it on the reference sensor board now.

I tried to build it on r32.5 and got the error below when running it.

root@nvidia-desktop:/home/nvidia/sched_fifo/build# ./argus_camera_agent
No protocol specified
nvbuf_utils: Could not get EGL display connection
Segmentation fault (core dumped)

sudo ./argus_camera_agent [camera_idx]

In our system, the camera is the /dev/video1 node, so camera_idx=1:

sudo ./argus_camera_agent 1

A more detailed description is in readme.md and stress_argus.sh.

Recently we have an update on this issue: the stuck situation seems to occur only after both setting the SCHED_FIFO priority and starting the cache_fresh.sh script.

I got the error below from stress_argus.sh:

nvidia@nvidia-desktop:~/sched_fifo$ sudo bash stress_argus.sh
kernel.sched_rt_runtime_us = -1
2 20210406222028
taskset: failed to set pid 19141's affinity: Invalid argument
3 20210406222033
taskset: failed to set pid 19144's affinity: Invalid argument

Thanks for testing.
Are cpu1 and cpu2 offline on your board?
Could you modify line 12 of stress_argus.sh to

taskset -c 0 nohup ./build/argus_camera_agent 1 60 > ./stress_argus_log/$dd.log

and retry?
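
For reference, taskset reports "Invalid argument" when the requested CPU set contains no online CPU; the same condition can be checked from code with sched_setaffinity() (a small sketch, not part of the agent):

    #include <sched.h>
    #include <cstdio>

    // Small sketch: pin the calling process to one CPU, the same effect as
    // "taskset -c 0". sched_setaffinity() fails with EINVAL when the mask
    // contains no online CPU, which matches the taskset error above.
    static bool pin_to_cpu(int cpu)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            std::perror("sched_setaffinity");
            return false;
        }
        return true;
    }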

I just ran the stress test on r32.5.1/TX2; it ran 515 loops without a problem.

Dear ShaneCCC,

Could you make sure argus_camera_agent's PR setting took effect?

top -d 1 -n 1000 | grep argus

If argus_camera_agent's PR is a value such as -81, the priority setting was successful.
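
Besides top, the setting can also be verified from inside the process itself (a small sketch, not part of the agent):

    #include <sched.h>
    #include <cstdio>

    // Small sketch: confirm from inside the process that SCHED_FIFO took effect.
    static void report_scheduling()
    {
        struct sched_param sp;
        int policy = sched_getscheduler(0);               // 0 = calling process
        if (policy == -1 || sched_getparam(0, &sp) != 0) {
            std::perror("sched_getscheduler/sched_getparam");
            return;
        }
        std::printf("policy=%s rt_priority=%d\n",
                    policy == SCHED_FIFO ? "SCHED_FIFO" : "not SCHED_FIFO",
                    sp.sched_priority);
    }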

Yes, it is -81

nvidia@nvidia-desktop:~$ top -d 1 -n 1000 | grep argus
19653 root -81 0 16.032g 61916 25468 D 19.6 0.8 0:00.22 argus_camera_ag
19653 root -81 0 16.511g 78968 29516 S 6.8 1.0 0:00.29 argus_camera_ag

The problem occurs within about 10 loops of the stress script.
I don't know what the difference between our testing environments is.

I will do more tests to check. Or do you have any suggestions?

Thank you.

Could you try on 32.5.1?

Yes, I will try it

Hi ShaneCCC:
I have done the same test on 32.5.1, and the lockup still appears, the same as on 32.3.1.

  • Jetpack 4.5.1 [L4T 32.5.1]

I don't have any further debugging ideas.
Do you have any other test suggestions?

Thanks.

Did you verify with the reference sensor ov5693?

No, we don't have an ov5693 on hand.
I am testing with the ar0144 and ar0234 sensors.

I think if you bought the TX2 devkit, an ov5693 should be mounted on it by default.