Nvargus-daemon crash

Hi

We are running our GStreamer-based application on a Jetson Nano.

System information:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t210ref, EABI: aarch64, DATE: Fri Oct 16 19:44:43 UTC 2020

During long runs we saw that the nvargus-daemon crashes. We have been chasing this for a while, and now we were able to catch it with logs and a recording.

So the system had been started almost 24 hours before:

$ uptime
15:06:24 up 1 day, 1:09, 9 users, load average: 2,00, 2,35, 2,50

Based on the logs, the crash happened around 14:46:28. Based on the recording, the issues started around 14:46:16 with large purple stripes on the video and some noisy pixels.

After this the screen went black.

The backtrace in the coredump is not telling me a lot:

Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/nvargus-daemon'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0001000003000300 in ?? ()
[Current thread is 1 (Thread 0x7f8e2a01d0 (LWP 26451))]
(gdb) bt
#0 0x0001000003000300 in ?? ()
#1 0x0000007f9459e150 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#2 0x0000007f9459f928 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#3 0x0000007f944cf538 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#4 0x0000007f9451f85c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#5 0x0000007f94530ab8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#6 0x0000007f94514478 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#7 0x0000007f944e0c4c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#8 0x0000007f944e0e84 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#9 0x0000007f944dfa50 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#10 0x0000007f950be628 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvos.so
#11 0x0000007f94e2e088 in start_thread (arg=0x7f8f2a12cf) at pthread_create.c:463
#12 0x0000007f952a0ffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

I'm attaching the crash dump:
_usr_sbin_nvargus-daemon.0.crash (21.4 MB)

After this, our recovery mechanism kicks in, shuts down our pipelines, and restarts the nvargus-daemon.

In the nvargus-daemon log I see the following:

may 24 14:46:28 falconprod nvargus-daemon[25496]: === gst-launch-1.0[25701]: Connection closed (7F8F2A21D0)
=== gst-launch-1.0[25701]: WARNING: CameraProvider was not destroyed before client connection terminated.
=== gst-launch-1.0[25701]: The client may have abnormally terminated. Destroying CameraProvider...
=== gst-launch-1.0[25701]: CameraProvider destroyed (0x7f88b5c610)
=== gst-launch-1.0[25701]: WARNING: Cleaning up 1 outstanding requests...
=== gst-launch-1.0[25701]: WARNING: Cleaning up 1 outstanding streams...
SCF: Error InvalidState: 3 buffers still pending during EGLStreamProducer destruction (propagating from src/services/gl/EGLStreamProducer.cpp, function freeBuffers(), line 306)
may 24 14:46:28 falconprod nvargus-daemon[25496]: SCF: Error Disconnected: (propagating from src/services/gl/EGLStreamProducer.cpp, function presentBufferInternal(), line 539)
may 24 14:46:28 falconprod nvargus-daemon[25496]: SCF: Error InvalidState: (propagating from src/services/gl/EGLStreamProducer.cpp, function ~EGLStreamProducer(), line 50)
may 24 14:46:28 falconprod nvargus-daemon[25496]: === gst-launch-1.0[25701]: WARNING: Cleaning up 1 outstanding stream settings...
=== gst-launch-1.0[25701]: WARNING: Cleaning up 1 outstanding sessions...
(NvCameraUtils) Error InvalidState: Mutex not initialized (/dvs/git/dirty/git-master_linux/camera/core_scf/src/services/gl/EGLStreamProducer.cpp:212) (in Mutex.cpp, function lock(), line 79)
may 24 14:46:28 falconprod systemd[1]: Stopping Argus daemon...

Before this there is nothing suspicious in the dmesg log either.

Do you have any idea what could cause this? Or how could we proceed to find the reason?

Thanks!

Best regards,
peter

I would suggest checking whether a problem in the sensor driver is causing the capture to fail; a failed capture can put the daemon into an unknown state and cause this issue. Two things to try:

  1. Use v4l2-ctl to confirm that you can capture data from the sensor continuously.
  2. Disable all of the AE/gain/exposure control functions in the sensor driver to verify whether any one of them causes the capture to fail.

Is there any specific v4l2-ctl command I should run? As far as I can see it has a lot of options.

For me the most useful thing would be to get live statistics while the application is running (when the issue happens the system reboots itself, so my strategy was to start ssh sessions from a different host and run commands like dmesg -w or journalctl -f -u nvargus-daemon there; even if the system reboots I still have the logs). As far as I can see, v4l2-ctl can access the device even while it is in use by another process (at least v4l2-ctl -d0 -D gave me output :) ).
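For reference, this is roughly the remote capture setup (a sketch; the user and host names are placeholders, and journalctl may require sudo depending on group membership):

# on a second host, one terminal per log, so the output survives even if the Jetson goes down
ssh user@jetson 'dmesg -w' | tee dmesg-live.log
ssh user@jetson 'journalctl -f -u nvargus-daemon' | tee nvargus-live.log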

I guess I'm not using any AE/GAIN/EXPOSURE control function, but can you elaborate on what you mean by that? Or give an example command for how I can make sure I'm not using any :D

By the way, is there any way to get some meaningful output out of this crash dump? Just for future reference, so that if it happens again I don't waste time trying to figure it out :D
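(In case anyone wants to reproduce my steps: the attached file is an apport crash report, so I unpacked it first and then pointed gdb at the extracted core. As far as I can tell the Tegra camera libraries ship stripped, which is why every frame shows up as "??". Paths below are from my setup:)

apport-unpack _usr_sbin_nvargus-daemon.0.crash /tmp/nvargus-crash
gdb /usr/sbin/nvargus-daemon /tmp/nvargus-crash/CoreDump
# then inside gdb: bt, info threads, thread apply all bt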

For the v4l2-ctl command you can check this one; modify the resolution to match what your sensor supports.

v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080 --set-ctrl bypass_mode=0
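If you add the streaming options, it keeps capturing and prints the measured frame rate, so you can run it instead of the camera pipeline to see whether the sensor side ever stalls. A sketch; adjust the device and resolution for your sensor:

v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080 --set-ctrl bypass_mode=0 --stream-mmap
# streams until interrupted with Ctrl+C and prints the measured fps periodically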

For the control functions, you can check the sensor driver and modify them into dummy (no-op) functions to verify.
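To see which AE/gain/exposure controls the sensor driver exposes in the first place, you can also list them with their current values:

v4l2-ctl -d /dev/video0 --list-ctrls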

Hi

Sorry for the late response; I was trying to narrow down the problem to pinpoint where the issue is happening (but it takes 12-18 hours to crash, so I have about one shot every day).

This is our pipeline:

gst-launch-1.0 -e nvarguscamerasrc sensor-id=1 sensor-mode=0 \
! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, format=(string)NV12, framerate=(fraction)30/1' \
! nvvidconv ! nvivafilter cuda-process=true customer-lib-name=liboverlay_1080p_simle.so \
! 'video/x-raw(memory:NVMM), format=(string)NV12' \
! nvvidconv ! nvv4l2vp8enc bitrate=8000000 control-rate=1 ! rtpvp8pay mtu=1400 \
! udpsink auto-multicast=true clients=127.0.0.1:56100,127.0.0.1:56101

So based on my tests, the issue only happens if I run the pipeline with my custom lib. If I run it without the nvivafilter, or with the "nvsample_cudaprocess_custom" lib, the issue does not occur within a 24-28 hour test period (the longest test I ran before concluding it without a crash). So it seems to be connected to my code, but I cannot be 100% sure, since the crash also did not happen when I ran the pipeline at 720p or with a Raspberry Pi camera, so it might be connected to some resource in some weird way.
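For reference, the baseline run was the same pipeline with only the customer lib swapped out (a sketch; on my system the stock sample lib is installed as libnvsample_cudaprocess.so, the exact name may differ on other releases):

gst-launch-1.0 -e nvarguscamerasrc sensor-id=1 sensor-mode=0 \
! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, format=(string)NV12, framerate=(fraction)30/1' \
! nvvidconv ! nvivafilter cuda-process=true customer-lib-name=libnvsample_cudaprocess.so \
! 'video/x-raw(memory:NVMM), format=(string)NV12' \
! nvvidconv ! nvv4l2vp8enc bitrate=8000000 control-rate=1 ! rtpvp8pay mtu=1400 \
! udpsink auto-multicast=true clients=127.0.0.1:56100,127.0.0.1:56101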
I did not notice anything abnormal in the tegrastats output either when the crash happened:

RAM 1608/3964MB (lfb 209x4MB) SWAP 0/1982MB (cached 0MB) CPU [38%@1479,39%@1479,36%@1479,32%@1479] EMC_FREQ 0% GR3D_FREQ 11% PLL@34.5C CPU@40.5C PMIC@100C GPU@37.5C AO@44C thermal@39C POM_5V_GPU 4290/4088 POM_5V_IN 78/77 POM_5V_CPU 1259/1115
RAM 1608/3964MB (lfb 209x4MB) SWAP 0/1982MB (cached 0MB) CPU [42%@1036,39%@1036,41%@1036,36%@1036] EMC_FREQ 0% GR3D_FREQ 11% PLL@34.5C CPU@40.5C PMIC@100C GPU@37.5C AO@44C thermal@39.5C POM_5V_GPU 3909/4088 POM_5V_IN 78/77 POM_5V_CPU 985/1115
RAM 1607/3964MB (lfb 209x4MB) SWAP 0/1982MB (cached 0MB) CPU [36%@1132,31%@1132,30%@1132,29%@1132] EMC_FREQ 0% GR3D_FREQ 11% PLL@34.5C CPU@40.5C PMIC@100C GPU@37.5C AO@44.5C thermal@39C POM_5V_GPU 3869/4088 POM_5V_IN 78/77 POM_5V_CPU 985/1115
RAM 1611/3964MB (lfb 209x4MB) SWAP 0/1982MB (cached 0MB) CPU [53%@1479,46%@1479,50%@1479,40%@1479] EMC_FREQ 0% GR3D_FREQ 5% PLL@35C CPU@41.5C PMIC@100C GPU@38C AO@44.5C thermal@39.75C POM_5V_GPU 5200/4088 POM_5V_IN 78/77 POM_5V_CPU 2072/1115
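(The tegrastats lines above carry no timestamps, so for the next run I am capturing them like this, to be able to line them up with the journal; a simple sketch:)

tegrastats | while read -r line; do echo "$(date '+%F %T') $line"; done | tee tegrastats.log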

I’m attaching my code:
nvivafilter_overlay.tar.gz (10.4 KB)

It's not a beauty :D (it's my first ever CUDA code), but I cannot find any clear reason in it that would lead to such an error.

I'm also attaching the dmesg and kern.log files that were written when the error happened (not sure whether those are consequences or root causes of the error, or just happened to occur at the same time, but you might spot something there):
logs.tar.gz (750.7 KB)

As it is very cumbersome to troubleshoot and debug such an error, I was thinking that maybe I am the one approaching it the wrong way. Do you have any idea where else I should look, or some other way to pinpoint the root cause?
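One more thing I plan to try, based on other forum threads, is stopping the service and running the daemon in the foreground with verbose camera logging enabled, so the next crash leaves more context (my understanding of the relevant environment variables; treat them as an assumption):

sudo systemctl stop nvargus-daemon
sudo su
export enableCamPclLogs=1   # verbose sensor/driver (PCL) logs, per other forum threads
export enableCamScfLogs=1   # verbose camera core (SCF) logs, per other forum threads
/usr/sbin/nvargus-daemon 2>&1 | tee /tmp/nvargus-daemon.log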

Thank you!