Nvargus-daemon locks up video pipeline on errors while using multiple cameras

JDSchroeder · April 27, 2021, 9:58pm

This is a continuation from the previous topic

with the focus on fixing nvargus-daemon to be more robust in cleaning up the video pipeline OR dying gracefully and letting systemd restart it when errors occur.

On the latest R32.5.0, and going back all the way to R32.4.3, when a camera error occurs while two or more cameras are streaming, nvargus-daemon fails to recover by cleaning up the connections. If only a single camera is streaming when errors occur, nvargus-daemon crashes with a segmentation fault, systemd automatically restarts it, and it becomes usable again. However, this does not happen when two or more cameras are streaming.

Here is a link to an even older topic where camera errors bring down the entire video pipeline and make it unusable:

Please note this topic is not about how to fix the camera streaming errors. This topic is about how to make nvargus-daemon robust enough to recover the video pipeline when errors do occur (and they will in a real system).

To reproduce the issue, all that is required is to stream from two cameras using GStreamer or any other application. Once the cameras are streaming, force an error condition through any of the following methods:

Physically disconnect the camera (e.g., SerDes, FPD-Link, GMSL, CSI ribbon cable)
Disconnect camera power (e.g., via an I2C write or GPIO)
Reset image sensor
Inject error(s) on the CSI packets
Change the image sensor clock
Change the image sensor (CSI transmitter) settings to cause an error
Et cetera

Any one of these methods should cause nvargus-daemon to deadlock and become unusable until it is manually restarted. Again, these are artificial methods to reproduce the issue with nvargus-daemon. The real-world causes are ESD, electrical noise, vibration, shock, and cosmic rays, which will be difficult for you to reproduce.
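
As a rough illustration of the two-camera setup described above, here is a minimal Python sketch that builds one gst-launch-1.0 command per sensor. The pipeline string (nvarguscamerasrc feeding fakesink) and the sensor IDs 0 and 1 are assumptions chosen for illustration, not the exact pipelines used in this report, and the helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_stream_command(sensor_id: int) -> List[str]:
    """Return argv for a minimal capture pipeline on one sensor.

    The pipeline (nvarguscamerasrc -> fakesink) is an assumed stand-in
    for whatever pipeline the real application runs.
    """
    pipeline = "nvarguscamerasrc sensor-id={} ! fakesink".format(sensor_id)
    return ["gst-launch-1.0"] + shlex.split(pipeline)


def build_repro_commands(sensor_ids: Sequence[int] = (0, 1)) -> List[List[str]]:
    """Build one command per camera; two or more are needed to hit the deadlock."""
    return [build_stream_command(s) for s in sensor_ids]


def launch_all(commands: List[List[str]]) -> List[subprocess.Popen]:
    """Start every pipeline as its own process and return the handles."""
    return [subprocess.Popen(argv) for argv in commands]
```

Calling `launch_all(build_repro_commands())` on the target starts both streams; one of the error conditions listed above can then be forced while they run.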

ShaneCCC · April 28, 2021, 4:03am

The current release includes error-handling sample code in Argus/samples/userAutoExposure.
Does that not work for you?

JDSchroeder · April 28, 2021, 4:14pm

No, that does not work. No amount of error handling at the application level fixes the issue with nvargus-daemon deadlocking. The problem is easily reproducible with argus_camera or gst-launch-1.0. I cannot make nvargus-daemon work correctly by adding more error detection and handling at the application level. Only modifying nvargus-daemon to handle errors gracefully will fix the issue.

Here is the output of argus_camera in Multi Session mode:
argus_camera_terminal_output.txt (4.5 KB)
The first run is where the camera error occurs and I force kill the argus_camera application. The second execution shows argus_camera locked up and unable to run because nvargus-daemon is deadlocked and will not allow any further camera operation to work properly.

Here is the output of nvargus-daemon when the error occurs and the very last line of output when it is deadlocked and non-functional:
nvargus-daemon-deadlock-from-argus_camera.txt (16.0 KB)

Here is another deadlock of nvargus-daemon where I added some comments (not part of the output) so you can see some additional messages and actions I took with argus_camera to try and recover (to no avail):
nvargus-daemon-deadlock-from-argus_camera-with-comments.txt (10.5 KB)

Here is another deadlock of nvargus-daemon where I enabled the PCL and SCF logging and included comments:
nvargus-daemon-deadlock-with-extra-logging.txt (3.3 MB)

How can we switch the topic focus to nvargus-daemon and start debugging the deadlock within the NVIDIA camera/ISP daemon?

DaneLLL · April 29, 2021, 3:09am

Hi,
We made error handling improvements in nvargus-daemon from JP4.4 to JP4.5, so we would like to confirm whether you receive an error status in the upper application layer. It looks like the conditions you hit are not handled, so no error status is reported. We will try to reproduce the conditions and investigate.

cliff.hofman · April 30, 2021, 7:49pm

@DaneLLL, we are using JP4.5. Some of the errors are reported up and seen on the GST bus. However, nvargus-daemon is locked up, so the GStreamer pipeline cannot be destroyed and no new streaming can occur. The only solution we have found is to kill nvargus-daemon.

We need a more robust solution for nvargus-daemon so that the application layer can recover and start streaming again.

Thanks,
Cliff
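
One way an application could at least notice the hang described above (a pipeline teardown that never returns) is to run the teardown under a timeout. The sketch below is a generic Python pattern, not part of Argus or GStreamer; `run_with_timeout` is a hypothetical helper, and it only detects the hang, it does not fix the daemon.

```python
import threading


def run_with_timeout(func, timeout_s):
    """Run func() in a daemon thread and wait up to timeout_s seconds.

    Returns True if func returned (or raised) within the timeout and
    False if it is still blocked. A blocked call is left running in the
    background thread; Python cannot cancel it.
    """
    done = threading.Event()

    def worker():
        try:
            func()
        finally:
            # Signal completion whether func succeeded or raised.
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)
```

For example, `run_with_timeout(stop_pipeline, 5.0)` returning False, where `stop_pipeline` stands in for whatever teardown call the application makes, would suggest the daemon is no longer responding and that the application should escalate rather than wait indefinitely.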

JerryChang · May 3, 2021, 3:15am

[EDIT]

hello JDSchroeder,

besides physically disconnecting the camera device,
could you please try a software simulated method to stop the video stream?
for example,
here’s a command to control the stream:
# echo 0 > /sys/kernel/debug/camera-video0/streaming

please try shutting down the stream as above while running argus_camera or gst-launch-1.0.
[EDIT]
please test with Argus samples, such as argus_userautoexposure, which has error handling implemented.

after you force-stop the stream, you should kill and restart the nvargus-daemon service.

$ sudo pkill nvargus-daemon
$ sudo nvargus-daemon

from the software point of view,
after you force-stop the camera stream, there will be timeout failures from the camera pipeline. Argus will report EVENT_TYPE_ERROR, and the application then has to shut down gracefully.
hence,
please confirm whether you can restore camera functionality with the software simulated methods,
thanks

cliff.hofman · May 3, 2021, 1:34pm

Hi @JerryChang ,

I am working with @JDSchroeder on this issue. Before your edit, I did test the procedure with GStreamer pipelines using nvarguscamerasrc.

I had inconsistent results when issuing “echo 0 > /sys/kernel/debug/camera-video0/streaming”.
Sometimes I was able to cleanly stop the GStreamer pipelines and sometimes I was not.

Even when I was able to cleanly stop the pipelines, I was not able to restart any pipelines in our application after issuing:
$ sudo pkill nvargus-daemon
$ sudo nvargus-daemon

After killing & restarting nvargus-daemon, does the application using the daemon also need to be restarted? It looks like the new launch of the daemon is not seeing anything from the existing application.

Why does the daemon need to be killed and restarted?

We are using the NVIDIA-provided nvarguscamerasrc and need to be able to recover from errors without restarting the daemon and the application.

If we try to use argus_userautoexposure, do you expect to be able to reconnect to the Argus daemon, or do you think that application will need to be stopped and restarted as well?

Does nvarguscamerasrc need additional error handling or does it have the same error handling as the Argus examples?

Thanks,
Cliff

JerryChang · May 4, 2021, 6:05am

hello cliff.hofman, JDSchroeder,

the nvargus-daemon service runs in the background as the camera stack process.
if you look into the log, the camera stack reports timeout errors and nvargus-daemon sometimes gets stuck waiting for the sensor stream; eventually, nvargus-daemon enters InvalidState, which is unrecoverable.

according to the readme file, if an application crash/hang occurs, the nvargus-daemon service may be left in a bad state, and the hardware may be unavailable for a short time afterwards; when this occurs it is best to restart the nvargus-daemon service and wait for about 15 seconds before attempting to run another application.
hence,
please kill and restart nvargus-daemon to try to resolve this at the software level.

FYI,
we used two approaches for testing this locally: (1) software simulated methods and (2) physically disconnecting the camera.
we tested with $ argus_userautoexposure -f 1000, generated an error status while streaming, and observed the results.

Jetson TX2:
camera functionality can be restored with the software simulated methods; we don’t have a problem with the simulated fault injection.
however, after a physical sensor disconnect, we are sometimes required to reboot the board; restarting nvargus-daemon alone does not help.

Jetson Xavier:
camera functionality can be restored with both the software simulated methods and a physical camera disconnect.
thanks
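
The readme guidance quoted above (restart the service, then wait about 15 seconds) could be scripted roughly as follows. This is a sketch under assumptions: it assumes the daemon is managed by a systemd unit named nvargus-daemon, whereas the commands earlier in this thread use pkill followed by launching the binary directly. The `run` and `sleep` parameters are hypothetical injection points so the logic can be exercised off-target.

```python
import subprocess
import time

# Assumed systemd unit name; the thread itself uses
# "sudo pkill nvargus-daemon" followed by "sudo nvargus-daemon".
RESTART_CMD = ["sudo", "systemctl", "restart", "nvargus-daemon"]
SETTLE_SECONDS = 15  # "about 15 seconds" per the readme quoted above


def restart_daemon(run=subprocess.run, sleep=time.sleep, settle_s=SETTLE_SECONDS):
    """Restart the daemon, then wait before any application reconnects.

    Returns False if the restart command reports failure, True once the
    settle delay has elapsed.
    """
    result = run(RESTART_CMD)
    if result.returncode != 0:
        return False
    sleep(settle_s)
    return True
```

As reported in this thread, a restart alone may not be sufficient: the application using the daemon may also need to be restarted, and on Jetson TX2 a physical disconnect sometimes required a reboot.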
