Nvargus-daemon locks up video pipeline on errors while using multiple cameras

This is a continuation from the previous topic

with the focus on fixing nvargus-daemon to be more robust in cleaning up the video pipeline OR dying gracefully and letting systemd restart it when errors occur.

On the latest R32.5.0, and going all the way back to R32.4.3, when a camera error occurs while two or more cameras are streaming, nvargus-daemon fails to recover and clean up the connections. If only a single camera is streaming when the error occurs, nvargus-daemon crashes with a segmentation fault, systemd automatically restarts it, and it becomes usable again. This does not happen when two or more cameras are streaming.

Here is a link to an even older topic where camera errors bring down the entire video pipeline and make it unusable:

Please note this topic is not about how to fix the camera streaming errors. This topic is about how to make nvargus-daemon robust enough to recover the video pipeline when errors do occur (and they will in a real system).

To reproduce the issue, all that is required is to stream from two cameras using GStreamer or any other application. Once the cameras are streaming, force an error condition through any of the following methods:

  • Physically disconnect the camera (e.g., SerDes, FPD-Link, GMSL, CSI ribbon cable)
  • Disconnect camera power (e.g., via an I2C write or a GPIO)
  • Reset image sensor
  • Inject error(s) on the CSI packets
  • Change the image sensor clock
  • Change the image sensor (CSI transmitter) settings to cause an error
  • Et cetera

Any one of these methods should cause nvargus-daemon to deadlock and become unusable until it is manually restarted. Again, these are artificial ways to reproduce the issue with nvargus-daemon. The real-world causes are ESD, electrical noise, vibration, shock, and cosmic rays, which will be difficult for you to reproduce.
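A minimal sketch of the streaming half of this reproduction, assuming two sensors are registered as sensor-id 0 and 1 and that frames can simply be discarded with fakesink. The function names and the RUN dry-run variable are illustrative and not part of any NVIDIA tool. Once both pipelines are running, one of the fault methods listed above can be applied.

```shell
#!/bin/sh
# Sketch only: stream from two cameras at once with gst-launch-1.0.
# Assumptions: sensors 0 and 1 exist; function names and RUN are hypothetical.
# Set RUN=echo to print the commands instead of running them.

start_camera() {
    # $1 is the sensor-id handed to nvarguscamerasrc.
    ${RUN:-} gst-launch-1.0 nvarguscamerasrc sensor-id="$1" ! \
        'video/x-raw(memory:NVMM)' ! fakesink
}

start_two_cameras() {
    # Run both pipelines in the background, then block until both exit.
    start_camera 0 &
    start_camera 1 &
    wait
}
```

Calling start_two_cameras on the target keeps both streams alive until the pipelines exit or are killed.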

The current release has error handling sample code for this in Argus/samples/userAutoExposure.
Does that not work for you?

No, that does not work. No amount of error handling at the application level fixes the issue with nvargus-daemon deadlocking. The problem is easily reproducible with argus_camera or gst-launch-1.0. I cannot make nvargus-daemon work correctly by adding more error detection and handling at the application level. Only modifying nvargus-daemon to handle errors gracefully will fix the issue.

Here is the output of argus_camera in Multi Session mode:
argus_camera_terminal_output.txt (4.5 KB)
The first run is where the camera error occurs and I force kill the argus_camera application. The second execution shows argus_camera locked up and unable to run because nvargus-daemon is deadlocked and will not allow any further camera operation to work properly.

Here is the output of nvargus-daemon when the error occurs and the very last line of output when it is deadlocked and non-functional:
nvargus-daemon-deadlock-from-argus_camera.txt (16.0 KB)

Here is another deadlock of nvargus-daemon where I added some comments (not part of the output) so you can see some additional messages and actions I took with argus_camera to try and recover (to no avail):
nvargus-daemon-deadlock-from-argus_camera-with-comments.txt (10.5 KB)

Here is another deadlock of nvargus-daemon where I enabled the PCL and SCF logging and included comments:
nvargus-daemon-deadlock-with-extra-logging.txt (3.3 MB)

How can we switch the topic focus to nvargus-daemon and start debugging the deadlock within the NVIDIA camera/ISP daemon?

Hi,
We made error handling improvements in nvargus-daemon from JP4.4 to JP4.5, so we would like to confirm whether you receive an error status in the upper application layer. It looks like the condition you hit is not handled, so no error status is reported. We will try to reproduce the condition and investigate.

@DaneLLL, we are using JP4.5. Some of the errors are reported up and seen on the GST bus. However, nvargus-daemon is locked up, so the GStreamer pipeline cannot be destroyed and no new streaming can occur. The only solution we have found is to kill nvargus-daemon.

We need a more robust solution for nvargus-daemon so that the application layer can recover and start streaming again.

Thanks,
Cliff

[EDIT]

hello JDSchroeder,

besides physically disconnecting the camera device,
could you please try a software-simulated method to stop the video stream?
for example,
here is a command to control the stream,
# echo 0 > /sys/kernel/debug/camera-video0/streaming

please try shutting down the stream as above while running argus_camera or gst-launch-1.0.
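A small sketch of the software fault injection described above, wrapped so it can be pointed at a particular camera. It assumes the node for sensor N is camera-videoN/streaming (the thread only shows camera-video0) and that the caller has root access to debugfs. The names stop_stream and DEBUGFS_ROOT are hypothetical.

```shell
#!/bin/sh
# Sketch only: software fault injection through the debugfs streaming node.
# Assumptions: the node for sensor N is camera-videoN/streaming (the thread
# only shows camera-video0); DEBUGFS_ROOT and stop_stream are hypothetical.

stop_stream() {
    # $1 is the video device index, e.g. 0 for camera-video0.
    node="${DEBUGFS_ROOT:-/sys/kernel/debug}/camera-video$1/streaming"
    if [ ! -w "$node" ]; then
        echo "cannot write $node (missing node or not root?)" >&2
        return 1
    fi
    # Writing 0 is the same operation as the echo command shown above.
    echo 0 > "$node"
}
```

For example, stop_stream 0 run as root while a pipeline is streaming from the first camera.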
[EDIT]
please test with Argus samples, such as argus_userautoexposure , which has error handling implemented.

after you force-stop the stream, you should kill and restart the nvargus-daemon service.

$ sudo pkill nvargus-daemon
$ sudo nvargus-daemon

from the software point-of-view,
after you force-stop the camera stream, there will be timeout failures from the camera pipeline. Argus will report EVENT_TYPE_ERROR, and then the application has to shut down gracefully.
hence,
please confirm whether you can restore camera functionality with the software simulated methods,
thanks
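One way to check whether camera functionality has been restored is a short, time-limited capture. This is a sketch under stated assumptions: camera_ok, FRAMES, LIMIT, and RUN are hypothetical names, the frame count and time limit are arbitrary, and gst-launch-1.0 is expected to exit non-zero when the pipeline posts an error. If the daemon is deadlocked the pipeline may never finish, which is why the capture is wrapped in timeout.

```shell
#!/bin/sh
# Sketch only: check whether a camera can stream again after a fault.
# Assumptions: camera_ok, FRAMES, LIMIT, and RUN are hypothetical names;
# 30 frames and a 20 second limit are arbitrary. Set RUN=echo for a dry run.

camera_ok() {
    # $1 is the sensor-id. timeout sends TERM after LIMIT seconds and
    # KILL 5 seconds later, so a hung pipeline cannot block the caller.
    ${RUN:-} timeout -k 5 "${LIMIT:-20}" \
        gst-launch-1.0 nvarguscamerasrc sensor-id="$1" \
        num-buffers="${FRAMES:-30}" ! 'video/x-raw(memory:NVMM)' ! fakesink
}
```

The return status is that of timeout: zero when the capture completes, 124 when the time limit is hit, and otherwise whatever gst-launch-1.0 returned.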

Hi @JerryChang ,

I am working with @JDSchroeder on this issue. Before your edit, I did test the procedure with gstreamer pipelines using nvarguscamerasrc.

I had inconsistent results when issuing “echo 0 > /sys/kernel/debug/camera-video0/streaming”.
Sometimes I was able to cleanly stop the gstreamer pipelines and sometimes I was not.

Even when I was able to cleanly stop the pipelines, I was not able to restart any pipelines in our application after issuing:
sudo pkill nvargus-daemon
sudo nvargus-daemon

After killing & restarting nvargus-daemon, does the application using the daemon also need to be restarted? It looks like the new launch of the daemon is not seeing anything from the existing application.

Why does the daemon need to be killed and restarted?

We are using the nvidia provided nvarguscamerasrc and need to be able to recover from errors without restarting the daemon and application.

If we try to use argus_userautoexposure, do you expect to be able to reconnect to the argus daemon, or do you think that application will need to be stopped and restarted as well?

Does nvarguscamerasrc need additional error handling or does it have the same error handling as the Argus examples?

Thanks,
Cliff

hello cliff.hofman, JDSchroeder,

nvargus-daemon is the service that runs in the background for the camera stack.
if you look into the log, the camera stack reports timeout errors and nvargus-daemon sometimes gets stuck waiting for the sensor stream; eventually, nvargus-daemon enters InvalidState and it is unrecoverable.

according to the readme file, if an application crash or hang occurs, the nvargus-daemon service may be left in a bad state, and the hardware may be unavailable for a short time afterwards; when this occurs it is best to restart the nvargus-daemon service and wait for about 15 seconds before attempting to run another application.
hence,
please kill and restart nvargus-daemon to try to resolve this at the software level.
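The readme guidance above (restart the service, then wait about 15 seconds) can be sketched as a small helper. This assumes the daemon is managed as a systemd unit named nvargus-daemon, which matches the automatic restart behavior described earlier in the thread but is not spelled out there. The names restart_argus_daemon, SETTLE_SECONDS, and RUN are hypothetical.

```shell
#!/bin/sh
# Sketch only: restart nvargus-daemon, then let the hardware settle.
# Assumptions: the daemon runs as the systemd unit "nvargus-daemon";
# restart_argus_daemon, SETTLE_SECONDS, and RUN are hypothetical names.
# Set RUN=echo for a dry run.

restart_argus_daemon() {
    ${RUN:-} sudo systemctl restart nvargus-daemon || return 1
    # The readme suggests waiting about 15 seconds before the next run.
    ${RUN:-} sleep "${SETTLE_SECONDS:-15}"
}
```

Any application that was connected to the old daemon instance would still hold stale sessions after this, so this helper only covers the daemon side.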

FYI,
we tested this locally with two approaches: (1) software-simulated methods and (2) physically disconnecting the camera.
we ran $ argus_userautoexposure -f 1000, generated an error status while streaming, and observed the results.

Jetson TX2:
camera functionality comes back with the software-simulated methods. we do not see a problem with the simulated fault injection.
however, after a physical sensor disconnect, we sometimes have to reboot the board. restarting nvargus-daemon alone does not help.

Jetson Xavier:
camera functionality comes back with both the software-simulated methods and a physical camera disconnect.
thanks