Tx2-4g R32.3.1 nvargus-daemon does not restart 100% of the time

That does not fix or address the problem in any way. The problem is in nvargus-daemon and the improper handling of error events/conditions. Please look through the log file to see the error and where nvargus-daemon locks up.

Here is the log from nvargus-daemon with libgstnvarguscamerasrc.so rebuilt with the 5 second timeout:
nvargus-daemon-deadlock-w-extended-timeout-in-nvarguscamerasrc-plugin.txt (9.4 KB)

Timeouts and errors are inevitable in a real production environment. nvargus-daemon needs to be robust enough to handle errors and properly be able to restart itself OR exit and let systemd restart it. Neither is happening right now when more than one camera is being streamed and an error occurs.

The goal is not to fix or eliminate the timeout or other error conditions. The goal is to fix nvargus-daemon to handle the error conditions in a more graceful manner.

This problem is also affecting my product using L4T R32.5.0. There needs to be a way for nvargus-daemon to restart properly in order to have a viable consumer product. Similar to JDSchroeder, with multiple cameras streaming, nvargus-daemon will lock up and not clean up properly.

Please help to solve this problem as it is affecting our ability to release a product based on TX2.

The timeout solution might help in my case, but in typical Nvidia fashion, the solution is terse, and we the users need cookbook type solutions.

I downloaded the public_sources they pointed at, had to search for the gzip file that contained the nvarguscamerasrc source, and then downloaded it to my Tx2. I read the README and found I needed the jetson_multimedia_api; the make failed looking for <Argus/Argus.h>. After I found a multimedia_api to install, I am now getting compile errors in gstnvarguscamerasrc.cpp:

gstnvarguscamerasrc.cpp: In member function ‘virtual bool ArgusCamera::StreamConsumer::threadExecute(GstNvArgusCameraSrc*)’:
gstnvarguscamerasrc.cpp:302:5: error: reference to ‘Status’ is ambiguous
Status frame_status;
^~~~~~
In file included from /usr/src/jetson_multimedia_api/include/Argus/Argus.h:116:0,
from gstnvarguscamerasrc.cpp:47:
/usr/src/jetson_multimedia_api/include/Argus/Types.h:52:13: note: candidates are: typedef int Status
typedef int Status;
^~~~~~
/usr/src/jetson_multimedia_api/include/Argus/Types.h:93:6: note: enum Argus::Status
enum Status
^~~~~~
gstnvarguscamerasrc.cpp:333:7: error: reference to ‘Status’ is ambiguous
Status argusStatus = iEventError->getStatus();
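The compiler log above lists two candidates for the unqualified name `Status`: a global `typedef int Status` and `enum Argus::Status`. A minimal sketch of how that kind of ambiguity arises, and how qualifying the name resolves it, is shown below. `DemoArgus` and `describeStatus` are hypothetical stand-ins invented for illustration; this is not the real Argus API and not necessarily what the patch mentioned later in the thread does.

```cpp
#include <string>

// Stand-in for a global typedef made visible by some other header
// (the log reports a global `typedef int Status`).
typedef int Status;

// Hypothetical stand-in for the Argus namespace, for illustration only.
namespace DemoArgus {
enum Status { STATUS_OK = 0, STATUS_TIMEOUT = 1 };

// Inside the namespace, `Status` resolves to DemoArgus::Status.
inline Status getStatus() { return STATUS_TIMEOUT; }
}  // namespace DemoArgus

using namespace DemoArgus;

// At this point an unqualified declaration such as
//     Status frame_status;
// fails with "reference to 'Status' is ambiguous", because lookup finds
// both ::Status and DemoArgus::Status. Qualifying the name picks one.
inline std::string describeStatus() {
    DemoArgus::Status s = DemoArgus::getStatus();
    return s == DemoArgus::STATUS_OK ? "ok" : "timeout";
}
```

In the plugin source, the equivalent change would be writing `Argus::Status` wherever the compiler reports the ambiguity (lines 302 and 333 in the log).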

Looks like the api and this version of the code are not compatible, so I am DEAD IN THE WATER, waiting on Nvidia to provide more info.

Terry - Still waiting for a solution and it has been over a year of asking for help!

Hi,
Please share information about your camera modules. If it is from our camera partners, we can work with partners to do further investigation.

Lumenera,

Lumenera tested my gstreamer app on a nvidia camera and got the same problems, as they did on their driver!

Why no information about my make problems?

Terry

You may try this patch (it’s for R32.5, but should be easy to try on R32.3).

Thank Honey_Patouceul for sharing the patch. It should fix ambiguous Argus::Status.

Thanks @DaneLLL @Honey_Patouceul gstnvarguscamerasrc.cpp now compiles.

How are customers expected to be able to find this patch? Or am I just lucky that Honey has a great memory about something that was seen in another post?

Customers like myself do not have time to read all topics.

Nvidia needs a better database/way for customers to find these facts.

Thanks,
Terry

now it compiles but I get
/usr/bin/ld: cannot find -lnvdsbufferpool

I have installed jetson_multimedia_api, and it still does not link. Remember, I am on r32.3.1, not r32.5.

Terry

I see :

locate nvdsbufferpool
/opt/nvidia/deepstream/deepstream-5.0/sources/includes/gstnvdsbufferpool.h
/usr/lib/aarch64-linux-gnu/tegra/libnvdsbufferpool.so
/usr/lib/aarch64-linux-gnu/tegra/libnvdsbufferpool.so.1.0.0

Hi,
We collect known issues in
Jetson/L4T/r32.5.x patches - eLinux.org
Please take a look.

For error about nvdsbufferpool, please install DeepStream SDK and try again.

None of those patches address the issue in nvargus-daemon. nvargus-daemon locks up whenever more than one camera is streaming and one of them has an error (e.g., fence timeout, CSI framing error, etc.). nvargus-daemon needs to automatically restart itself or die and let systemd restart it when a camera has an error. The cleanup/error handling code in nvargus-daemon needs to be looked at and addressed to eliminate the deadlock.

If you can’t eliminate the deadlocks in nvargus-daemon when errors occur, then you need to modify nvargus-daemon to tie it in with the built-in systemd watchdog support. That way if/when nvargus-daemon deadlocks and stops kicking the watchdog, systemd will automatically restart it after the specified timeout. You can read more about systemd including the watchdog support here: systemd.service
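To make "kicking the watchdog" concrete: when a unit sets `WatchdogSec=`, systemd passes a `NOTIFY_SOCKET` environment variable to the service and expects a periodic `WATCHDOG=1` datagram on that socket. The sketch below sends one such message using plain POSIX sockets instead of libsystemd's `sd_notify()`. The function name `notifySystemd` is hypothetical, and nothing here is taken from nvargus-daemon itself, whose source is not public; it only illustrates what the daemon's healthy capture loop would need to call.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Send one state string (for example "WATCHDOG=1") to the datagram socket
// named by $NOTIFY_SOCKET. Returns true if the whole message was sent.
// Returns false if the variable is unset (not started by systemd with
// notification enabled) or if any socket call fails.
inline bool notifySystemd(const std::string& state) {
    const char* path = std::getenv("NOTIFY_SOCKET");
    if (path == nullptr || path[0] == '\0') return false;

    sockaddr_un addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    const std::size_t len = std::strlen(path);
    if (len >= sizeof(addr.sun_path)) return false;
    std::memcpy(addr.sun_path, path, len);

    // Address length: pathname sockets include the trailing NUL,
    // abstract sockets (leading '@') do not and start with a NUL byte.
    socklen_t addrLen =
        static_cast<socklen_t>(offsetof(sockaddr_un, sun_path) + len);
    if (addr.sun_path[0] == '@') {
        addr.sun_path[0] = '\0';
    } else {
        addrLen += 1;
    }

    const int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (fd < 0) return false;
    const ssize_t sent =
        sendto(fd, state.data(), state.size(), 0,
               reinterpret_cast<const sockaddr*>(&addr), addrLen);
    close(fd);
    return sent == static_cast<ssize_t>(state.size());
}
```

A service would call `notifySystemd("WATCHDOG=1")` from the loop that must stay alive, at an interval shorter than `WatchdogSec=`. If that loop deadlocks, the messages stop and systemd treats the service as failed; with a setting such as `Restart=on-watchdog` in the unit it would then restart it.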

Hi @JDSchroeder
Could you share information about the camera, vendor and model ID? From the log, it looks like CSI cannot catch the correct FE:

A FS packet was found for a virtual channel that was already in frame. An errored FE packet was injected before FS was allowed through.
captureErrorCallback Stream 0.0 capture 10869 failed: ts 9212873796096 frame 278 error 2 data 0x000000a0

SCF: Error Timeout: ISP port 0 timed out! (in src/services/capture/NvIspHw.cpp, function waitIspFrameEnd(), line 478)

If the vendor is our camera partner, we can work with them to debug why frame end signal cannot be correctly captured.

So how do I determine what JetPack I am on? I am using r32.3.1 on a tx2-4g.
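On Jetson boards the installed L4T release is normally recorded in the first line of `/etc/nv_tegra_release`, in a form like `# R32 (release), REVISION: 3.1, GCID: ..., BOARD: ...`. The sketch below extracts the release string from such a line; the function names `parseL4tRelease` and `readL4tRelease` are hypothetical helpers, and the line format is an assumption based on how the file usually looks. The JetPack version then has to be looked up from NVIDIA's release notes (R32.3.1 is the L4T release that shipped with JetPack 4.3).

```cpp
#include <fstream>
#include <regex>
#include <string>

// Extract e.g. "32.3.1" from a line such as
//   "# R32 (release), REVISION: 3.1, GCID: ..., BOARD: ..."
// Returns an empty string if the line does not have that shape.
inline std::string parseL4tRelease(const std::string& line) {
    static const std::regex pattern(
        R"(R(\d+) \(release\), REVISION: ([0-9.]+))");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) return "";
    return m[1].str() + "." + m[2].str();
}

// Read the first line of the release file and parse it.
// Returns an empty string if the file is missing or unreadable.
inline std::string readL4tRelease(
    const std::string& path = "/etc/nv_tegra_release") {
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line)) return "";
    return parseL4tRelease(line);
}
```

Knowing the exact release matters here because packages such as DeepStream are built against the libraries of one specific JetPack version, which is consistent with the unmet `libnvinfer5` dependency reported below.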

I tried to download deepstream-4.0_4.0-1_arm64.deb and install it on my tx2, got

sudo apt install /tmp/deepstream-4.0_4.0-1_arm64.deb
[sudo] password for tbuckley:
Reading package lists… Done
Building dependency tree
Reading state information… Done
Note, selecting ‘deepstream-4.0’ instead of ‘/tmp/deepstream-4.0_4.0-1_arm64.deb’
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
deepstream-4.0 : Depends: libnvinfer5 (>= 5.1.2) but it is not installable
E: Unable to correct problems, you have held broken packages.
tbuckley@BaseSystem_0_5:/tmp$ sudo apt-get install libnvinfer5
Reading package lists… Done
Building dependency tree
Reading state information… Done
E: Unable to locate package libnvinfer5

So I am dead in the water.

What next?
Terry

The issue with nvargus-daemon can be reproduced with any camera and is not vendor or configuration specific. The main requirement is at least two cameras streaming simultaneously over the CSI bus.

We have reproduced the nvargus-daemon lock-up issue with several cameras. One particular camera is the Leopard Imaging AR0231 FPD-Link III camera (LI-JEVA-AR0231-FPDLINKIII). The AR0231 is a 1928x1208 30FPS camera. Having two of these stream with GStreamer to a fakesink is enough to reproduce the issue. We have also reproduced the issue with two other image sensors and camera modules. The problem has nothing to do with a given image sensor. The problem is in the handling of errors within nvargus-daemon. A design can never prevent every sort of error from occurring. However, nvargus-daemon should not deadlock the video pipeline. Even if it cannot properly handle an error condition, it should gracefully restart itself instead of deadlocking the video pipeline and preventing future restarts of GStreamer from ever working.

Here are the steps to reproduce:

  1. Stream two or more cameras (CSI-based sensors) using GStreamer
    gst-launch-1.0 -v nvarguscamerasrc sensor-id=0 sensor-mode=0 ! 'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12' ! fakesink nvarguscamerasrc sensor-id=1 sensor-mode=0 ! 'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12' ! fakesink
  2. Disconnect either camera while streaming

These steps will cause nvargus-daemon to deadlock and prevent any future execution of gst-launch-1.0 from working properly, until nvargus-daemon is manually restarted.

There’s no need to debug the frame end signal not being correctly captured - bad stuff happens all the time in the real world. The occasional problem in the CSI communication is expected. The deadlock of nvargus-daemon is the problem. No amount of work on the CSI communication will help in the unfortunate event where an error condition does occur and nvargus-daemon locks up. How can we fix nvargus-daemon to handle the “frame end signal” error and not deadlock the video pipeline with two or more cameras streaming simultaneously?

If you need me to run a special build of nvargus-daemon instrumented to log more data and show exactly where it is deadlocking and failing to restart, I can certainly do that. This will help you fix the issue in nvargus-daemon so that it can restart itself entirely or properly clean up resources so that the video pipeline can remain functional for future launches of GStreamer.

Not sure Jetson CSI supports this by default. I may be wrong, but I don’t think it is hot pluggable, although it might work in some cases.

We’re experiencing identical issues. Trying to run multiple cameras, we are encountering camera blackouts and occasional crashes, identical to what Terry has been seeing, and we’re using completely different cameras from Vision Components.

Hot plugging is not an intended use case. There is no desire to hot plug the camera. This is a way to reproduce the issue easily for any camera system that has a cable going between the camera and the board (e.g., SerDes, FPD-Link, GMSL, CSI ribbon cable, etc.). The real-world causes are ESD, electrical noise, vibration, shock, and cosmic rays.

Here are some other ways you can reproduce the issue without physically disconnecting the camera while streaming:

  • Disconnect camera power (i.e., i2c write or GPIO)
  • Reset image sensor
  • Inject error(s) on the CSI packets
  • Change the image sensor clock
  • Change the image sensor (CSI transmitter) settings to cause an error
  • Et cetera

Hi,
On r32.5.1, we have error handling code in nvarguscamerasrc:

    if (iEvent->getEventType() == EVENT_TYPE_ERROR)
    {
      if (src->stop_requested == TRUE)
        break;

      src->argus_in_error = TRUE;
      const IEventError* iEventError = interface_cast<const IEventError>(event);
      Status argusStatus = iEventError->getStatus();
      error = g_error_new (domain, argusStatus, getStatusString(argusStatus));
      GstMessage *message = gst_message_new_error (GST_OBJECT(src), error, "Argus Error Status");
      gst_element_post_message (GST_ELEMENT_CAST(src), message);
      g_mutex_lock (&src->argus_buffers_queue_lock);
      src->stop_requested = TRUE;
      g_mutex_unlock (&src->argus_buffers_queue_lock);
      break;
    }

When the camera board is removed, you should receive the Argus error status and can exit the application. Or in certain cases, is no error reported from Argus?

hello JDSchroeder,

have you tried the Argus sample applications to reproduce the same?
for example, Argus/samples/userAutoExposure

here’s an FYI,
the camera stack by design has a single CaptureService thread that submits requests for all captures on all active sessions.

for example,
in your dual camera use-case, let’s say cam-A and cam-B.
the camera stack expects all sensors (i.e. both cam-A and cam-B) to have the same start-up time. if cam-A has a delayed frame, it impacts the capture submission on the other session, cam-B.
as a WAR, you could trigger the slower sensor first to buy some time for the other camera sensor.
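The effect of a single submission thread can be illustrated with a toy model. This is only a sketch under the assumption that the one thread visits sessions in order and blocks on each until its frame is ready; `Session` and `submitTimes` are hypothetical names, and the real CaptureService internals are not public.

```cpp
#include <string>
#include <vector>

struct Session {
    std::string name;
    int waitMs;  // time the single thread blocks on this session's frame
};

// One pass of a single service thread over all sessions, in order.
// Returns, for each session, the time in ms (from the start of the pass)
// at which its capture request gets submitted.
inline std::vector<int> submitTimes(const std::vector<Session>& sessions) {
    std::vector<int> times;
    int now = 0;
    for (const Session& s : sessions) {
        now += s.waitMs;  // thread is blocked until this frame arrives
        times.push_back(now);
    }
    return times;
}
```

With `{{"cam-A", 33}, {"cam-B", 33}}` the model submits at 33 ms and 66 ms. If cam-A stalls for 500 ms, cam-B's submission moves to 533 ms even though cam-B itself is healthy, which is the cross-session impact described above.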