Nvargus crashes with unreliable CSI camera connections on Jetpack 5.1.2

Hello everyone,

I am experiencing issues with an IMX477-based CSI camera (Raspberry Pi HQ camera) connected to a Jetson Orin NX 16GB on an AVerMedia D131 carrier board. Specifically, it seems that upon a single failure on the CSI line, Argus sends an end-of-stream signal instead of simply dropping the frame. A similar problem has been described here and here, as well as in other threads in this forum, and even though NVIDIA appears to be aware of these issues (as is evident from recent nvargus patches mentioning camera stability), the problem still persists. Specifically, I notice the following error in the dmesg log:


[ 541.149641] [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=555659644352 sof_ts=555683971776 gerror_code=2 gerror_data=400063 notify_bits=0"

Attached you will find my detailed dmesg log as well as a trace captured as instructed here.

logs.zip (723.8 KB)

I found that upon such a failure, GStreamer sends an EOS (end-of-stream) signal, which typically stops the camera stream. Restarting my camera driver resolves the issue, but this usually means a few seconds of camera downtime. The issue can be reproduced with the GStreamer viewer:


gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=60/1' ! queue ! nvvidconv ! fpsdisplaysink video-sink='xvimagesink' sync=false
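Until there is a proper fix, one possible mitigation is an external watchdog that simply reruns the pipeline whenever it exits after an EOS. Below is a minimal sketch in Python; the restart limit and backoff are illustrative, and in practice `cmd` would be the `gst-launch-1.0` pipeline above:

```python
import subprocess
import time

def supervise(cmd, max_runs=None, backoff_s=1.0):
    """Rerun cmd every time it exits (e.g. after nvarguscamerasrc
    propagates EOS on a CSI error). Returns the number of runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        subprocess.run(cmd)  # blocks until the pipeline exits
        runs += 1
        time.sleep(backoff_s)
    return runs
```

This does not avoid the few seconds of downtime, but it removes the need for manual intervention.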

I used the exact same cameras with the same cables for two years continuously on Raspberry Pis, and never experienced such issues.

It does seem that this problem is typically triggered by hardware faults. The authors in the linked threads explained that the problem is reproducible by shorting the D+ and D- lines, but the engineers did not seem to understand why that is a problem. I would like to emphasise how important it is for a camera driver to be able to recover from such issues: of course, shorting data lines is an extreme corner case, but flimsy connections are a reality in any real-world robotics application. I managed to reduce the severity of my problems by changing the connection, but with the rapid movements of my mobile robot, vibrations can still cause brief data-line disruptions, and EMI can cause similar issues. In those cases, Argus cannot simply stop operating; it has to be able to detect errors and should just drop the affected frame instead of emitting EOS.

Is there a way to fix this? Please let me know if you need any more data from me.

Regards,

Jan

hello jbl,

may I know which JetPack release version you're using?
according to the Release Notes (r35.4.1), there are improvements in error resiliency.
re-cap as below…

Enhanced error resiliency for improved stability in Argus.

hence,
could you please move to the latest release version (i.e. JetPack 5.1.2 / L4T r35.4.1) for confirmation.

I have the same issue with an IMX477-based CSI camera (Arducam) on an Orin NX 8GB with JetPack 5.1.2 / L4T r35.4.1.
It happens randomly on some devices after a few seconds, minutes, or longer.
Same dmesg error:

[  314.359535] [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=329522895200 sof_ts=329563896608 gerror_code=2 gerror_data=600062 notify_bits=0"

I tried boosting the clocks, with no success:

echo 1 > /sys/kernel/debug/bpmp/debug/clk/vi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/isp/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/nvcsi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked
cat /sys/kernel/debug/bpmp/debug/clk/vi/max_rate |tee /sys/kernel/debug/bpmp/debug/clk/vi/rate
cat /sys/kernel/debug/bpmp/debug/clk/isp/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/isp/rate
cat /sys/kernel/debug/bpmp/debug/clk/nvcsi/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate
cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/emc/rate

Replacing libnvargus.so (Argus pipeline randomly gets error - #4 by JerryChang) didn't help; I am still getting the following errors in syslog:

Nov 26 14:00:17 baseline-cam nvargus-daemon: Module_id 30 Severity 2 : (fusa) Error: InvalidState Status syncpoint signaled but status value not updated in:/capture/src/fusaViHandler.cpp 817
Nov 26 14:00:17 baseline-cam nvargus-daemon: Module_id 30 Severity 2 : (fusa) Error: InvalidState  propagating from:/capture/src/fusaViHandler.cpp 759
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState:  Corr Error Received for sensor 1 .. Continuing!
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 643)
Nov 26 14:00:17 baseline-cam nvargus-daemon: Module_id 30 Severity 2 : (fusa) Error: ResourceAlreadyInUse All captures are already pending, no idle captures available in:/capture/src/fusaViHandler.cpp 633
Nov 26 14:00:17 baseline-cam nvargus-daemon: Module_id 30 Severity 2 : (fusa) Error: ResourceAlreadyInUse  propagating from:/capture/src/fusaViHandler.cpp 475
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function startCaptureInternal(), line 866)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureRecord.cpp, function doCSItoMemCapture(), line 536)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureRecord.cpp, function issueCapture(), line 483)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function issueCaptures(), line 1530)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function issueCaptures(), line 1359)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error ResourceAlreadyInUse: Worker thread CaptureScheduler frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/api/Buffer.cpp, function waitForUnlock(), line 644)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/components/CaptureContainerImpl.cpp, function returnBuffer(), line 426)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState: Capture Scheduler not running (in src/services/capture/CaptureServiceDevice.cpp, function addNewItemToSchedule(), line 1004)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState:  (propagating from src/services/capture/CaptureService.cpp, function addRequest(), line 411)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState:  (propagating from src/components/stages/MemoryToISPCaptureStage.cpp, function doHandleRequest(), line 144)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState: Sending critical error event for Session 0
Nov 26 14:00:17 baseline-cam nvargus-daemon[20147]:  (in src/api/Session.cpp, function sendErrorEvent(), line 1039)
Nov 26 14:00:17 baseline-cam wpa_supplicant[885]: wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 91)
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]: SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout: Sending critical error event for Session 1
Nov 26 14:00:18 baseline-cam nvargus-daemon[20147]:  (in src/api/Session.cpp, function sendErrorEvent(), line 1039)
Nov 26 14:00:18 baseline-cam wpa_supplicant[885]: wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1
Nov 26 14:00:25 baseline-cam wpa_supplicant[885]: message repeated 7 times: [ wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1]
Nov 26 14:00:26 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceEvent.cpp, function wait(), line 59)
Nov 26 14:00:26 baseline-cam nvargus-daemon[20147]: Error: Camera HwEvents wait, this may indicate a hardware timeout occured,abort current/incoming cc for sensor guid 0 count -2078883072
Nov 26 14:00:26 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceEvent.cpp, function wait(), line 59)
Nov 26 14:00:26 baseline-cam nvargus-daemon[20147]: Error: Camera HwEvents wait, this may indicate a hardware timeout occured,abort current/incoming cc for sensor guid 1 count -2078883072
Nov 26 14:00:26 baseline-cam wpa_supplicant[885]: wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1
Nov 26 14:00:37 baseline-cam wpa_supplicant[885]: message repeated 11 times: [ wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1]
Nov 26 14:00:37 baseline-cam nvargus-daemon[20147]: SCF: Error InvalidState: 6 buffers still pending during EGLStreamProducer destruction (in src/services/gl/EGLStreamProducer.cpp, function freeBuffers(), line 300)
Nov 26 14:00:38 baseline-cam wpa_supplicant[885]: wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1
Nov 26 14:00:42 baseline-cam wpa_supplicant[885]: message repeated 4 times: [ wlan0: CTRL-EVENT-SCAN-FAILED ret=-95 retry=1]
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132672
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132669
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132668
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132667
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132666
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: waitForIdleLocked remaining request 132665
Nov 26 14:00:42 baseline-cam nvargus-daemon[20147]: SCF: Error Timeout: waitForIdle() timed out (in src/api/Session.cpp, function waitForIdleLocked(), line 969)

what else can I try?
Thanks,
Yaniv

I should have specified, I am using the latest D131 BSP D131OX-R2.3.1.5.1.2 which uses L4T 35.4.1 (Jetpack 5.1.2). Similar to Yaniv, I also tried boosting the clocks with no success.

Hi everyone,
I have the same issue with an IMX477 (Arducam) on an Orin NX 16GB with JetPack 5.1.1 (L4T 35.3.1). I use the nvarguscamerasrc GStreamer plugin from C++ and mostly record 2-hour videos; I have to restart the application from another process every time it raises an error. It does not happen too often.
The only possible solution seems to be using the LibArgus API to capture frames and reinitializing the CameraProvider after an error; I haven't tried it yet.
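In the meantime, a crude workaround is to have the supervising process watch the nvargus-daemon log and restart the recording application when a fatal error appears. Here is a sketch of the detection part in Python; the pattern list is taken from the logs posted in this thread and is surely incomplete:

```python
import re

# Error lines that precede a dead stream, as seen in the
# nvargus-daemon output posted in this thread.
FATAL_PATTERNS = [
    r"Sending critical error event",
    r"Session has suffered a critical failure",
    r"Worker thread CaptureScheduler frameStart failed",
]
_FATAL_RE = re.compile("|".join(FATAL_PATTERNS))

def needs_restart(log_lines):
    """Return True if any line indicates the capture session died."""
    return any(_FATAL_RE.search(line) for line in log_lines)
```

A supervisor could feed this with the output of `journalctl -f -u nvargus-daemon` and trigger the restart when it returns True; correctable errors (such as the "Corr Error Received ... Continuing!" lines) are deliberately ignored.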

hi all,

could you please share your test pipelines.
let us reproduce the issue locally with… an Orin NX 8GB / JetPack 5.1.2 / Raspberry Pi HQ camera (IMX477).

Thanks a lot for looking into this!

For me, the issue is reproducible on a fresh installation of JetPack when running the standard GStreamer pipeline outlined in my initial post.

The issue is more or less severe depending on the cable I use. Some of my cables are more worn than others; with some, the issue occurs quite quickly (reproducibly, after less than a minute), while with others it may take an hour. It can be reproduced by manually wearing out a cable (for example, through multiple cycles of attaching it to and removing it from the camera) and by flexing it.

this error log looks suspicious.
it means this is a correctable error; it is just being reported in the log, and the camera stack is still processing frames.
however, if broken frames (or intermittent signaling) keep being sent to the camera stack, the capture engine will report timeout errors.

so…
let's narrow down whether this is a hardware issue, since as you said it depends on the cable's worn condition.

I just looked at my system log, and I get similar error messages when the crash occurs:

-- Logs begin at Wed 2023-12-06 01:45:33 UTC, end at Wed 2023-12-06 14:03:54 UTC. --
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: === video-viewer[3775]: CameraProvider destroyed (0xffff0c00fea0)=== video-viewer[3775]: Connection closed (FFFF709EB900)=== video-viewer[3775]: Connection cleaned up (FFFF709EB900)=== video-viewer[9933]: Connection established (FFFF709EB900)=== video-viewer[9933]:>
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/api/Session.cpp, function sendErrorEvent(), line 1039)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Sensor 0 already in same state
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 0, capture sequence ID = 281470681744741 draining session frameEnd events 281470681743362
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Sensor 0 already in same state
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 0, capture sequence ID = 281470681744742 draining session frameEnd events 281470681743361
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 734)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 110)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 0, capture sequence ID = 187647121163622 draining session frameStart events 281470681743361
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 532)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Dec 06 14:01:57 robomaster-1 nvargus-daemon[1224]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)

These errors and symptoms are identical to the ones reported here.

hello jbl,

let me double-confirm what background services are running.
for example, did you put the CPU/GPU under heavy load while testing the camera stream?

here's a pre-built binary update with some potential fixes: Topic274905_Dec07.zip (72.8 KB)
please give it a try based on r35.4.1, and update /usr/lib/aarch64-linux-gnu/tegra/libnvfusacap.so for testing.

Hi JerryChang,

Thanks for the updated binary! I tried it and unfortunately do not notice any improvement. I double-checked that the library was updated properly.

There is no other application running when this issue occurs, the CPU load is minimal.

As documented here, it is necessary to remove R8 to make this camera compatible with the Jetson platform. Did everyone else make a similar modification? I am wondering whether this might have something to do with these issues, i.e. whether the camera might reset itself because of a floating pin.

EDIT: Upon further review of the schematic, this shouldn’t be an issue since there is another resistor R7 that always pulls the enable line to a defined level.

hello jbl,

I'm afraid this is a hardware-level failure that causes intermittent signaling.

could you please give an Argus sample a try, such as userAutoExposure, which has an error handling mechanism implemented.
you may verify the automatic closure of the application: for instance, when a timeout failure is received from the camera pipeline, Argus will report it via EVENT_TYPE_ERROR, and the application has to shut down gracefully.
then, you can restart the app to resume capture without killing the nvargus-daemon service.
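For reference, the recovery flow described above might look roughly like this (pseudocode only, not verified against the actual userAutoExposure sample):

```
loop:
    create CameraProvider, CaptureSession and OutputStream
    create an event queue for EVENT_TYPE_ERROR on the session
    issue a repeating capture request
    wait on the event queue
    on EVENT_TYPE_ERROR:
        stop the repeating request
        destroy the stream, the session and the CameraProvider
        continue loop    # reinitialize and resume capturing
```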

Just an update: After investigating, it turns out that the issue is likely electromagnetic interference.
We’re using new 30 cm cables, and simply rerouting them differently has solved the issue.
Thanks, @JerryChang.


Hi JerryChang,
Could you elaborate on why it is not possible to implement similar error handling mechanisms in the normal camera driver, given that this seems to be no problem on the Raspberry Pi? EMI is a reality in many real-world applications, and the camera driver should not crash if a single frame is corrupted. While it is possible to reduce the severity of this problem, a camera downtime of a few seconds whenever the driver stops working is often not acceptable in real-world applications.
Thanks,
Jan

as you said, there is an error as long as a frame is corrupted.
please check my previous comment (#13): there is an error handling mechanism, and if an error is detected, the application can receive it and exit gracefully.

please improve the quality of the MIPI signals.
we don't have a software implementation to ignore minor errors, or to waive the error status, in the current release.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.