JP5.1 nvarguscamera doesn't recover from single NVCSI failure

hello pepijn.vanheiningen,

I cannot give you a solid answer since we have not root caused the issue.

let’s assume it’s a bug in the camera stack.
in that case, you could update the pre-built binary (e.g. /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so); this can be done remotely via ssh.
applying the fix would be simple, and the update takes effect after a system reboot.


Hey Jerry,

Do you happen to have any update on the topic? Did you get some resources for investigation? Any updates on a potential timeline?

If there is anything I can do to help out, let me know!

Thanks in advance!

hello pepijn.vanheiningen,

please refer to the announcement topic, JetPack 5.1.1 is now live,

please test with the latest release image.
the error recovery mechanism should be functional with JetPack 5.1.1 / l4t-35.3.1.
here are the test steps in brief.
step1) launch a gst pipeline to enable camera preview
$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),framerate=30/1,format=NV12' ! nvvidconv ! xvimagesink
step2) send the commands below on a terminal to shut down the stream
# cd /sys/kernel/debug/camera-video0
# echo 0 > streaming
the expected behavior is that the camera app terminates gracefully, and we should not need to restart the Argus daemon service to restore camera functionality.
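
The two steps above can be wrapped in a small helper that reports whether the camera app exits on its own after the stream is stopped. This is only a sketch under assumptions: the function name stream_stop_test and the warm-up and limit arguments are made up for illustration, and writing to the debugfs node needs root.

```shell
# Usage: stream_stop_test STREAMING_NODE WARMUP_S LIMIT_S COMMAND [ARGS...]
# Starts COMMAND in the background, waits WARMUP_S seconds, writes 0 to
# STREAMING_NODE, then gives COMMAND up to LIMIT_S seconds to exit.
# Returns 0 if COMMAND exited on its own, 1 if it stalled and was killed.
stream_stop_test() {
    node=$1; warmup=$2; limit=$3
    shift 3
    "$@" &
    pid=$!
    sleep "$warmup"
    echo 0 > "$node"              # same effect as step2: echo 0 > streaming
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 1              # still running after LIMIT_S: treat as a stall
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid" 2>/dev/null
    return 0
}
```

For the test described here, the node would be /sys/kernel/debug/camera-video0/streaming and COMMAND would be the gst-launch-1.0 pipeline from step1; the warm-up and limit values are a matter of choice.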


I just tried it on 35.3.1, but it still is stalling for me.

I’m doing the exact same steps as mentioned here, but the pipeline doesn’t continue to run. I’m also not seeing the NVCSI error handling that happened on 32.4.4.

CONSUMER: ERROR OCCURRED
ERROR: from element /GstPipeline:pipeline0/GstNvArgusCameraSrc:nvarguscamerasrc0: TIMEOUT
Additional debug info:
Argus Error Status
Execution ended after 0:00:08.178493900
Setting pipeline to NULL ...
GST_ARGUS: Cleaning up
(Argus) Error Timeout:  (propagating from src/rpc/socket/client/ClientSocketManager.cpp, function send(), line 137)
(Argus) Error Timeout:  (propagating from src/rpc/socket/client/SocketClientDispatch.cpp, function dispatch(), line 91)
(Argus) Error Timeout:  (propagating from src/rpc/socket/client/ClientSocketManager.cpp, function send(), line 137)
(Argus) Error Timeout:  (propagating from src/rpc/socket/client/SocketClientDispatch.cpp, function dispatch(), line 91)

Logs from nvargus-daemon:

Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error InvalidState: Timeout waiting on frame start sensor guid 0, capture sequence ID = 80 (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 514)
Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error InvalidState:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error InvalidState: Worker thread ViCsiHw frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 600)
Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Mar 30 07:42:58 camera nvargus-daemon[6549]: SCF: Error Timeout: Worker thread ViCsiHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
Mar 30 07:42:58 camera nvargus-daemon[6549]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776
Mar 30 07:43:00 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDeviceViCsi.cpp, function waitCompletion(), line 368)
Mar 30 07:43:00 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function pause(), line 936)
Mar 30 07:43:00 camera nvargus-daemon[6549]: SCF: Error Timeout: During capture abort, syncpoint wait timeout waiting for current frame to finish (in src/services/capture/CaptureServiceDevice.cpp, function handleCancelSourceRequests(), line 1029)
Mar 30 07:43:03 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 180
Mar 30 07:43:03 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 179
Mar 30 07:43:03 camera nvargus-daemon[6549]: SCF: Error Timeout: waitForIdle() timed out (in src/api/Session.cpp, function waitForIdleLocked(), line 922)
Mar 30 07:43:03 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/api/Session.cpp, function abortCaptures(), line 888)
Mar 30 07:43:59 camera nvargus-daemon[6549]: SCF: Error InvalidState: 2 buffers still pending during EGLStreamProducer destruction (propagating from src/services/gl/EGLStreamProducer.cpp, function freeBuffers(), line 300)
Mar 30 07:43:59 camera nvargus-daemon[6549]: SCF: Error InvalidState:  (propagating from src/services/gl/EGLStreamProducer.cpp, function ~EGLStreamProducer(), line 49)
Mar 30 07:44:04 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 180
Mar 30 07:44:04 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 179
Mar 30 07:44:04 camera nvargus-daemon[6549]: SCF: Error Timeout: waitForIdle() timed out (in src/api/Session.cpp, function waitForIdleLocked(), line 922)
Mar 30 07:44:04 camera nvargus-daemon[6549]: (Argus) Error Timeout:  (propagating from src/api/CaptureSessionImpl.cpp, function destroy(), line 216)
Mar 30 07:44:05 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDeviceViCsi.cpp, function waitCompletion(), line 368)
Mar 30 07:44:05 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function pause(), line 936)
Mar 30 07:44:05 camera nvargus-daemon[6549]: SCF: Error Timeout: During capture abort, syncpoint wait timeout waiting for current frame to finish (in src/services/capture/CaptureServiceDevice.cpp, function handleCancelSourceRequests(), line 1029)
Mar 30 07:44:09 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 180
Mar 30 07:44:09 camera nvargus-daemon[6549]: waitForIdleLocked remaining request 179
Mar 30 07:44:09 camera nvargus-daemon[6549]: SCF: Error Timeout: waitForIdle() timed out (in src/api/Session.cpp, function waitForIdleLocked(), line 922)
Mar 30 07:44:09 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/api/Session.cpp, function abortCaptures(), line 888)
Mar 30 07:44:09 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/api/Session.cpp, function shutdown(), line 407)
Mar 30 07:44:10 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDeviceViCsi.cpp, function waitCompletion(), line 368)
Mar 30 07:44:10 camera nvargus-daemon[6549]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function pause(), line 936)
Mar 30 07:44:10 camera nvargus-daemon[6549]: SCF: Error Timeout: During capture abort, syncpoint wait timeout waiting for current frame to finish (in src/services/capture/CaptureServiceDevice.cpp, function handleCancelSourceRequests(), line 1029)
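
To see which failures dominate a daemon log like the one above, the error lines can be counted per error kind and function. A minimal sketch, assuming the log arrives on stdin in the format shown; the function name summarize_scf_errors is made up for illustration.

```shell
# Reads nvargus-daemon log lines on stdin and prints "<count> <ErrorKind> <function>",
# most frequent first. Lines without "Error <Kind>: ... function <name>()" are skipped.
summarize_scf_errors() {
    sed -n 's/.*Error \([A-Za-z]*\):.*function \([~A-Za-z_0-9]*\)().*/\1 \2/p' \
        | sort | uniq -c | sort -rn
}
```

On a systemd-based image this could be fed with something like `journalctl -u nvargus-daemon | summarize_scf_errors`.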

hello pepijn.vanheiningen,

thanks for sharing test results.

may I know how you enabled the camera stream?
could you please also try an Argus sample, e.g. Argus/public/samples/userAutoExposure

Still the same as before:

gst-launch-1.0 nvarguscamerasrc ee-mode=0 tnr-mode=0 aeantibanding=0 silent=false ! fakesink

I have also tried userAutoExposure, but it stops working as well when the sensors are reset.

hello pepijn.vanheiningen,

please also see the forum topic, How to make Argus in Jetson 35.2.1 recover after a corrupted frame?

it looks like the error handling mechanism did not work; we can reproduce the issue locally.
test environment: l4t-r35.3.1 + AGX Orin + IMX274.

let us check this issue internally.

Hi,

We hit the exact same issue. Any updates on this?

hello casperlyngesen.mogensen,

here’s a pre-built update, Topic243051_Jun05.7z (1.8 MB)
could you please update the binaries with the attachment, on top of JetPack-5.1.1/l4t-r35.3.1?
you may perform a warm reboot (i.e. $ sudo reboot) after replacing the binary files.
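
When replacing binaries like this, it can help to keep a copy of the original file so the change can be rolled back. A minimal sketch, not an official procedure: the function name replace_with_backup and the .orig suffix are made up for illustration, and both paths are passed in by the caller.

```shell
# Usage: replace_with_backup NEW_FILE TARGET
# Copies NEW_FILE over TARGET. The first time TARGET is replaced, the
# original is saved as TARGET.orig; later runs never overwrite that backup.
replace_with_backup() {
    new_file=$1
    target=$2
    [ -f "$new_file" ] || { echo "missing: $new_file" >&2; return 1; }
    if [ -f "$target" ] && [ ! -e "$target.orig" ]; then
        cp -p "$target" "$target.orig" || return 1
    fi
    # Overwriting an existing file with cp keeps the target's mode and owner.
    cp "$new_file" "$target"
}
```

It would be run as root once per file from the attachment, with the matching install path as TARGET (the archive layout is not described in this thread), followed by the reboot.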

Hi

Thank you very much, I will test it out this week and report back with results

Best Regards

Casper Mogensen

Hi again

I have tested for a few days now. There is an improvement in the general handling of bad frames, but occasionally there is a segmentation fault in nvargus-daemon, which sometimes causes my gstreamer pipeline (Python + OpenCV) to hang.

Anything you would like from me regarding logs for the seg fault?

We will start some long-term testing in the coming days.

Best Regards
Casper

hello casperlyngesen.mogensen,

would you please narrow down the issue? you may exclude OpenCV for testing.
for example, can you reproduce the same by simply running a gst pipeline to launch the camera preview?

Sure, would you like logs from Argus?

It can take hours to hit the seg fault, but I will get back to you when it happens again.

yes, please share Argus daemon log, and also kernel logs for reference.

Hi

I actually still had the fault from earlier today in the systemd logs; I have attached the log from nvargus-daemon. I do not have the kernel log from that time, but the only relevant entry I see is this:

[11536.474315] [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=11558435347200 sof_ts=11558435754816 gerror_code=2 gerror_data=a2 notify_bits=0"

I will try to recreate it without OpenCV, but I do not know when that will happen.

argus.log (58.2 KB)

BTW,
how often do you see the segmentation fault? what’s the failure rate?
I tested locally with the steps mentioned in comment #28, and I don’t see such errors on my side.

It happens after the camera has been re-initialised a number of times. If you wrap that command in a while true loop on the command line, that mimics how I run it.

while true; do gst-launch-1.0 nvarguscamerasrc sensor-id=6 ! 'video/x-raw(memory:NVMM),framerate=30/1,format=NV12' ! nvvidconv ! xvimagesink; done
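
To get a rough failure rate out of a loop like this, each run can be logged with its exit status and duration. A minimal sketch, assuming a bounded number of runs is acceptable; the function name run_logged and the log format are made up for illustration.

```shell
# Usage: run_logged LOGFILE RUNS COMMAND [ARGS...]
# Runs COMMAND RUNS times in a row and appends one line per run to LOGFILE:
#   run=<n> status=<exit status> duration_s=<seconds>
run_logged() {
    log=$1; runs=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        status=0
        "$@" || status=$?
        end=$(date +%s)
        echo "run=$i status=$status duration_s=$((end - start))" >> "$log"
        i=$((i + 1))
    done
}
```

Note that the status only describes the client process passed as COMMAND; a crash inside nvargus-daemon itself would still have to be looked up in the daemon log.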

I had a crash again last night after some hours.
argus.log (10.6 KB)

hello casperlyngesen.mogensen,

according to the logs, there’s a timeout, and the software stack is handling this error.
the internal process needs 5 seconds (or more) to recover the state.

may I know what exactly the failure is?
for example,
is there intermittent signaling on the sensor side?
or have you physically disconnected/reconnected the camera device?

let me double confirm what happens after the segmentation fault.
(1) are you able to interrupt the process and re-run the gst pipeline?
(2) is it possible to recover by restarting the Argus daemon?
$ sudo pkill nvargus-daemon
$ sudo systemctl start nvargus-daemon

Hi

Yes, I can recover just fine. I do not need to restart nvargus-daemon; I just need to restart my application.
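
Since restarting the application is enough and the stack was said to need about 5 seconds to recover, the restart can be delayed accordingly. A minimal sketch, assuming the application exits with a nonzero status on failure; the function name retry_with_delay is made up for illustration.

```shell
# Usage: retry_with_delay DELAY_S MAX_TRIES COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_S seconds between attempts.
# Returns 0 on the first success, 1 after MAX_TRIES failed attempts.
retry_with_delay() {
    delay=$1; tries=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$tries" ] && return 1
        sleep "$delay"            # give the camera stack time to recover
        n=$((n + 1))
    done
}
```

A DELAY_S of 5 or more would match the recovery time mentioned above; the right value for a given setup is a guess until measured.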