JP5.1 nvarguscamerasrc doesn't recover from a single NVCSI failure

Hello there!

As part of our migration from L4T 32.4.4 to 35.1 on the Xavier AGX, we're seeing changes in the error handling of the camera input.

I’m running a very simple pipeline:
gst-launch-1.0 nvarguscamerasrc ee-mode=0 tnr-mode=0 aeantibanding=0 silent=false ! fakesink

In L4T 32.4.4, whenever we got a broken frame, we saw the error below occur in the nvargus-daemon; however, the pipeline continued to run.

Feb 16 09:42:38 camera nvargus-daemon[6204]: NvCaptureStatusErrorDecode Stream 0.0 failed: sof_ts 68330504004640 eof_ts 68330637136320 frame 0 error 2 data 0x000000a0
Feb 16 09:42:38 camera nvargus-daemon[6204]: NvCaptureStatusErrorDecode Capture-Error: CSIMUX_FRAME (0x00000002)
Feb 16 09:42:38 camera nvargus-daemon[6204]: CsimuxFrameError_Regular : 0x000000a0
Feb 16 09:42:38 camera nvargus-daemon[6204]:     Stream ID                [ 2: 0]: 0
Feb 16 09:42:38 camera nvargus-daemon[6204]:         
Feb 16 09:42:38 camera nvargus-daemon[6204]:     VPR state from fuse block    [ 3]: 0
Feb 16 09:42:38 camera nvargus-daemon[6204]:         
Feb 16 09:42:38 camera nvargus-daemon[6204]:     Frame end (FE)              [ 5]: 1
Feb 16 09:42:38 camera nvargus-daemon[6204]:         A frame end has been found on a regular mode stream.
Feb 16 09:42:38 camera nvargus-daemon[6204]:     FS_FAULT                    [ 7]: 1
Feb 16 09:42:38 camera nvargus-daemon[6204]:         A FS packet was found for a virtual channel that was already in frame.An errored FE packet was injected before FS was allowed through.
Feb 16 09:42:38 camera nvargus-daemon[6204]:     Binary VC number [3:2]   [27:26]: 0
Feb 16 09:42:38 camera nvargus-daemon[6204]:         To get full binary VC number, user need to concatenate VC[3:2] and VC[1:0] together.
Feb 16 09:42:38 camera nvargus-daemon[6204]: SCF: Error InvalidState: Capture error with status 2 (channel 0) (in src/services/capture/NvCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 880)

In L4T 35.1 we see the error below instead, and the pipeline stops completely. That is a problem for us: some sensors hit this issue too often to tear down and restart the pipeline every time.

Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error InvalidState: Timeout waiting on frame start sensor guid 0, capture sequence ID = 612 (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 507)
Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error InvalidState:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error InvalidState: Worker thread ViCsiHw frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error Timeout:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 593)
Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 16 09:20:50 camera nvargus-daemon[2164]: SCF: Error Timeout: Worker thread ViCsiHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 16 09:20:50 camera nvargus-daemon[2164]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776

Is there a way to enable the error recovery on 35.1 so the pipeline will continue to run, even when these types of errors occur?


Hi @pepijn.vanheiningen

I haven't seen this error before. Could you try using the infinite timeout on nvargus before capturing, to see whether it changes this behavior without any drawbacks?

sudo service nvargus-daemon stop
sudo enableCamInfiniteTimeout=1 nvargus-daemon
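
If you want the setting to persist across reboots, a minimal sketch using a standard systemd drop-in should also work (the override path follows the usual systemd convention; adjust to your setup):

sudo mkdir -p /etc/systemd/system/nvargus-daemon.service.d
sudo tee /etc/systemd/system/nvargus-daemon.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment=enableCamInfiniteTimeout=1
EOF
sudo systemctl daemon-reload
sudo systemctl restart nvargus-daemon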

Regards,
Roberto Gutierrez,
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com/

Hi Roberto, thanks for your help! I tried the infinite timeout, but unfortunately it doesn't help; I'm still getting the same error.


hello pepijn.vanheiningen,

may I know what the exact failure is whenever you get a broken frame?
is it due to an unstable MIPI signal, or something else?

Hi JerryChang, thank you for your response. We're still investigating the root cause of the issue. Unfortunately we already have many devices in the field that have this problem, so we need to be able to handle the error without the entire pipeline shutting down.

hello pepijn.vanheiningen,

could you please also test with an Argus sample app, for example userAutoExposure.
this is a sample application which includes error handling; please try your use-case with this instead.
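in case it's useful, a rough sketch of how the samples are usually built and run (the source location and binary name here are assumptions and may differ per JetPack release; please check the README shipped with the samples):

cd /usr/src/jetson_multimedia_api/argus   # location assumed
mkdir -p build && cd build
cmake .. && make
./samples/userAutoExposure/argus_userautoexposure   # binary name assumed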
thanks

Hey Jerry!

I’m getting the same error with the userAutoExposure sample. It freezes when the error happens, and I get this in the nvargus-daemon log:

Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error InvalidState: Timeout waiting on frame start sensor guid 0, capture sequence ID = 2150 (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 507)
Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error InvalidState:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error InvalidState: Worker thread ViCsiHw frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error Timeout:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 593)
Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 17 10:29:54 camera nvargus-daemon[23913]: SCF: Error Timeout: Worker thread ViCsiHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 17 10:29:54 camera nvargus-daemon[23913]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776

hello pepijn.vanheiningen,

it shows a different error on r35.1: the capture engine waits for frames until it times out.
a timeout is more critical, and it is sometimes caused by sensor configuration issues.

may I know what the exact failure is? is it due to an unstable MIPI signal?

I can simulate it with a small script we created: essentially, the MIPI signal stops for a short period of time (a few frames) and then starts again. Running the same script on 32.4.4 and 35.1 shows the difference in the error.

So it isn't really a difference in the stability of the MIPI signal, but something inside the error handling. I hope there is a way to get the original error handling back, where it doesn't throw this 'Timeout waiting on frame start' error but only logs messages about issues with the MIPI signal.

hello pepijn.vanheiningen,

may I know how you interrupt the stream for testing?
actually, you could toggle the debug node /sys/kernel/debug/camera-video0/streaming to alter the camera stream.
with the camera application running, please use the command below to terminate the video stream:
# echo 0 > /sys/kernel/debug/camera-video0/streaming
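and write 1 back to the same node to resume the stream:
# echo 1 > /sys/kernel/debug/camera-video0/streaming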

We have some specific hardware that processes the MIPI signal; we reset that chip.

I will try to toggle the debug node to see if I can get the same results with that!

The directory /sys/kernel/debug/camera-video0 does not exist.

/sys/kernel/debug/camera-video0/streaming: No such file or directory

hello pepijn.vanheiningen,

please examine the release tag: $ cat /etc/nv_tegra_release
I've confirmed this debug node is created on JP-5.1,
for example,

/sys/kernel/debug# ll camera-video*
camera-video0:
total 0
drwxr-xr-x  2 root root 0 Feb 15 07:45 ./
drwx------ 96 root root 0 Feb 15 07:46 ../
-rw-r--r--  1 root root 0 Feb 15 07:45 streaming

camera-video1:
total 0
drwxr-xr-x  2 root root 0 Feb 15 07:45 ./
drwx------ 96 root root 0 Feb 15 07:46 ../
-rw-r--r--  1 root root 0 Feb 15 07:45 streaming

or…
may I know what camera type you're using? is it a bayer sensor using the CSI interface?

We are actually running 35.1/JP5.1:

head -n 1 /etc/nv_tegra_release
# R35 (release), REVISION: 1.0, GCID: 31250864, BOARD: t186ref, EABI: aarch64, DATE: Thu Aug 11 03:37:46 UTC 2022

Yes, we're using a bayer sensor over the CSI interface, but we're still not seeing the camera-video folders:

/sys/kernel/debug# ll camera-video*
ls: cannot access 'camera-video*': No such file or directory

Looks like that debugfs node is not implemented in the driver for our sensor.

All right, so I implemented the camera-video* endpoint in our sensor driver. I briefly stop and restart the sensor as follows, while running the gstreamer pipeline in another terminal.

#!/bin/bash
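# Briefly interrupt the camera stream via the debugfs node, then resume it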
echo 0 > /sys/kernel/debug/camera-video0/streaming
sleep 0.1
echo 1 > /sys/kernel/debug/camera-video0/streaming

This gives me the same 'Timeout waiting on frame start sensor guid 0' error as before. Do you see the same results on your camera?

P.S. I’m also getting an error in dmesg: [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=1760342946848 sof_ts=1760343144128 gerror_code=2 gerror_data=400 notify_bits=0"

hello pepijn.vanheiningen,

FYI, I can reproduce the same issue on the reference camera board.
for example,

Feb 21 13:38:34 nvidia-desktop nvargus-daemon[1789]: === gst-launch-1.0[18536]: CameraProvider initialized (0xffffa8684d70)SCF: Error InvalidState: Timeout waiting on frame start sensor guid 0, capture sequence ID = 942 (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 524)
Feb 21 13:38:34 nvidia-desktop nvargus-daemon[1789]: SCF: Error InvalidState:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 21 13:38:34 nvidia-desktop nvargus-daemon[1789]: SCF: Error InvalidState: Worker thread ViCsiHw frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 21 13:38:35 nvidia-desktop nvargus-daemon: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776
Feb 21 13:38:35 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 610)
Feb 21 13:38:35 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Feb 21 13:38:35 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout: Worker thread ViCsiHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
Feb 21 13:38:37 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDeviceViCsi.cpp, function waitCompletion(), line 368)
Feb 21 13:38:37 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function pause(), line 936)
Feb 21 13:38:37 nvidia-desktop nvargus-daemon[1789]: SCF: Error Timeout: During capture abort, syncpoint wait timeout waiting for current frame to finish (in src/services/capture/CaptureServiceDevice.cpp, function handleCancelSourceRequests(), line 1029)

this is a regression; let me arrange resources for checking this.
in the meanwhile,
please use the commands below as a temporary solution: kill and restart the nvargus-daemon service to restore camera functionality.
$ sudo pkill nvargus-daemon
$ sudo systemctl start nvargus-daemon
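
if you need to automate that in the meantime, a minimal watchdog sketch along these lines may help; it only automates the restart workaround above (the match string is taken from your logs, and the script is an assumption-based example, to be run as root):

#!/bin/bash
# Hypothetical watchdog: restart nvargus-daemon whenever the timeout error
# from this thread appears in the journal. The capture application must be
# restarted separately after the daemon comes back.
journalctl -fu nvargus-daemon --no-pager |
while read -r line; do
  case "$line" in
    *"Timeout waiting on frame start"*)
      systemctl restart nvargus-daemon
      ;;
  esac
done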

Thank you for reproducing it! Unfortunately the temporary solution doesn't work for us, since this can happen many times per hour. Restarting the nvargus-daemon and our pipelines takes a while, during which we lose video.

Do you have any idea when a fix might be available? This is currently blocking the production of our new cameras.

hello pepijn.vanheiningen,

we haven't root-caused the issue yet, and it may take some time to figure out a solution.
let me arrange resources for the investigation; you should also expect that this won't be fixed soon.

Thanks for your continued effort on this.

One additional question: do you have any insight into whether we will be able to apply the fix to our Xaviers over-the-air later, or will it need to be applied when flashing the device?

Please keep me up-to-date if you learn more about the root cause of the problem!