Argus: capture error handling is broken

We are using the TX2 SOM and have 6 cameras connected over FPD-Link III. There are three DS90UB954 deserializers, each connected to 4 CSI-2 lanes and receiving the streams of 2 cameras on different ports. The camera data is mapped to different VC-IDs on the UB954.

We use libnvargus.so for capturing streams from multiple cameras simultaneously. Occasionally we are seeing transmission errors on the FPD-Link III which result in a corrupt CSI-2 packet. This results in error messages from ARGUS being reported, on the serial console as well as via the API.

However, knowing about these errors is of little use. It is not possible to simply capture the next frame in that CaptureSession, nor to restart the faulted CaptureSession. It is even impossible to stop all running captures and restart them. Somewhere in libnvargus.so there appears to be a deadlock that prevents waitForIdle() from returning and the CaptureSession object(s) from being deleted. The application using libnvargus.so simply hangs until it calls the C function exit() from another thread, so that the Linux kernel releases all resources and puts the hardware back into a non-operational state.

We also noticed that once a CSI-2 error is detected for a single CaptureSession, all other CaptureSessions that are running concurrently become inoperable as well. They either report errors or just hang.

We would expect one of the following things to happen in case of a CSI-2 transmission error:

  1. The CaptureSession delivers the already recorded frames. Instead of the defective frame an error is reported and afterwards the next valid frame can be captured again without any further intervention.
  2. Alternatively, the CaptureSession reports an error and becomes invalid. It is the responsibility of the application to destroy and re-create the faulted CaptureSession to resume capture. All other CaptureSessions remain valid and operational.
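The second expectation can be sketched with a mock, to make the intended isolation concrete. This is only an illustration under stated assumptions: `MockSession` and `captureWithRecovery` are hypothetical names invented here, not part of the Argus API, and the "fault" is simulated by a frame index rather than by a real CSI-2 error.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a capture session; NOT the Argus API.
// faultAt is the index of the frame that arrives corrupted (-1: never).
struct MockSession {
    int faultAt;
    int next = 0;
    bool valid = true;
    explicit MockSession(int f) : faultAt(f) {}

    // A corrupted frame invalidates this session only.
    bool acquireFrame() {
        if (!valid) return false;
        if (next++ == faultAt) { valid = false; return false; }
        return true;
    }
};

// Captures `frames` frames from every session. When a session faults, only
// that session is destroyed and re-created; the others keep running.
// Returns the number of good frames delivered per session.
std::vector<int> captureWithRecovery(const std::vector<int>& faults, int frames) {
    assert(frames >= 0);
    std::vector<std::unique_ptr<MockSession>> sessions;
    for (int f : faults) sessions.push_back(std::make_unique<MockSession>(f));

    std::vector<int> good(faults.size(), 0);
    for (int i = 0; i < frames; ++i) {
        for (std::size_t s = 0; s < sessions.size(); ++s) {
            if (sessions[s]->acquireFrame()) {
                ++good[s];
            } else {
                // Re-create the faulted session (the replacement never faults).
                sessions[s] = std::make_unique<MockSession>(-1);
            }
        }
    }
    return good;
}
```

With faults `{-1, 3, -1}` and 10 frames, the second session loses only the single faulted frame and the other two sessions are unaffected.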

hello stefan.zegenhagen,

  1. according to the Sensor Software Driver Programming Guide, I would like to confirm that you have specified the connection between VI and CSI correctly.
    for example,
    please connect the three DS90UB954 to CSI-AB, CSI-CD, and CSI-EF respectively; you’ll also need to map the 6 cameras to stream0 ~ stream6.

  2. please also check Approaches for Validating and Testing the V4L2 Driver, and use the V4L2 standard controls to verify the basic functionality.

  3. please also refer to the SerDes Pixel Clock chapter and review your serdes_pix_clk_hz property settings.
    thanks

Dear Jerry,

we followed all the suggestions given above, and capture is working most of the time.

Occasionally, parity errors are reported by the FPD-Link III deserializer, which naturally result in either dropped or defective CSI-2 packets (which of the two is configurable). In either case, the frame cannot be captured successfully and ARGUS just hangs, with no way to recover.

hello stefan.zegenhagen,

may I know which JetPack release you’re working with?
could you please also narrow down the issue by reducing the number of cameras?
thanks

Hi Jerry,

the problem was observed in all TX2 L4T releases between R28.1 and R32.2. Even with a single camera the problem exists.

Kind regards

hello stefan.zegenhagen,

  1. did you also confirm the basic functionality with the V4L2 standard controls?
  2. please also refer to https://elinux.org/Jetson_TX2_Camera_BringUp to enable VI tracing logs to gather more details.

Hi Jerry,

we have brought up the sensors fine. There is no need to ask further questions about that.

There are occasional transmission errors on the FPD-Link which cause defective CSI-2 packets. I can see from the ARGUS logging output that the CSI-2 error was detected and that error is correctly flagged to the application by ARGUS.

All this behaviour is correct and expected in our setup.

What is wrong is that afterwards (after detecting and reporting the error) ARGUS just hangs in a deadlock; it can neither be stopped nor recover from that error. What is also wrong is that not only the failed capture session hangs, but all the others do as well.

hello stefan.zegenhagen,

we have run multi-camera test cases with CSI camera modules without failures.
I assume there’s some timing issue with respect to the Ser/Des chips.
I’ve increased a timeout tolerance; could you please replace the library with the attached one (devtalk1066106_Nov08.tar.gz) for testing?

BTW, it’s based on l4t-r32.2,
please replace /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so and perform a warm-reboot to make the change take effect.
thanks
devtalk1066106_Nov08.tar.gz (2.73 MB)

Hi Jerry,

I tried the updated library, but without success. The following information is output by ARGUS. Note that this is only an example; different detailed CSI/VI error codes may be printed depending on where exactly the CSI-2 error occurs.

Note again that we expect such errors to happen from time to time because the image data is transmitted through a very long serial cable in noisy environments where transmission failures do occur.

NvViErrorDecode Stream 2.1 failed: ts 59260703159168 frame 66 error 2 data 0x000000a2
NvViErrorDecode CaptureError: CsimuxFrameError (2)
NvViErrorDecode See https://wiki.nvidia.com/wmpwiki/index.php/Camera_Debugging/CaptureError_debugging for more information and links to documents.
CsimuxFrameError_Regular : 0x000000a2
    Stream ID                [ 2: 0]: 2
        
    VPR state from fuse block    [ 3]: 0
        
    Frame end (FE)              [ 5]: 1
        A frame end has been found on a regular mode stream.
    FS_FAULT                    [ 7]: 1
        A FS packet was found for a virtual channel that was already in frame. An errored FE packet was injected before FS was allowed through.
SCF: Error Timeout:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function issueCaptures(), line 1130)
SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 116)
SCF: Error Timeout: Worker thread CaptureScheduler frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
SCF: Error BadValue: timestamp cannot be 0 (in src/services/capture/NvViCsiHw.cpp, function waitCsiFrameEnd(), line 711)
SCF: Error BadValue:  (propagating from src/common/Utils.cpp, function workerThread(), line 116)
SCF: Error BadValue: Worker thread ViCsiHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
captureErrorCallback Stream 2.1 capture 2402 failed: ts 59260703159168 frame 66 error 2 data 0x000000a2

SCF: Error Timeout:  (propagating from src/api/Buffer.cpp, function waitForUnlock(), line 637)
SCF: Error Timeout:  (propagating from src/components/CaptureContainerImpl.cpp, function returnBuffer(), line 358)
SCF: Error InvalidState: Capture Scheduler not running (in src/services/capture/CaptureServiceDevice.cpp, function addNewItemToSchedule(), line 908)
SCF: Error InvalidState:  (propagating from src/services/capture/CaptureService.cpp, function addRequest(), line 395)
SCF: Error InvalidState:  (propagating from src/components/stages/MemoryToISPCaptureStage.cpp, function doHandleRequest(), line 137)
SCF: Error InvalidState:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
SCF: Error InvalidState: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)
SCF: Error InvalidState: Capture Scheduler not running (in src/services/capture/CaptureServiceDevice.cpp, function addNewItemToSchedule(), line 908)
SCF: Error InvalidState:  (propagating from src/services/capture/CaptureService.cpp, function addRequest(), line 395)
SCF: Error InvalidState:  (propagating from src/components/stages/SensorCaptureStage.cpp, function doHandleRequest(), line 87)
SCF: Error InvalidState:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
SCF: Error InvalidState: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)
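As a cross-check of the decoded fields in the log above, the data word 0x000000a2 can be unpacked by hand. The following is a minimal sketch that uses only the bit positions printed by NvViErrorDecode; the struct and function names are made up for illustration and are not part of any NVIDIA API, and bits not listed in the log are left uninterpreted.

```cpp
#include <cassert>
#include <cstdint>

// Fields of the CsimuxFrameError_Regular word, limited to the bit positions
// shown in the log output. Names are illustrative, not an official layout.
struct CsimuxFrameError {
    unsigned streamId;  // bits [2:0]
    bool vprState;      // bit  [3]
    bool frameEnd;      // bit  [5]
    bool fsFault;       // bit  [7]
};

// Extracts the fields from the raw 32-bit data word.
inline CsimuxFrameError decodeCsimuxFrameError(std::uint32_t data) {
    CsimuxFrameError e;
    e.streamId = static_cast<unsigned>(data & 0x7u);
    e.vprState = ((data >> 3) & 1u) != 0;
    e.frameEnd = ((data >> 5) & 1u) != 0;
    e.fsFault  = ((data >> 7) & 1u) != 0;
    return e;
}
```

For 0xa2 (binary 1010 0010) this gives stream ID 2, VPR state 0, FE set and FS_FAULT set, which matches the decoded output in the log.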

The application receives the error for the stream and tries to stop all captures:

iCaptureSession->stopRepeat();
iCaptureSession->cancelRequests();
iCaptureSession->waitForIdle();

The waitForIdle() call does not succeed for any of the captures. It blocks indefinitely.
When the call to waitForIdle() is removed and capture session teardown simply continues:

<delete FrameConsumer object>
<delete Request object>
<delete EGLOutputStream object>
<delete CaptureSession object>

ARGUS hangs indefinitely at the point where the CaptureSession is deleted, for every active capture session.

Since the capture sessions are not deleted and the hardware is not released, it is impossible to restart any capture. The application needs to stop itself via calling the C function exit(). Only then do I see kernel messages about the V4L devices being stopped and the image sensors being shut down.
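The "exit() from another thread" escape hatch can be sketched as a generic watchdog around a call that may never return. This is a minimal sketch using only the C++ standard library (C++14); `callWithTimeout` is a hypothetical helper name, and it assumes the wrapped call does not throw. It does not fix the hang: the helper thread stays blocked, so the caller still has to escalate, for example by terminating the process.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Runs blockingCall on a detached helper thread and waits at most `timeout`.
// Returns true if the call finished in time, false if it is still blocked.
// On timeout the helper thread is NOT cancelled; it remains blocked.
bool callWithTimeout(std::function<void()> blockingCall,
                     std::chrono::milliseconds timeout) {
    assert(blockingCall);
    // The promise is shared so it outlives this function if we time out.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, call = std::move(blockingCall)] {
        call();             // may block forever
        done->set_value();  // reached only if the call returns
    }).detach();

    return finished.wait_for(timeout) == std::future_status::ready;
}
```

In the application, the callable would wrap the waitForIdle() call from the snippet above; if `callWithTimeout` returns false, the process can log the condition and call exit() deliberately instead of hanging.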

hello stefan.zegenhagen,

hold on, we should pay more attention to the timing delta here.
the camera pipeline is timing sensitive; you should also evaluate the gap between your software sending the capture request and the actual arrival of the MIPI signaling.

I suggest you check Camera Software Development Solution.
could you please also enable the Infinite Timeout Support for verification?
thanks

Hi Jerry,

We already have the infinite timeout enabled. We had to because of the delay between telling the image sensor to leave STANDBY mode and data actually flowing, but there’s a different thread for that (#1056919).

So for the last time: the captures are running fine, except for the occasional FPD-Link III transmission failures.

Kind regards.

hello stefan.zegenhagen,

we would need more clues to dig into this.
please enable VI tracing logs to gather more details when the issue happens.
for example,

echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 30720 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/tegra_rtcpu/enable
echo 1 > /sys/kernel/debug/tracing/events/freertos/enable
echo 2 > /sys/kernel/debug/camrtc/log-level
echo 1 > /sys/kernel/debug/tracing/events/camera_common/enable
echo > /sys/kernel/debug/tracing/trace
cat /sys/kernel/debug/tracing/trace

BTW,
have you tried configuring the system for performance mode for verification?
please refer to [Power Management for Jetson TX2 Series Devices] and check the Maximizing Jetson TX2 Performance chapter for the commands.
thanks

Dear Jerry,

please stick to the original topic of this thread, which is that ARGUS cannot be stopped/restarted once a capture error has happened. This thread is not about eliminating FPD-Link III transmission errors, which no one can do.

It should be easy to reproduce this in the lab even without the logging you request: NVIDIA is likely to have test equipment that can inject arbitrary errors.

Kind regards.

Just a note: ARGUS also hangs while stopping capture sessions when infinite timeout support is disabled and a capture timeout is provoked. That should be easy to test just by not taking the image sensor out of STANDBY; not even CSI-2 error injection capability is needed.

hello stefan.zegenhagen,

there’s a contradiction:
in theory, Argus expects the MIPI signaling to be sent continuously without failures.

your use case sends the image data through FPD-Link III over a very long serial cable in noisy environments, where transmission failures occasionally occur.
I will put this up for internal discussion and will update here after we have conclusions.
thanks

Hi Jerry,

thank you for putting this topic to discussion.

Since error detection and reporting are already built into ARGUS, graceful handling of errors would only be the next logical step.

Kind regards.

I’m curious if anything came from the internal discussion? We are facing the same issues: A.) a single error brings down all capture sessions and B.) resetting a capture session after the error results in the program hanging indefinitely. For a production system, we really need one of the behaviors stefan.zegenhagen mentioned in the first post:

  1. The CaptureSession delivers the already recorded frames. Instead of the defective frame an error is reported and afterwards the next valid frame can be captured again without any further intervention.
  2. Alternatively, the CaptureSession reports an error and becomes invalid. It is the responsibility of the application to destroy and re-create the faulted CaptureSession to resume capture. All other CaptureSessions remain valid and operational.

Btw: expected solution #2 would only be valid if the time required to re-create the capture session is on the order of a few frame times, possibly less than 100 ms.

Currently we find that restarting ARGUS capture sessions takes 2-3 seconds, during which no frames can be sampled. This is far too much for our application.

hi all,

please note that Argus expects the MIPI signaling to be sent continuously without failures. an invalid state is reported when incomplete data is received.
we currently do not have a schedule for an Argus error recovery mechanism;
the immediate workaround is to restart the camera application and also the nvargus-daemon service.
thanks

Hi Jerry,

Is there a shorter-term fix planned for the infinite hang in the CaptureSession destructor? That seems like a very simple issue to resolve. Having that would let the application shut down the CaptureSessions and restart them instead of having to kill the entire application. Still not a great solution, but just having that would help us.
