Nvarguscamerasrc stops recording - possible deadlock

Hello!

I work with the Nano on a camera recorder product which uses an IMX274 image sensor, with an embedded Linux build using Yocto and L4T R32.4.2. Internally, our application uses GStreamer with nvarguscamerasrc.

We’ve observed through our own testing and OEM customer reports that once every ~400 hours of recording (in individual 1 hour recording sessions), GStreamer simply stops streaming data. When we inspect dmesg and the nvargus-daemon journal logs, no messages appear at the time of the failure - it is silent. We’ve carefully worked through our application’s running state: all application and GStreamer threads are active, and the issue appears to lie entirely upstream, in nvargus-daemon or further.
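Because nothing is logged when the stream stops, the stall has to be detected from the absence of frames. A minimal sketch of that idea follows. It assumes a hypothetical `FrameWatchdog` class whose `on_frame()` is called wherever the application sees buffers (for example from a pad probe on the source); the 5-second timeout is an arbitrary illustration, not a measured value.

```python
import time


class FrameWatchdog:
    """Flags a stall when no frame has arrived within `timeout_s` seconds.

    Hypothetical helper for illustration; not part of GStreamer or Argus.
    """

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_frame = clock()

    def on_frame(self):
        # Call this for every buffer that leaves the camera source.
        self._last_frame = self._clock()

    def stalled(self):
        # True once the gap since the last frame exceeds the timeout.
        return self._clock() - self._last_frame > self.timeout_s
```

A supervising thread could poll `stalled()` periodically and trigger whatever recovery the product uses once it returns True.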

We’ve noted that nvargus-daemon logs several errors when our application detaches on termination. These messages make us suspect that a deadlock or other locking issue had already occurred in nvargus-daemon, and that this was the true root cause of video not being received. See below (note waitForIdleLocked):

[ 3481.325162] application nvargus-daemon[3349]: CAM: serial no file already exists, skips storing againSCF: Error Timeout: (propagating from src/components/CaptureContainerImpl.cpp, function assignAllBuffersFromStream(), line 230)
[ 3481.325162] application nvargus-daemon[3349]: SCF: Error Timeout: (propagating from src/components/stages/CCDataSetupStage.cpp, function doHandleRequest(), line 68)
[ 3481.325162] application nvargus-daemon[3349]: SCF: Error Timeout: (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
[ 3481.325162] application nvargus-daemon[3349]: SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)
[ 3481.327066] application nvargus-daemon[3349]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
… Tens of thousands of repetitions of the SCF error …
[ 3573.042583] application systemd-journald[1669]: Suppressed 43983 messages from nvargus-daemon.service
[ 3573.042583] application nvargus-daemon[3349]: waitForIdleLocked remaining request 104133
[ 3573.042583] application nvargus-daemon[3349]: SCF: Error Timeout: waitForIdle() timed out (in src/api/Session.cpp, function waitForIdleLocked(), line 920)
[ 3573.042583] application nvargus-daemon[3349]: (Argus) Error Timeout: (propagating from src/api/CaptureSessionImpl.cpp, function destroy(), line 166)
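To spot this state automatically, the journal can be scanned for the signatures in the excerpt above. The sketch below is only an illustration: the function name `scan_journal` and the choice of three signatures are assumptions, and the patterns are copied from the log lines quoted in this post rather than from any documented Argus error list.

```python
import re

# Signatures taken verbatim from the log excerpt above.
SIGNATURES = {
    "critical_failure": re.compile(r"Session has suffered a critical failure"),
    "wait_for_idle_timeout": re.compile(r"waitForIdle\(\) timed out"),
    "suppressed": re.compile(r"Suppressed (\d+) messages from nvargus-daemon"),
}


def scan_journal(lines):
    """Count each signature; for 'suppressed', sum the suppressed message totals."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if not match:
                continue
            # journald reports rate-limited messages as a single number.
            counts[name] += int(match.group(1)) if name == "suppressed" else 1
    return counts
```

The input could be the lines of `journalctl -u nvargus-daemon` output; a non-zero `wait_for_idle_timeout` count would indicate the teardown failure shown above.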

Additionally, once nvargus-daemon enters this state it appears that it is often broken until it is restarted. We’ve observed two typical behaviors:

  1. nvargus-daemon crashes with a SIGSEGV.
  2. nvargus-daemon errors out on new GStreamer nvarguscamerasrc connections until it is restarted. Notably, restarting nvargus-daemon without rebooting the device causes the video to recover, which leads us to believe the issue lies solely within the nvargus-daemon recovery routines. We can reproduce the failed recovery by running GStreamer from the command line. Here’s an example of the error we see when trying to reattach without restarting nvargus-daemon:

system:~# gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=(int)3840, height=(int)2160, format=(string)NV12, framerate=(fraction)24/1' ! omxh264enc control-rate=2 bitrate=30000000 iframeinterval=6 ! 'video/x-h264, stream-format=(string)byte-stream' ! h264parse ! matroskamux ! filesink location=/home/default/test1.mkv
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Framerate set to : 24 at NvxVideoEncoderSetParameterNvMMLiteOpen : Block : BlockType = 4
===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4
Error generated. /dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvarguscamera/gstnvarguscamerasrc.cpp, execute:568 Failed to create CaptureSession
H264: Profile = 66, Level = 40
Got EOS from element "pipeline0".
Execution ended after 0:00:00.061002761
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
system:~#

Restarting nvargus-daemon, waiting 15 seconds, and then issuing the command above again results in successful video recording.
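The manual recovery described above can be sketched as follows. This is an illustration under stated assumptions: `restart_argus_and_retry` and the `start_pipeline` callable are hypothetical names, the unit name is taken from the journald line in the log excerpt, and the 15-second settle time is simply the wait that worked in our manual test.

```python
import subprocess
import time


def restart_argus_and_retry(start_pipeline, run=subprocess.run, sleep=time.sleep,
                            settle_s=15, unit="nvargus-daemon.service"):
    """Restart the daemon, wait for it to settle, then retry the pipeline.

    `start_pipeline` is a hypothetical callable that returns True on success.
    `run` and `sleep` are injectable so the sequence can be tested off-target.
    """
    # check=True raises CalledProcessError if systemctl reports a failure.
    run(["systemctl", "restart", unit], check=True)
    sleep(settle_s)
    return start_pipeline()
```

This only automates the workaround; it does not avoid the lost recording time during the restart.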

We’ve observed several other forum threads regarding nvargus-daemon recovery issues, with most recommending a restart of nvargus-daemon as the solution. For example:

Unfortunately, our customer is quite displeased with this recovery approach due to its lack of stability and loss of recording time.

So, we have two questions:

  1. We’ve reviewed the release notes from 32.4.2 onwards (https://developer.nvidia.com/embedded/jetson-linux-archive), and fix 200661319 from 32.5 seems like it may be related to the issues we’ve observed above. Can NVIDIA provide more details around what was fixed here? It looks like the discussion thread was here: GStreamer lockup with H.264 encoder from nvarguscamerasrc
  2. Have there been any other fixes matching this description that we have overlooked?

This is a very old release. Please move to JetPack 4.6.4 / L4T 32.7.4 for verification if that’s possible.

BTW,
please also see: JetPack 4 Reaches End of Life

Noted.
Are you able to provide any further detail on the bugfix 200661319?

It’s a synchronization issue with g_queue.
The fix adds error checks to the queue operations, making them thread-safe.
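For readers unfamiliar with this class of bug: GLib’s GQueue is not thread-safe on its own, so callers sharing one across threads need their own locking. The sketch below illustrates the general pattern of a locked queue with checked operations. It is not NVIDIA’s actual patch, and the `CheckedQueue` class is a hypothetical stand-in written in Python rather than C.

```python
import threading
from collections import deque


class CheckedQueue:
    """A queue whose operations are serialized by a lock and validate state.

    Illustrative only; the real fix lives in the nvarguscamerasrc plugin.
    """

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def push(self, item):
        if item is None:
            return False          # reject invalid entries instead of queuing them
        with self._lock:
            self._items.append(item)
        return True

    def pop(self):
        with self._lock:
            if not self._items:
                return None       # an empty queue is reported, not dereferenced
            return self._items.popleft()
```

The point of the error checks is that a consumer racing with a producer gets a clear "nothing available" result rather than operating on an inconsistent queue.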
