How to make Argus in Jetson 35.2.1 recover after a corrupted frame?

We have been dealing with Argus stability issues for a long time. I gave the patches a try, but they did not solve the problem: nvargus still gets locked in InvalidState and requires a restart of the daemon. Attached is the latest log from 5.1.1+patches.
nvargus-daemon.txt (14.2 KB)

Update: After more testing with the patches applied, I observed some failures that did result in Argus exiting and restarting, so it looks like there is an improvement, but it does not seem to cover all cases. Also, the error messages seem to be more descriptive.
Here are 2 cases that triggered an Argus restart:

  1. While running 4 camera streams with gstreamer, I terminated one with Ctrl-C.
Jun 10 14:19:21 orin nvargus-daemon[1273]: === gst-launch-1.0[1712]: CameraProvider destroyed (0xffffa472d8c0)=== gst-launch-1.0[1712]: Connection closed (FFFFAA93B900)free(): invalid next size (fast)
Jun 10 14:19:22 orin systemd[1]: nvargus-daemon.service: Main process exited, code=killed, status=6/ABRT
  2. The stream fails to start:
Jun 10 15:01:34 orin nvargus-daemon[2734]: === gst-launch-1.0[3357]: Connection established (FFFF577FA900)=== gst-launch-1.0[3357]: CameraProvider initialized (0xfffecc000c20)CAM: serial no file already exists, skips storing againLSC: LSC surface is not based on full res!
Jun 10 15:01:34 orin nvargus-daemon[2734]: corrupted size vs. prev_size
Jun 10 15:01:34 orin systemd[1]: nvargus-daemon.service: Main process exited, code=killed, status=6/ABRT

Note: These failures are not easily reproducible. They happen only once in a while.
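
Since these failures are intermittent, it may help to count them from a saved journal rather than by eye. Below is a minimal sketch, assuming the journal lines look like the two excerpts above; the function name `find_daemon_aborts` and the list of heap messages are illustrative, not part of any NVIDIA tool.

```python
import re

# Matches the systemd line that reports the daemon exit, for example:
# "Jun 10 14:19:22 orin systemd[1]: nvargus-daemon.service: Main process exited, code=killed, status=6/ABRT"
EXIT_RE = re.compile(
    r"^(?P<time>\w{3} +\d+ [\d:]{8}) \S+ systemd\[\d+\]: "
    r"nvargus-daemon\.service: Main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\S+)"
)

# glibc heap-check messages seen in this thread just before the abort
# (assumption: extend this tuple with whatever else shows up in your logs).
HEAP_MESSAGES = (
    "free(): invalid next size",
    "corrupted size vs. prev_size",
)


def find_daemon_aborts(lines):
    """Return one dict per daemon exit, with the last heap message seen before it."""
    aborts = []
    last_heap_message = None
    for line in lines:
        if "nvargus-daemon[" in line:
            for message in HEAP_MESSAGES:
                if message in line:
                    last_heap_message = message
        match = EXIT_RE.match(line)
        if match:
            aborts.append({
                "time": match.group("time"),
                "status": match.group("status"),
                "heap_message": last_heap_message,
            })
            last_heap_message = None
    return aborts
```

Dividing the number of returned entries by the number of test runs gives a rough failure rate to report.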

hello Agtonomy,

This looks the same as Topic 243051.
Since there is a pre-built update that addresses the stability issue,
please open a new thread to follow up on your use case. Please share the complete repro steps and the failure rate for reference. You may also leave the topic ID here for better tracking.

Hi,
The new versions of libnvargus.so and libnvscf.so appear to be worse than the previous ones.
They appear to always crash on a timeout (this was already reported in JP5.1 nvarguscamera doesn't recover from single NVCSI failure - #57 by JerryChang).
The simplest way to reproduce it with any sensor, such as the IMX274, is to run a long capture using:
nvargus_nvraw --c 0 --file out --format "raw,jpg" --skipframes 1000
and then stop streaming:
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming
Argus always crashes somewhere inside malloc or free:
Sometimes it is:
Stack trace of thread 3733:
#0 0x0000ffff8d7b7d78 __GI_raise (libc.so.6 + 0x33d78)
#1 0x0000ffff8d7a4aac __GI_abort (libc.so.6 + 0x20aac)
#2 0x0000ffff8d7f1f40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff8d7f9344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff8d7f9c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff8d7fafac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff8d0fe548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
#7 0x0000ffff8d0fe60c _ZN3scf17EGLStreamProducerD0Ev.localalias (libnvscf.so + 0x14560c)
#8 0x0000ffff8d01cc70 _ZN3scf7Session19destroyOutputStreamEPNS_13IOutputStreamE (libnvscf.so + 0x63c70)
#9 0x0000ffff8a4187fc _ZThn8_N5Argus19EGLOutputStreamImpl7destroyEv (libnvargus.so + 0x8c7fc)
#10 0x0000ffff8da13d08 n/a (libnvargus_socketserver.so + 0x11cd08)
#11 0x0000ffff8da1423c n/a (libnvargus_socketserver.so + 0x11d23c)
#12 0x0000ffff8da2b334 n/a (libnvargus_socketserver.so + 0x134334)
#13 0x0000ffff8da2b7b8 n/a (libnvargus_socketserver.so + 0x1347b8)
#14 0x0000ffff8da2b9e8 n/a (libnvargus_socketserver.so + 0x1349e8)
#15 0x0000ffff8da2ac98 n/a (libnvargus_socketserver.so + 0x133c98)
#16 0x0000ffff8da2ae74 n/a (libnvargus_socketserver.so + 0x133e74)
#17 0x0000ffff8cb3f624 start_thread (libpthread.so.0 + 0x7624)
#18 0x0000ffff8d85549c thread_start (libc.so.6 + 0xd149c)
Sometimes:

Stack trace of thread 3692:
#0 0x0000ffff82201d78 __GI_raise (libc.so.6 + 0x33d78)
#1 0x0000ffff821eeaac __GI_abort (libc.so.6 + 0x20aac)
#2 0x0000ffff8223bf40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff82243344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff82243c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff82244fac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff81b48548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
#7 0x0000ffff81b4860c _ZN3scf17EGLStreamProducerD0Ev.localalias (libnvscf.so + 0x14560c)
#8 0x0000ffff81a66c70 _ZN3scf7Session19destroyOutputStreamEPNS_13IOutputStreamE (libnvscf.so + 0x63c70)
#9 0x0000ffff7ee627fc _ZThn8_N5Argus19EGLOutputStreamImpl7destroyEv (libnvargus.so + 0x8c7fc)
#10 0x0000ffff8245dd08 n/a (libnvargus_socketserver.so + 0x11cd08)
#11 0x0000ffff8245e23c n/a (libnvargus_socketserver.so + 0x11d23c)
#12 0x0000ffff82475334 n/a (libnvargus_socketserver.so + 0x134334)
#13 0x0000ffff824757b8 n/a (libnvargus_socketserver.so + 0x1347b8)
#14 0x0000ffff824759e8 n/a (libnvargus_socketserver.so + 0x1349e8)
#15 0x0000ffff82474c98 n/a (libnvargus_socketserver.so + 0x133c98)
#16 0x0000ffff82474e74 n/a (libnvargus_socketserver.so + 0x133e74)
#17 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#18 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3179:
#0 0x0000ffff82295f08 __GI___poll (libc.so.6 + 0xc7f08)
#1 0x0000ffff7fe98340 n/a (libcuda.so.1 + 0x23a340)
#2 0x0000ffff7fe88dd4 n/a (libcuda.so.1 + 0x22add4)
#3 0x0000ffff7fe915c8 n/a (libcuda.so.1 + 0x2335c8)
#4 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#5 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3709:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff8190a894 n/a (libnvos.so + 0xa894)
#2 0x0000ffff81909098 NvOsSemaphoreWaitTimeout (libnvos.so + 0x9098)
#3 0x0000ffff81af2000 _ZN3scf23CameraEventWorkerThread12workerThreadEPS0_ (libnvscf.so + 0xef000)
#4 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#5 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#6 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3724:
#0 0x0000ffff82291bf0 __GI___libc_read (libc.so.6 + 0xc3bf0)
#1 0x0000ffff7f61be7c n/a (libnvodm_imager.so + 0x41e7c)
#2 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#3 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#4 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3694:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff81908eb0 NvOsConditionWait (libnvos.so + 0x8eb0)
#2 0x0000ffff819d11f4 _ZNK13nvcamerautils17ConditionVariable4waitERKNS_5MutexE (libnvcamerautils.so + 0xb1f4)
#3 0x0000ffff81a875ec _ZN3scf14FiberScheduler12getNextFiberEPNS_15ExecutionThreadEPPNS_5FiberE (libnvscf.so + 0x845ec)
#4 0x0000ffff81a829ac _ZN3scf15ExecutionThread3runEv (libnvscf.so + 0x7f9ac)
#5 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#6 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#7 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3727:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff81908eb0 NvOsConditionWait (libnvos.so + 0x8eb0)
#2 0x0000ffff819d11f4 _ZNK13nvcamerautils17ConditionVariable4waitERKNS_5MutexE (libnvcamerautils.so + 0xb1f4)
#3 0x0000ffff81ab8f1c _ZN3scf10AsyncStage18processNextRequestEv (libnvscf.so + 0xb5f1c)
#4 0x0000ffff81ab9440 _ZN3scf10AsyncStage14threadFunctionEPv (libnvscf.so + 0xb6440)
#5 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#6 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#7 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 1078:
#0 0x0000ffff822a0028 __libc_accept (libc.so.6 + 0xd2028)
#1 0x0000ffff824744f4 n/a (libnvargus_socketserver.so + 0x1334f4)
#2 0x0000ffff8247469c n/a (libnvargus_socketserver.so + 0x13369c)
#3 0x0000aaaae9c007f4 n/a (nvargus-daemon + 0x7f4)
#4 0x0000ffff821eee10 __libc_start_main (libc.so.6 + 0x20e10)
#5 0x0000aaaae9c00850 n/a (nvargus-daemon + 0x850)
#6 0x0000aaaae9c00850 n/a (nvargus-daemon + 0x850)

hello jhnlmn,

Please note that the pre-built update must be applied on top of the JP-5.1.1/l4t-r35.3.1 release.
See also Topic 243051, comment #28, for a brief description of the test steps.

Yes, I am testing with 35.3.1 now and I can reproduce the crash 100% with these commands:
nvargus_nvraw --c 0 --file out --format "raw,jpg" --skipframes 1000
Then, in another shell:
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming

It is possible that this is the same crash that casperlyngesen.mogensen reported in JP5.1 nvarguscamera doesn't recover from single NVCSI failure - #49 by JerryChang, except that I can reproduce it consistently.

Did you try it yourself?

hello jhnlmn,

I don’t understand…
What is the real use case for abruptly stopping sensor streaming while nvargus_nvraw is running?

First of all, software must not crash. Ever.
If Argus is seen crashing under any condition, it cannot be accepted into a mission-critical system. For that reason we are now planning to abandon Argus and maybe replace Jetson with a more reliable platform.
Second,
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming
sleep x
echo 1 | sudo tee /sys/kernel/debug/camera-video0/streaming
is a very good test: it suspends and then resumes camera streaming,
which simulates a situation where some frames are missing due to noise or other reasons.
Note that v4l2 survives this test, but Argus does not.
Once Argus passes this basic test, I will start more complicated testing by corrupting specific parts of MIPI packets to see how well the system handles it.
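
To make the suspend/resume test repeatable, the two `echo` commands can be wrapped in a small script. This is only a sketch under the assumption that writing `0` and `1` to the debugfs node behaves exactly as in the commands above; `pulse_streaming` and its parameters are hypothetical names chosen for illustration.

```python
import time


def pulse_streaming(control_path, off_seconds, cycles=1, on_seconds=0.0):
    """Write 0, wait, then write 1 to the streaming control node, `cycles` times."""
    for _ in range(cycles):
        with open(control_path, "w") as node:
            node.write("0\n")  # suspend streaming, like `echo 0 | sudo tee ...`
        time.sleep(off_seconds)
        with open(control_path, "w") as node:
            node.write("1\n")  # resume streaming, like `echo 1 | sudo tee ...`
        time.sleep(on_seconds)
```

Run as root while a capture is active, for example `pulse_streaming("/sys/kernel/debug/camera-video0/streaming", off_seconds=0.1, cycles=10, on_seconds=2.0)`. Varying `off_seconds` gives a rough idea of how long an interruption the pipeline tolerates.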

hello jhnlmn,

I meant… why are you testing with the nvargus_nvraw application?
It is a tool for capturing frames; it does not have error handling capability.

Have you tested the steps below with the nvarguscamerasrc plugin?
step1) launch a gst pipeline to enable camera preview
$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),framerate=30/1,format=NV12' ! nvvidconv ! xvimagesink
step2) send commands on the terminal to shut down the stream,
# cd /sys/kernel/debug/camera-video0
# echo 0 > streaming

Because it is not nvargus_nvraw that crashes, but nvargus-daemon,
more specifically the new libnvscf.so from Topic243051_Jun05.7z.
Take a look at this coredump:
#2 0x0000ffff81efaf40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff81f02344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff81f02c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff81f03fac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff81807548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
This means that the new libnvscf.so caused heap corruption, a double free, or some similar error.
The old libnvscf.so (from 35.3.1) did not do that.
This should have been trivial for your developers to catch before anybody saw it.
Thank you

Have you tested the steps below with the nvarguscamerasrc plugin?

No. Even though I do not see crashes with nvarguscamerasrc, nvargus_nvraw still fails after even a brief MIPI corruption, with these messages:

Jun 28 23:30:15 orin2 kernel: [RCE] VM0 deactivating.VM0 activating.ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=770161270944 sof_ts=770163066144 gerror_code=2 gerror_data=400262 notify_bits=20010"
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error Timeout: Sending critical error event for Session 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/api/Session.cpp, function sendErrorEvent(), line 992)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 556 draining session frameStart events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 532)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 555 draining session frameEnd events 2
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: PowerServiceCore:handleRequests: timePassed = 2628
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 557 draining session frameStart events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 532)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 556 draining session frameEnd events 2
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 557 draining session frameEnd events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776
Jun 28 23:30:20 orin2 nvargus-daemon[1114]: PowerServiceCore:handleRequests: timePassed = 2404
Jun 28 23:30:20 orin2 kernel: bwmgr API not supported

Note that if I read video directly from /dev/video0, then it tolerates corruption much better.
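
A log like the one above is easier to compare between runs once it is reduced to counts per error kind and reporting function. The following is a rough sketch that assumes the `SCF: Error <Kind>: ...` and `(in <file>, function <name>(), line <n>)` layout seen in this log; `summarize_scf_errors` is a made-up helper name.

```python
import re
from collections import Counter

ERROR_RE = re.compile(r"SCF: Error (?P<kind>\w+):")
LOCATION_RE = re.compile(r"\(in \S+, function (?P<func>\w+)\(\), line \d+\)")


def summarize_scf_errors(lines):
    """Count SCF errors by (error kind, reporting function)."""
    counts = Counter()
    pending_kind = None  # error whose "(in ..., function ...)" part is not seen yet
    for line in lines:
        error = ERROR_RE.search(line)
        location = LOCATION_RE.search(line)
        if error:
            if pending_kind is not None:
                counts[(pending_kind, None)] += 1  # previous error had no location
            pending_kind = error.group("kind")
        # The location is either on the same line or on the following one.
        if location and pending_kind is not None:
            counts[(pending_kind, location.group("func"))] += 1
            pending_kind = None
    if pending_kind is not None:
        counts[(pending_kind, None)] += 1
    return counts
```

For the excerpt above this would group, for instance, all `InvalidState` timeouts raised from `waitCsiFrameStart` and `waitCsiFrameEnd` separately from the `BadParameter` errors raised from `dispose`.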

The second problem is that Argus does not appear to check the image CRC.
If corruption affects only pixels and not the frame packet headers, then the corrupted image is shown on the screen.
There must be a way to check the CRC and reject corrupted frames.

may I know the details for testing with nvargus_nvraw application?

nvargus_nvraw is the best app for Argus testing because it returns a raw image, it returns a still image, and it is easier to run than an entire gstreamer pipeline.
However, on 6/20/2023 you requested that I retest with gstreamer, and I did; the error that I posted above was with gstreamer (or with other Argus sample apps).

how you interrupt the stream? are you sending software commands on the terminal to simulate it?

As I wrote at the beginning, the easiest way to simulate MIPI noise corruption is to briefly short some MIPI wires, such as D+/D-. The simplest test is to briefly touch exposed MIPI wires on some connector with a screwdriver. Depending on where in the MIPI frame the corruption happens, you may see garbage on the screen (without capture interruption), you may see some MIPI rows lost (with the bottom portion of the frame shifting up), or you may see a complete loss of the frame. And then Argus fails (but v4l survives this test).
The next test is to attach a transistor to those wires and generate a short pulse synchronized with the camera, to corrupt only a specific part of the frame for a specified duration, or to corrupt only the n-th frame or several frames in a row.

Let me clarify:
Argus and the camera driver expect the sensor to stream frames continuously, without failures.

The error handling mechanism is there to keep the system alive.
Since the signaling is intermittent, the camera pipeline reports timeout failures. Argus will report this via EVENT_TYPE_ERROR, and the application has to shut down.

Argus and the camera driver expect the sensor to stream frames continuously, without failures.

This is the wrong expectation. It can only work with some hypothetical idealized camera. With real-world cameras, occasional corruption is unavoidable. That is why the MIPI standard provides a CRC checksum for every packet, so that the receiver can simply drop a corrupted frame and continue receiving good frames.
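
To illustrate the kind of per-packet check meant here, below is a small sketch of a 16-bit CRC over a packet payload. The parameters (polynomial x^16 + x^12 + x^5 + 1, initial value 0xFFFF, bits processed LSB first, no final XOR) are my understanding of the CSI-2 payload checksum and should be verified against the MIPI specification; the function names are illustrative only.

```python
def crc16_ccitt_lsb_first(data, init=0xFFFF):
    """CRC-16 with polynomial x^16 + x^12 + x^5 + 1, processed LSB first.

    Assumption: these parameters match the CSI-2 payload checksum;
    check the specification before relying on this.
    """
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x8408  # 0x8408 is 0x1021 bit-reversed
            else:
                crc >>= 1
    return crc


def payload_is_intact(payload, received_crc):
    """Return True if the CRC computed over the payload matches the received one."""
    return crc16_ccitt_lsb_first(payload) == received_crc
```

A receiver that applies such a check per packet can mark the frame as corrupted when any packet fails, and then either drop the frame or pass it on with an error flag.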

The error handling mechanism is there to keep the system alive.

But Argus does not keep the system alive. It does everything wrong: it does NOT check the CRC at all, thus allowing a partially corrupted frame to be displayed, and it closes camera streaming in case of other corrupted frames, thus killing the system.

As I wrote from the beginning, the correct behavior is:

  1. Check CRC and drop partially corrupted frame (or return it to the app with indication of corruption)
  2. Continue receiving good frames

The decision to stop streaming after corrupted frames must belong to the application, not Argus, because every application has its own requirements for the acceptable frame loss ratio.
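
As an illustration of what such an application-side decision could look like, here is a minimal sketch of a sliding-window loss policy. It assumes the capture layer reports each frame as good or bad; the class name `FrameLossPolicy` and its thresholds are hypothetical and not part of Argus.

```python
from collections import deque


class FrameLossPolicy:
    """Sliding-window check: keep streaming while the bad-frame ratio stays acceptable."""

    def __init__(self, window=100, max_loss_ratio=0.05):
        self.max_loss_ratio = max_loss_ratio
        self._results = deque(maxlen=window)  # True = good frame, False = bad or lost

    def record(self, frame_ok):
        """Record one frame result; return True if streaming should continue."""
        self._results.append(bool(frame_ok))
        if len(self._results) < self._results.maxlen:
            return True  # not enough history to judge yet
        lost = self._results.count(False)
        return lost / len(self._results) <= self.max_loss_ratio
```

With `window=10` and `max_loss_ratio=0.2`, two bad frames in the last ten are tolerated and a third one makes `record()` return False, at which point the application (not the capture library) decides whether to stop, restart, or raise an alarm.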

firmware side has already checked the packet data CRC, see the TRM for PH_WC and PF_CRC

No, it does not. If corruption affects only pixels and not the frame structure, then the firmware ignores it and the displayed image is corrupted. Sometimes row headers are corrupted, and then the rows below are shifted and discolored, but capture does not stop. Only if the entire frame is broken does capture stop. I attached a few images; it took only a few minutes to reproduce and capture them.