How to make Argus in Jetson 35.2.1 recover after a corrupted frame?

Hi,
This is a follow up to

We really need the Jetson MIPI camera to work in a mission-critical system, and we cannot have Argus abort after every corrupted frame.
Corrupted frames are unavoidable in most modern robotic systems because of EMI noise from motors or instruments.

I see that you are giving some people custom versions of camera-rtcpu-t234-rce.img with special features.
I wonder whether it is possible to make a version of camera-rtcpu-t234-rce.img, which will:

  1. Check CRC and reject corrupted frames
  2. Simply drop corrupted frames without reporting errors to Argus, so that it does not abort the capture session as it does now.
    Or maybe there are settings in the existing FW that can achieve the same?
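
To make request 1 concrete: as far as I understand the MIPI CSI-2 specification, every long packet ends with a 16-bit checksum over the payload (polynomial x^16 + x^12 + x^5 + 1, seed 0xFFFF, bits processed LSB first). The sketch below only illustrates the check-and-drop behaviour I am asking for. `csi2_crc16` and `keep_packet` are hypothetical helper names, and on Jetson such a check would have to live in NVCSI/VI or the RCE firmware, because as far as I can tell Argus does not hand packet footers to user space.

```python
def csi2_crc16(payload: bytes) -> int:
    """CRC-16, reflected polynomial 0x8408 (x^16 + x^12 + x^5 + 1), seed 0xFFFF."""
    crc = 0xFFFF
    for byte in payload:
        crc ^= byte
        for _ in range(8):
            # Shift out the LSB; apply the polynomial when that bit was set.
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc


def keep_packet(payload: bytes, footer_crc: int) -> bool:
    """True if the received checksum matches the payload, False if it should be dropped."""
    return csi2_crc16(payload) == footer_crc


if __name__ == "__main__":
    line = bytes(range(16))                           # stand-in for one line of pixel data
    footer = csi2_crc16(line)                         # checksum as the sensor would send it
    corrupted = bytes([line[0] ^ 0x01]) + line[1:]    # flip a single bit
    print(keep_packet(line, footer))                  # intact payload
    print(keep_packet(corrupted, footer))             # corrupted payload
```

A frame would then be dropped silently if any of its packets fails the check, instead of the error being escalated to the capture session.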

Note: the simplest way to reproduce the failure is to briefly short the D+/D- lines of the MIPI bus together or to ground. (I was told by sensor makers that this is normally a safe thing to do.)
Then you will see corrupted images on the screen, and then Argus will abort with errors:
Mar 30 15:39:04 orin2 kernel: [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=54628931184736 sof_ts=54628998768736 gerror_code=2 gerror_data=a2 notify_bits=0"
Mar 30 15:39:09 orin2 nvargus-daemon[8753]: === argus_multi_camera[8963]: CameraProvider initialized (0xffff488d9bb0)SCF: Error InvalidState: Timeout waiting on frame start sensor guid 1, capture sequence ID = 657 (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 514)
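
For anyone comparing their own logs against this one, here is a small sketch that pulls the key=value fields out of the RCE error line. `parse_rce_fields` is a hypothetical helper, and treating `ts` and `sof_ts` as decimal counters in the same unit is my assumption; I do not know how the firmware defines `gerror_code`, `gerror_data`, or `notify_bits`, so they are kept as raw strings.

```python
import re

RCE_LINE = (
    'Mar 30 15:39:04 orin2 kernel: [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] '
    '"General error queue is out of sync with frame queue. ts=54628931184736 '
    'sof_ts=54628998768736 gerror_code=2 gerror_data=a2 notify_bits=0"'
)


def parse_rce_fields(line: str) -> dict:
    """Collect every key=value token from an RCE error line, values kept as strings."""
    return dict(re.findall(r"\b(\w+)=(\w+)", line))


fields = parse_rce_fields(RCE_LINE)
# Assumption: both timestamps are decimal and share a unit, so they can be subtracted.
sof_minus_ts = int(fields["sof_ts"]) - int(fields["ts"])
print(fields["gerror_code"], fields["gerror_data"], fields["notify_bits"], sof_minus_ts)
```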

Thank you

hello jhnlmn,

it looks like the error handling mechanism did not work; we can reproduce the issue locally.
test environment: l4t-r35.3.1 + AGX Orin + IMX274.

let us check this issue internally.

Would like to second this issue, we have the same problems. If there is anything we can do to help, let us know!

(see JP5.1 nvarguscamera doesn't recover from single NVCSI failure)

@JerryChang
Any progress? Any hope of fixing it? We have an important demo next month and need to make the camera stable.
If Nvidia cannot fix this SW, would it be possible to get the sources for the camera FW (maybe under NDA) so that we can fix it ourselves?
Thank you

FYI,
here’s a pre-built update that includes Argus stability fixes. please refer to Topic 243051, comment #36.
you may apply the binaries on top of JetPack-5.1.1/l4t-r35.3.1 for verification.

We have been dealing with the Argus stability for a long time. I gave the patches a try, however it did not solve the problem - nvargus still gets locked in InvalidState and requires restart of the daemon. Attached is the latest log from 5.1.1+patches.
nvargus-daemon.txt (14.2 KB)

Update: After more testing with the patches applied I observed some failures that did result in argus exit and restart, so it looks like there is improvement, but doesn’t seem to cover all cases. Also the error messages seem to be more descriptive.
Here are 2 cases that triggered argus restart:

  1. While running 4 camera streams with gstreamer, I terminated one with Ctrl-C.
Jun 10 14:19:21 orin nvargus-daemon[1273]: === gst-launch-1.0[1712]: CameraProvider destroyed (0xffffa472d8c0)=== gst-launch-1.0[1712]: Connection closed (FFFFAA93B900)free(): invalid next size (fast)
Jun 10 14:19:22 orin systemd[1]: nvargus-daemon.service: Main process exited, code=killed, status=6/ABRT
  2. The stream fails to start:
Jun 10 15:01:34 orin nvargus-daemon[2734]: === gst-launch-1.0[3357]: Connection established (FFFF577FA900)=== gst-launch-1.0[3357]: CameraProvider initialized (0xfffecc000c20)CAM: serial no file already exists, skips storing againLSC: LSC surface is not based on full res!
Jun 10 15:01:34 orin nvargus-daemon[2734]: corrupted size vs. prev_size
Jun 10 15:01:34 orin systemd[1]: nvargus-daemon.service: Main process exited, code=killed, status=6/ABRT
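
Since these are rare, I tally them from the journal rather than watching for them. A minimal sketch, assuming the journal has been exported to plain text lines; `classify` and `tally` are hypothetical helpers, and the heap messages listed are only the two seen above (glibc can print others).

```python
from collections import Counter

# glibc heap-check messages observed in the two cases above; not an exhaustive list.
GLIBC_HEAP_MESSAGES = (
    "free(): invalid next size",
    "corrupted size vs. prev_size",
)


def classify(line: str) -> str:
    """Label one journal line as a daemon abort, a heap-corruption message, or other."""
    if "status=6/ABRT" in line:
        return "daemon_abort"
    if any(msg in line for msg in GLIBC_HEAP_MESSAGES):
        return "heap_corruption"
    return "other"


def tally(lines) -> Counter:
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)
```

Dividing the `daemon_abort` count by the number of stream starts and stops in the same period gives a rough failure rate.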

Note: These failures are not easily reproducible. They happen once in a while.

hello Agtonomy,

same as Topic 243051.
since there’s a pre-built update to address the stability issue,
let’s open a new thread to follow up on your use-case. please share the complete repro steps and also the failure rate for reference. you may also leave the topic-id here for better tracking.

Hi,
The new version of libnvargus.so and libnvscf.so appears to be worse than the previous one.
It appears to always crash on a timeout (this was already reported on JP5.1 nvarguscamera doesn't recover from single NVCSI failure - #57 by JerryChang)
The simplest way to reproduce it with any sensor, like IMX274, is to run a long capture using:
nvargus_nvraw --c 0 --file out --format "raw,jpg" --skipframes 1000
and then stop streaming:
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming
Argus always crashes somewhere inside malloc or free:
Sometimes it is:
Stack trace of thread 3733:
#0 0x0000ffff8d7b7d78 __GI_raise (libc.so.6 + 0x33d78)
#1 0x0000ffff8d7a4aac __GI_abort (libc.so.6 + 0x20aac)
#2 0x0000ffff8d7f1f40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff8d7f9344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff8d7f9c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff8d7fafac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff8d0fe548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
#7 0x0000ffff8d0fe60c _ZN3scf17EGLStreamProducerD0Ev.localalias (libnvscf.so + 0x14560c)
#8 0x0000ffff8d01cc70 _ZN3scf7Session19destroyOutputStreamEPNS_13IOutputStreamE (libnvscf.so + 0x63c70)
#9 0x0000ffff8a4187fc _ZThn8_N5Argus19EGLOutputStreamImpl7destroyEv (libnvargus.so + 0x8c7fc)
#10 0x0000ffff8da13d08 n/a (libnvargus_socketserver.so + 0x11cd08)
#11 0x0000ffff8da1423c n/a (libnvargus_socketserver.so + 0x11d23c)
#12 0x0000ffff8da2b334 n/a (libnvargus_socketserver.so + 0x134334)
#13 0x0000ffff8da2b7b8 n/a (libnvargus_socketserver.so + 0x1347b8)
#14 0x0000ffff8da2b9e8 n/a (libnvargus_socketserver.so + 0x1349e8)
#15 0x0000ffff8da2ac98 n/a (libnvargus_socketserver.so + 0x133c98)
#16 0x0000ffff8da2ae74 n/a (libnvargus_socketserver.so + 0x133e74)
#17 0x0000ffff8cb3f624 start_thread (libpthread.so.0 + 0x7624)
#18 0x0000ffff8d85549c thread_start (libc.so.6 + 0xd149c)
Sometimes:

Stack trace of thread 3692:
#0 0x0000ffff82201d78 __GI_raise (libc.so.6 + 0x33d78)
#1 0x0000ffff821eeaac __GI_abort (libc.so.6 + 0x20aac)
#2 0x0000ffff8223bf40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff82243344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff82243c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff82244fac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff81b48548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
#7 0x0000ffff81b4860c _ZN3scf17EGLStreamProducerD0Ev.localalias (libnvscf.so + 0x14560c)
#8 0x0000ffff81a66c70 _ZN3scf7Session19destroyOutputStreamEPNS_13IOutputStreamE (libnvscf.so + 0x63c70)
#9 0x0000ffff7ee627fc _ZThn8_N5Argus19EGLOutputStreamImpl7destroyEv (libnvargus.so + 0x8c7fc)
#10 0x0000ffff8245dd08 n/a (libnvargus_socketserver.so + 0x11cd08)
#11 0x0000ffff8245e23c n/a (libnvargus_socketserver.so + 0x11d23c)
#12 0x0000ffff82475334 n/a (libnvargus_socketserver.so + 0x134334)
#13 0x0000ffff824757b8 n/a (libnvargus_socketserver.so + 0x1347b8)
#14 0x0000ffff824759e8 n/a (libnvargus_socketserver.so + 0x1349e8)
#15 0x0000ffff82474c98 n/a (libnvargus_socketserver.so + 0x133c98)
#16 0x0000ffff82474e74 n/a (libnvargus_socketserver.so + 0x133e74)
#17 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#18 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3179:
#0 0x0000ffff82295f08 __GI___poll (libc.so.6 + 0xc7f08)
#1 0x0000ffff7fe98340 n/a (libcuda.so.1 + 0x23a340)
#2 0x0000ffff7fe88dd4 n/a (libcuda.so.1 + 0x22add4)
#3 0x0000ffff7fe915c8 n/a (libcuda.so.1 + 0x2335c8)
#4 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#5 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3709:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff8190a894 n/a (libnvos.so + 0xa894)
#2 0x0000ffff81909098 NvOsSemaphoreWaitTimeout (libnvos.so + 0x9098)
#3 0x0000ffff81af2000 _ZN3scf23CameraEventWorkerThread12workerThreadEPS0_ (libnvscf.so + 0xef000)
#4 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#5 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#6 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3724:
#0 0x0000ffff82291bf0 __GI___libc_read (libc.so.6 + 0xc3bf0)
#1 0x0000ffff7f61be7c n/a (libnvodm_imager.so + 0x41e7c)
#2 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#3 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#4 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3694:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff81908eb0 NvOsConditionWait (libnvos.so + 0x8eb0)
#2 0x0000ffff819d11f4 _ZNK13nvcamerautils17ConditionVariable4waitERKNS_5MutexE (libnvcamerautils.so + 0xb1f4)
#3 0x0000ffff81a875ec _ZN3scf14FiberScheduler12getNextFiberEPNS_15ExecutionThreadEPPNS_5FiberE (libnvscf.so + 0x845ec)
#4 0x0000ffff81a829ac _ZN3scf15ExecutionThread3runEv (libnvscf.so + 0x7f9ac)
#5 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#6 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#7 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 3727:
#0 0x0000ffff8159041c futex_wait_cancelable (libpthread.so.0 + 0xe41c)
#1 0x0000ffff81908eb0 NvOsConditionWait (libnvos.so + 0x8eb0)
#2 0x0000ffff819d11f4 _ZNK13nvcamerautils17ConditionVariable4waitERKNS_5MutexE (libnvcamerautils.so + 0xb1f4)
#3 0x0000ffff81ab8f1c _ZN3scf10AsyncStage18processNextRequestEv (libnvscf.so + 0xb5f1c)
#4 0x0000ffff81ab9440 _ZN3scf10AsyncStage14threadFunctionEPv (libnvscf.so + 0xb6440)
#5 0x0000ffff81909114 n/a (libnvos.so + 0x9114)
#6 0x0000ffff81589624 start_thread (libpthread.so.0 + 0x7624)
#7 0x0000ffff8229f49c thread_start (libc.so.6 + 0xd149c)

Stack trace of thread 1078:
#0 0x0000ffff822a0028 __libc_accept (libc.so.6 + 0xd2028)
#1 0x0000ffff824744f4 n/a (libnvargus_socketserver.so + 0x1334f4)
#2 0x0000ffff8247469c n/a (libnvargus_socketserver.so + 0x13369c)
#3 0x0000aaaae9c007f4 n/a (nvargus-daemon + 0x7f4)
#4 0x0000ffff821eee10 __libc_start_main (libc.so.6 + 0x20e10)
#5 0x0000aaaae9c00850 n/a (nvargus-daemon + 0x850)
#6 0x0000aaaae9c00850 n/a (nvargus-daemon + 0x850)

hello jhnlmn,

please note that you should apply the pre-built update on top of the JP-5.1.1/l4t-r35.3.1 release version.
you may also see Topic 243051, comment #28 for the test steps in brief.

Yes, I am testing with 35.3.1 now and I can reproduce the crash 100% with these commands:
nvargus_nvraw --c 0 --file out --format "raw,jpg" --skipframes 1000
In another shell
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming

It is possible that this is the same crash that casperlyngesen.mogensen reported in JP5.1 nvarguscamera doesn't recover from single NVCSI failure - #49 by JerryChang, except that I reproduced it consistently.

Did you try it yourself?

hello jhnlmn,

I don’t understand…
what’s the real use-case for abruptly stopping sensor streaming while nvargus_nvraw is running?

First of all, software must not crash. Ever.
If Argus has been seen crashing under any condition, it cannot be accepted into a mission-critical system. We are now planning to abandon Argus for that reason and maybe replace Jetson with a more reliable platform.
Second,
echo 0 | sudo tee /sys/kernel/debug/camera-video0/streaming
sleep x
echo 1 | sudo tee /sys/kernel/debug/camera-video0/streaming
is a very good test: it suspends and then resumes camera streaming,
which simulates a situation with some missing frames due to noise or other reasons.
Note that v4l2 survives this test, but Argus does not.
Once Argus passes this basic test, I will start more complicated testing by corrupting specific parts of MIPI packets to see how well the system handles it.
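
The suspend/resume test above can be scripted so it runs in a loop next to a capture. This is only a sketch: the debugfs path is the one from my commands, while `pause_streaming`, `soak`, and the timing values are names and numbers I made up for illustration. It needs root and a kernel that exposes that debugfs node.

```python
import time
from pathlib import Path

# Path taken from the commands above; adjust the video node index for other sensors.
STREAMING_NODE = Path("/sys/kernel/debug/camera-video0/streaming")


def pause_streaming(node: Path = STREAMING_NODE, off_seconds: float = 1.0,
                    sleep=time.sleep) -> None:
    """Write 0 to the streaming node, wait, then write 1 again."""
    node.write_text("0\n")
    try:
        sleep(off_seconds)
    finally:
        # Always try to resume, even if the wait is interrupted.
        node.write_text("1\n")


def soak(cycles: int, off_seconds: float, on_seconds: float,
         node: Path = STREAMING_NODE, sleep=time.sleep) -> None:
    """Repeat the suspend/resume cycle, leaving the stream running in between."""
    for _ in range(cycles):
        pause_streaming(node, off_seconds, sleep)
        sleep(on_seconds)


if __name__ == "__main__":
    # Example values only: ten short dropouts, five seconds apart.
    soak(cycles=10, off_seconds=0.2, on_seconds=5.0)
```

The `sleep` parameter exists only so the logic can be exercised against an ordinary file without waiting.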

hello jhnlmn,

I meant… why are you testing with the nvargus_nvraw application?
this is a tool for capturing frames. it does not have error handling capability.

have you ever tested the steps below with the nvarguscamerasrc plugin?
step1) launch gst pipeline to enable camera preview
$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),framerate=30/1,format=NV12' ! nvvidconv ! xvimagesink
step2) send commands on the terminal to shut down the stream,
# cd /sys/kernel/debug/camera-video0
# echo 0 > streaming

Because it is not nvargus_nvraw that crashes, but nvargus-daemon,
more specifically the new libnvscf.so from Topic243051_Jun05.7z
Take a look at this core dump:
#2 0x0000ffff81efaf40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff81f02344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff81f02c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff81f03fac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff81807548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
This means that the new libnvscf.so caused heap corruption, a double free, or some similar error.
The old libnvscf.so (from 35.3.1) did not do that.
It should be trivial for your developers to fix this before anybody saw it.
Thank you
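
The way I read the trace can be automated: skip the libc frames and report the first frame in any other library. A sketch under that assumption; `first_frame_outside` is a hypothetical helper, and the regular expression only targets the frame layout shown in this thread. The symbol it finds here, `_ZN3scf17EGLStreamProducerD2Ev`, should demangle to the `scf::EGLStreamProducer` destructor.

```python
import re

# Matches lines such as: "#6 0x0000ffff81807548 symbol (libnvscf.so + 0x145548)"
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+(\S+)\s+\((\S+) \+ (0x[0-9a-f]+)\)")

TRACE = """\
#2 0x0000ffff81efaf40 __libc_message (libc.so.6 + 0x6df40)
#3 0x0000ffff81f02344 malloc_printerr (libc.so.6 + 0x75344)
#4 0x0000ffff81f02c54 malloc_consolidate (libc.so.6 + 0x75c54)
#5 0x0000ffff81f03fac _int_free (libc.so.6 + 0x76fac)
#6 0x0000ffff81807548 _ZN3scf17EGLStreamProducerD2Ev (libnvscf.so + 0x145548)
"""


def first_frame_outside(trace: str, skip_libs=("libc.so.6",)):
    """Return (index, symbol, library, offset) of the first frame not in skip_libs."""
    for match in FRAME_RE.finditer(trace):
        index, symbol, lib, offset = match.groups()
        if lib not in skip_libs:
            return int(index), symbol, lib, int(offset, 16)
    return None


print(first_frame_outside(TRACE))
```

The first non-libc frame is only the caller that hit the corrupted heap; the code that actually damaged it may have run earlier.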

have you ever tested the steps below with the nvarguscamerasrc plugin?

No. Even though I do not see crashes with nvarguscamerasrc, nvargus_nvraw still fails after even a brief MIPI corruption, with messages:

Jun 28 23:30:15 orin2 kernel: [RCE] VM0 deactivating.VM0 activating.ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=770161270944 sof_ts=770163066144 gerror_code=2 gerror_data=400262 notify_bits=20010"
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error Timeout: Sending critical error event for Session 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/api/Session.cpp, function sendErrorEvent(), line 992)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 556 draining session frameStart events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 532)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 555 draining session frameEnd events 2
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: PowerServiceCore:handleRequests: timePassed = 2628
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 557 draining session frameStart events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameStart(), line 532)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 556 draining session frameEnd events 2
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Timeout!! Skipping requests on sensor GUID 1, capture sequence ID = 557 draining session frameEnd events 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 635)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: Module_id 30 Severity 2 : (fusa) Error: Timeout  propagating from:/capture/src/fusaViHandler.cpp 776
Jun 28 23:30:20 orin2 nvargus-daemon[1114]: PowerServiceCore:handleRequests: timePassed = 2404
Jun 28 23:30:20 orin2 kernel: bwmgr API not supported
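
Because the log is repetitive, it is easier to compare runs by counting which SCF error is raised from which function. A sketch, assuming the journal text is available as one string; `summarize_scf_errors` is a hypothetical helper and the patterns only target the message layout seen above, where the "(in ..., function ..., line ...)" part may sit on the same line or on the next one.

```python
import re
from collections import Counter

# Strip the "date host nvargus-daemon[pid]: " prefix so wrapped messages rejoin.
PREFIX_RE = re.compile(r"^.*? nvargus-daemon\[\d+\]: ?", re.MULTILINE)
# Error kind, then the nearest following "function name(), line N".
SCF_RE = re.compile(r"SCF: Error (\w+):.*?function (\w+)\(\), line (\d+)", re.DOTALL)

SAMPLE = """\
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error Timeout: Sending critical error event for Session 1
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/api/Session.cpp, function sendErrorEvent(), line 992)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error InvalidState: Sensor 1 already in same state
Jun 28 23:30:17 orin2 nvargus-daemon[1114]:  (in src/services/capture/CaptureServiceDeviceSensor.cpp, function setErrorState(), line 100)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
Jun 28 23:30:17 orin2 nvargus-daemon[1114]: SCF: Error BadParameter: CC has already been disposed (in src/components/CaptureContainerManager.cpp, function dispose(), line 161)
"""


def summarize_scf_errors(journal_text: str) -> Counter:
    """Count (error kind, reporting function) pairs in nvargus-daemon output."""
    body = PREFIX_RE.sub("", journal_text)
    return Counter((kind, func) for kind, func, _line in SCF_RE.findall(body))


for (kind, func), count in summarize_scf_errors(SAMPLE).most_common():
    print(count, kind, func)
```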

Note that if I read video directly from /dev/video0, then it tolerates corruption much better.

The second problem is that Argus does not appear to check the image CRC.
If corruption affects only pixels and not frame packet headers, then it shows a corrupted image on the screen.
There must be a way to check the CRC and reject corrupted frames.

may I know the details for testing with nvargus_nvraw application?

nvargus_nvraw is the best app for Argus testing because it returns a raw image, it returns a still image, and it is easier to run than an entire gstreamer pipeline.
However, on 6/20/2023 you requested that I retest with gstreamer, and I did; the error which I posted above was with gstreamer (or with other Argus sample apps).

how do you interrupt the stream? are you sending software commands on the terminal to simulate it?