Argus errors with high CPU load and subsequent issues

Yes, the streaming works with the devtalk1065378_Oct25_prebuilts.tar.gz binaries. Ideally we’d be able to have all six cameras stream indefinitely (assuming no thermal issues). In practice, we’d have them running for ~8-10 hours at a time. The longest test I’ve done so far is ~2 hours.

I’ve been testing in a typical office environment with fluorescent lighting and some sunlight coming in through a window. It’s not low-light. I will try enabling ae-lock tomorrow morning.

I'm also curious about what was changed in the binaries you provided. You mentioned increased internal queue buffers; was that the only change? It still seemed to hit the ISP-related error, but there was no message spam or timeout in acquireFrame() when the error happened. Why was that, exactly? That is actually the behavior we would like - if the ISP encounters an error on a frame, simply drop that frame and keep going - rather than having the CaptureSession stop working completely.

hello kes25c,

the camera architecture already supports multi-cam use-cases.
several buffers are passed from the low-level sensor driver layer through the internal camera core stack, and finally rendered for display.
this Argus failure is due to a race condition in the buffer hand-off between stages, which is only observed under high CPU load.

FYI, according to the Jetson TX2 Series Software Features documentation (CSI and USB Camera Features),
you'll also note that it does not claim to support running a multi-cam solution with the CPU stressed.

since I'm not able to recreate the failure in comment #6 with the argus_camera application,
I suggest you also review your application implementation.
thanks

  • [i]since I'm not able to recreate the failure in comment #6 with the argus_camera application, I suggest you also review your application implementation.[/i]

argus_camera produces the same errors for me. In fact, it produces them more frequently. Having AE-Lock enabled made no difference. Are you testing in multi-process or single-process mode? I built argus_camera in single-process.
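(For reference, here is roughly how I build it - a sketch assuming the Argus sample sources shipped with the L4T Multimedia API; the install path and the DISABLE_MULTIPROCESS CMake option are assumptions on my part:)

# build the Argus samples in single-process mode (path and option assumed)
cd /usr/src/tegra_multimedia_api/argus    # location varies by release
mkdir -p build && cd build
cmake -DDISABLE_MULTIPROCESS=ON ..        # ON = single-process build
make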

  • [i]several buffers are passed from the low-level sensor driver layer through the internal camera core stack, and finally rendered for display. this Argus failure is due to a race condition in the buffer hand-off between stages, which is only observed under high CPU load.[/i]

I'm still confused as to what changed in the binaries you provided. It's still hitting the 'Failed to fetch stats for frame XXX' error; it just isn't spamming error messages afterward and is able to continue. Is that an effect of increasing the internal queue sizes? I figured that increasing the queue sizes would only help prevent the error… not change the behavior after the error is encountered.

Also, it isn't only observed under high CPU load. High CPU load just triggers it more easily. If there's a race condition that can happen under high CPU load, then it can happen under low CPU load. You're just hoping that it's infrequent, which isn't a viable plan for something in production.

Interestingly, I noticed that when other errors happen (i.e., not the 'Failed to fetch stats for frame XXX' one) I still get the high-rate message spam even with the new binaries. For example, when running argus_camera, if I switch the sensor mode while in multi-session mode I almost always hit this error:

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)

And then I get the high-rate repeated message spam:

SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

  • [i]FYI, according to the Jetson TX2 Series Software Features documentation (CSI and USB Camera Features), you'll also note that it does not claim to support running a multi-cam solution with the CPU stressed.[/i]

What is the definition of 'CPU stressed', so that we know when it should or should not work?

hello kes25c,

What is the definition of 'CPU stressed', so that we know when it should or should not work?
it is NOT a realistic use-case to stress the CPU with a test tool (i.e. $ stress --cpu 5).
please share what your real use-case is.

Are you testing in multi-process or single-process mode?
we're running Argus Version 0.97.3 in multi-process mode.

in addition,
we could share pre-built libraries that tune the buffers in the queue; the major purpose is to avoid the crash, and also to make the error decoding stats non-fatal, so that it just returns a bad status.
please try another set of pre-built libraries, devtalk1065378_Oct29_prebuilts.tar.gz.
these also increase the timeout values and configure the internal queue buffers.

again, a stress test tool that maximizes CPU usage will induce frame drops; that's life.
you'll need to evaluate the real workload, optimize the system, adjust priorities, etc.
I also suggest you apply the system configuration; please refer to comment #2 for the steps.

to evaluate the workload, you may use the default Ubuntu tools (i.e. $ top) to evaluate your system loading,
or you could use the nsight-systems tools to profile CPU usage.
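for example, a quick way to watch just the camera daemon's CPU usage (assuming the daemon process is named nvargus-daemon, as in the default setup):

# sample nvargus-daemon CPU usage once per second for one minute
$ top -b -d 1 -n 60 | grep nvargus-daemon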
thanks

devtalk1065378_Oct29_prebuilts.tar.gz (2.82 MB)

>>> it is NOT a realistic use-case to stress the CPU with a test tool (i.e. $ stress --cpu 5).
please share what your real use-case is.

This was mentioned in the other thread, but you are focusing too much on the use of the stress program. It is only an accelerant. It doesn't have to be stress --cpu 5. It happens with stress --cpu 3 or 2 or 1… just not as often. I've hit them with only the argus_camera app running in multi-session mode and nothing else, and we have hit them under normal load. For example, just doing some image pre-processing and depth on the GPU with minimal CPU usage besides Argus. That's why I came to the forum. The stress program is simply a tool that helps to trigger the issue more readily.

>>> again, a stress test tool that maximizes CPU usage will induce frame drops; that's life.
you'll need to evaluate the real workload, optimize the system, adjust priorities, etc.
I also suggest you apply the system configuration; please refer to comment #2 for the steps.

Frame drop is perfectly acceptable. I expect that to happen as the system load increases. The issue is not dropping frames; the issue is errors happening under the covers with no clean way to recover, as mentioned by ben.lemond in the original thread. The current behavior results in a huge amount of log spam, and the CaptureSession gets into a non-recoverable state. Trying to get the cameras going again after this happens only occasionally works, and sometimes results in a segfault. Those are the issues. Currently, it seems that all of these errors are treated as fatal even though it's possible to drop the frame and continue (at least for some of the errors). Argus should provide a better way for client apps to deal with these errors that can happen under the covers.

>>> we could share pre-built libraries that tune the buffers in the queue; the major purpose is to avoid the crash, and also to make the error decoding stats non-fatal, so that it just returns a bad status.

OK, that's what I was wondering. So you made the decoding-stats error non-fatal, which is why it can continue after that error happens. That behavior is much better for us, but unfortunately it only helps with that one error.

My intention is not to be combative. It's to get a reliable system in place. You say that 'CPU stressed' isn't supported, but don't define what that means. If necessary, we can keep our code off two of the cores in order to reserve them for Argus, but I have not seen evidence that this will eliminate the issue - only make it less likely to happen, in the same way that lower CPU load does. Since the issue occurs with only argus_camera running, it seems unlikely to me that reserving cores for Argus will fix things. Maybe you could explain why these issues become impossible if we reserve two or more cores for Argus?
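(For concreteness, this is roughly how we would do that reservation - a sketch, assuming a six-core TX2 with cores 0-1 set aside for the camera stack; the application name is a placeholder:)

# pin our application to cores 2-5, leaving cores 0-1 free for Argus
$ taskset -c 2-5 ./our_camera_app
# optionally pin the already-running camera daemon to the reserved cores
$ sudo taskset -cp 0,1 $(pgrep nvargus-daemon)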

hello kes25c,

I've hit them with only the argus_camera app running in multi-session mode and nothing else, and we have hit them under normal load. For example, just doing some image pre-processing and depth on the GPU with minimal CPU usage besides Argus. That's why I came to the forum. The stress program is simply a tool that helps to trigger the issue more readily.

okay, I might have jumped to conclusions too fast.
according to CSI and USB Camera Features, preview performance of 30 frames/second at 1920×1440 resolution with six OV5693 sensors running simultaneously is supported.

FYI,
we've also verified there's no such failure either with or without the CPU stress tool.
let's check whether this issue is related to the sensor drivers.

  1. may I know which sensor you're working with, and what the output resolution and frame-rate settings are?
  2. besides running in Argus multi-session mode, could you please try launching six Argus instances to check whether you can reproduce the same failures?
  3. you might also confirm that 6-cam streaming works with the standard v4l2 controls (see the loop sketch after the commands below),
    for example,
$ v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap
$ v4l2-ctl -d /dev/video1 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap
...
$ v4l2-ctl -d /dev/video5 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap
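the same commands as a shell loop, assuming the six sensors enumerate as /dev/video0 through /dev/video5:

# start all six capture streams in the background, then wait
for i in 0 1 2 3 4 5; do
  v4l2-ctl -d /dev/video$i --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap &
done
wait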

Hi Jerry,

I was working on a streaming application using Argus with a 3-camera setup. I ran into the same issue:

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)

And then I get the high-rate repeated message spam:

SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

I am running R28.2.1 on my TX2 board with 3 IMX185 sensors. The drivers were provided by Leopard Imaging. Currently, Leopard Imaging does not have driver support for R32.2, which seems to have fixed some of these issues. Would R32.1 have the same fixes? Or would I need to just wait for LI to come out with support for R32.2?

hello vision2,

to clarify, we did have bugs in R32.1 for the multi-camera use-case.
therefore, the next L4T release, R32.2, includes fixes to address some of the race-condition failures.

since we suggest basing multi-camera use-case implementations on R32.2,
I suggest you ask your sensor vendor for R32.2 driver support.
thanks

Hi JerryChang,

Let me clarify the L4T version you mentioned.

In https://developer.nvidia.com/embedded/downloads,
L4T 32.2.3 was already released on 2019/11/19.

  • What does ‘R32.2’ mean?
  • Is the fix you mentioned the same as the patch attached in comment #4?

hello rary,

there are JetPack releases and also L4T source packages.
however, this is a pre-built library update, which is only released with the JetPack images.

therefore, please check the JetPack Archive for details.
thanks

Hi JerryChang,

Sorry for the late response. I was focused on other tasks. To answer your questions:

>>> 1) may I know which sensor you're working with, and what the output resolution and frame-rate settings are?

We’re using the OV10640. 1280x1080 resolution @ 30fps. Output of v4l2-ctl -d /dev/video0 --list-formats-ext:

ioctl: VIDIOC_ENUM_FMT
	Index       : 0
	Type        : Video Capture
	Pixel Format: 'BA12'
	Name        : 12-bit Bayer GRGR/BGBG
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)

	Index       : 1
	Type        : Video Capture
	Pixel Format: 'BG12'
	Name        : 12-bit Bayer BGBG/GRGR
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)

>>> 2) besides running in Argus multi-session mode, could you please try launching six Argus instances to check whether you can reproduce the same failures?

I should be able to test this tomorrow, and will let you know the result.

>>> 3) you might also confirm that 6-cam streaming works with the standard v4l2 controls,

I tried 6-camera streaming with v4l2 and stress --cpu 5. It ran for 30 minutes (twice) without problems, maintaining 30fps. The command used for each of /dev/video0 through /dev/video5 (54000 frames at 30 fps is the 30-minute run) was:

v4l2-ctl -d /dev/videoX --set-fmt-video=width=1280,height=1080,pixelformat=BG12 --set-ctrl bypass_mode=0 --stream-mmap --stream-count 54000

>>> we've also verified there's no such failure either with or without the CPU stress tool.

Was that using argus_camera in multi-session mode with six cameras and stress --cpu 5? Or some other setup? And running for how long?

With 32.2, as long as we set a high (-20) priority for the Argus-related threads, this issue happens pretty rarely under our normal system load.
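(For reference, a sketch of how we set the priority - assuming the Argus threads live in our own capture process, whose name here is a placeholder; negative nice values need root:)

# launch the capture app with elevated priority; the Argus worker
# threads it spawns inherit the nice value
$ sudo nice -n -20 ./camera_app
# optionally raise the priority of the camera daemon as well
$ sudo renice -n -20 -p $(pgrep nvargus-daemon)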

I tested running six separate instances of argus_camera in capture mode, and am able to trigger the issue.

I also observed a separate issue related to multi-session mode, but it fits here since it results in the same high-rate repeated error spam. When running argus_camera (single-process) in multi-session mode, switching the sensor mode almost always triggers the following error (which is then followed by the high-rate repeating error messages):

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)

In our case, there are two sensor modes: 0 (HDR) and 1 (linear). The steps are:

  1. start argus_camera
  2. switch to multi-session mode
  3. wait a little while (for all the cameras to show up and stream)
  4. switch the sensor mode

Interestingly, the behavior appears different between the single- and multi-process builds of argus_camera. If I repeat these steps with the multi-process build, I don't see any errors show up in /var/log/syslog; the argus_camera app just hangs and becomes unresponsive.

hello kes25c,

  1. may I know whether the errors reported in comment #20 occurred with the pre-built libraries from comment #12, devtalk1065378_Oct29_prebuilts.tar.gz, already applied?

  2. is this failure only reproducible with the SDR/WDR mode switch?
    could you please also confirm the status when streaming all of them in the same sensor mode? also, how many cameras are you working with?
    thanks

I've hit this same problem on a Nano with a single Raspberry Pi (v2) camera attached, with the system sitting doing very little:

Sep 22 06:04:00 maverick-nano nvargus-daemon[4535]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
Sep 22 06:04:00 maverick-nano nvargus-daemon[4535]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

It created 15 GB+ of syslog until the disk filled. Seriously guys, if you're not going to fix the underlying problem, at least fix the syslog spamming so it doesn't take the entire system down with it. This is a ridiculous bug to have around for years without a simple fix.
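(A stopgap so the spam at least can't fill the disk - a sketch using standard systemd-journald rate limiting, assuming syslog is fed from journald as on a stock JetPack image:)

# /etc/systemd/journald.conf, under the [Journal] section:
# cap how fast any one service can log; messages beyond the
# burst within the interval are dropped
RateLimitIntervalSec=30s
RateLimitBurst=1000

# then apply it:
$ sudo systemctl restart systemd-journald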

Rebooted the Nano; it immediately starts spamming again when it restarts:

top - 08:31:25 up 22 min,  1 user,  load average: 2.95, 3.04, 2.43
Tasks: 225 total,   2 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.2 us, 46.5 sy,  0.0 ni, 20.5 id,  9.4 wa,  0.3 hi,  0.2 si,  0.0 st
KiB Mem :  4059356 total,   508512 free,   828656 used,  2722188 buff/cache
KiB Swap:  2029664 total,  2029664 free,        0 used.  2921924 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3386 syslog    20   0  359388   7136   3104 S 133.3  0.2  28:45.98 rsyslogd
 1906 root      19  -1   80012  25940  25144 R  99.3  0.6  21:45.12 systemd-journal
 4543 root      20   0  9.805g 153676  33120 S  39.9  3.8   7:57.23 nvargus-daemon
 1854 root      -2   0       0      0      0 S   2.6  0.0   0:10.21 mmcqd/0
 6385 root      20   0       0      0      0 D   2.0  0.0   0:00.74 kworker/u8:3
 4546 mav       20   0   30324  22312   8776 S   1.0  0.5   0:11.85 python3
 5299 mav       20   0  892936  44380  18744 S   0.7  1.1   0:08.68 mavros_node
 6995 root      20   0    9180   3652   2904 R   0.7  0.1   0:00.11 top
 4683 mav       20   0  258504  30708   8224 S   0.3  0.8   0:06.30 python3
 5276 mav       20   0  258320  29868   8508 S   0.3  0.7   0:06.41 python3

The system is quiet except for all the logging activity. It runs a process doing simple video streaming through GStreamer. Perhaps the nvargus-daemon is self-triggering stress conditions with the insane amount of logging?

This insane logging activity only occurs when GStreamer is active. The system is running JetPack 4.4 with the latest kernel, 4.9.140-tegra.

@kes25c were you able to resolve these issues? I am on JetPack 4.3 and seeing the same error messages when streaming multiple cameras using GStreamer.

Thanks,
Sanjay

The issue was never fixed AFAIK. However, the combination of the 32.2 release (JetPack 4.2.1, IIRC) and bumping the thread priority for the Argus threads way up (we use nice -15) basically eliminated this particular problem for us. We run six cameras under a pretty heavy system load for several hours a day, and haven't hit it in a while. We are using the Argus library directly, not going through GStreamer.

Any update or plan for fixing this bug? At the very least, if something crashes inside, can the service be restarted without rebooting the whole system?
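(A restart by hand does seem possible - a sketch, assuming the stock nvargus-daemon systemd unit that ships with JetPack; client applications will still need to reconnect or restart afterwards:)

# restart the Argus daemon without rebooting the board
$ sudo systemctl restart nvargus-daemon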

@kes25c @SanjayD @sunxishan I added a repo with a description of all the fixes for getting the camera working. It includes information on how to run a Python OpenCV example and avoid the error (Argus) Error InvalidState: (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109),
and a fix for the system rebooting (bluedroid_pm module).
Hope it helps.
