Argus high cpu usage streaming cameras

We have six CSI cameras that we capture images from using Argus at 30 fps. We’ve noticed that just streaming the cameras by calling iFrameConsumer->acquireFrame() incurs a fairly high CPU load (~110% according to top), so it takes more than a full core just to stream the cameras without any other processing. We’d love to reduce that if possible. We are currently using JetPack 4.2.2 with Argus in single-process mode.
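For reference, the per-camera consumer loop looks roughly like this (a minimal sketch; the CameraProvider/CaptureSession/OutputStream setup and error handling are trimmed):

```cpp
#include <Argus/Argus.h>
#include <EGLStream/EGLStream.h>

using namespace Argus;
using namespace EGLStream;

// Minimal sketch of the per-camera consumer thread. The CameraProvider,
// CaptureSession, OutputStream, and repeating request are set up elsewhere.
void consumeFrames(OutputStream *outputStream)
{
    UniqueObj<FrameConsumer> consumer(FrameConsumer::create(outputStream));
    IFrameConsumer *iFrameConsumer = interface_cast<IFrameConsumer>(consumer);

    while (true)
    {
        // This is the only Argus call in the loop; the SCF_Execution and
        // CaptureSchedule load shows up in Argus' background threads, not here.
        UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);
        if (!iFrame)
            break;

        Image *image = iFrame->getImage();
        (void)image; // hand off to downstream processing here
    }
}
```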

Based on profiling results, the biggest parts appear to be:

SCF_Execution. It looks like there are two SCF_Execution threads per camera (I see 12 of those threads). Those alone take up ~75%. According to the profiling, something like ~10% of SCF_Execution is just powf() calls. I don’t have symbols to know what most of the other processing is doing, but it’s almost all in libnvscf.so and libnvos.so. Is it possible some of this is image processing, or does that all happen on the ISP?

CaptureSchedule. This thread takes up ~20%. There are no useful symbols, but most of the time is again in libnvos.so and libnvscf.so.

After acquireFrame we get the IImageNativeBuffer and call copyToNvBuffer to scale the image and save it into another NvBuffer. That also takes a good bit of CPU time (~2.5% per camera) even though it uses the VIC. In the profiling results, the CPU usage of copyToNvBuffer breaks down into VicConfigure, VicCreateSession, VicExecute, and VicFreeSession. Of that, only ~40% is in VicExecute; the rest is in Configure, CreateSession, and FreeSession, so a lot of the time appears to be per-call overhead.
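The scaling step is roughly the following (a sketch; `dstFd` is the dmabuf fd of a smaller NvBuffer we allocate once up front, e.g. with NvBufferCreateEx):

```cpp
#include <EGLStream/NV/ImageNativeBuffer.h>

// Sketch of the scale-and-copy after acquireFrame(). `iFrame` comes from the
// consumer loop above; `dstFd` points at a pre-allocated, smaller NvBuffer,
// so copyToNvBuffer() scales on the VIC while it copies.
int scaleIntoNvBuffer(EGLStream::IFrame *iFrame, int dstFd)
{
    EGLStream::NV::IImageNativeBuffer *iNativeBuffer =
        Argus::interface_cast<EGLStream::NV::IImageNativeBuffer>(iFrame->getImage());
    if (!iNativeBuffer)
        return -1;

    // The per-call VicCreateSession/VicConfigure/VicFreeSession overhead in
    // the profile comes from inside this call, on top of the VicExecute work.
    if (iNativeBuffer->copyToNvBuffer(dstFd) != Argus::STATUS_OK)
        return -1;
    return 0;
}
```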

The total time for SCF_Execution and CaptureSchedule seems to scale linearly with the frame rate. We can reduce the CPU usage by ~33% by running the cameras at 10fps instead of 30fps, but that isn’t a great solution. Is there anything else we could do to reduce the CPU load?

EDIT: If it matters, I’m using CAPTURE_INTENT_PREVIEW in createRequest and leaving everything else (edge enhancement mode, denoise mode, etc.) at defaults.
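Concretely, the request setup looks roughly like this (a sketch; `iCaptureSession` and `outputStream` are assumed to already exist, and the frame-duration line is how we drop to 10fps for the test above):

```cpp
// Request setup sketch: CAPTURE_INTENT_PREVIEW, everything else at defaults.
Argus::UniqueObj<Argus::Request> request(
    iCaptureSession->createRequest(Argus::CAPTURE_INTENT_PREVIEW));
Argus::IRequest *iRequest = Argus::interface_cast<Argus::IRequest>(request);
iRequest->enableOutputStream(outputStream);

// Optional: cap the frame rate (here 10 fps) via the source settings.
Argus::ISourceSettings *iSourceSettings =
    Argus::interface_cast<Argus::ISourceSettings>(iRequest->getSourceSettings());
iSourceSettings->setFrameDurationRange(
    Argus::Range<uint64_t>(100000000ULL)); // 100 ms per frame

iCaptureSession->repeat(request.get());
```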

EDIT 2: After more testing, it looks like the number of SCF_Execution threads doesn’t depend on the number of cameras being streamed; there are always 12 SCF_Execution threads regardless. The total CPU usage by Argus (mainly in SCF_Execution and CaptureSchedule) scales mostly linearly with the total number of images per second (#_cameras * FPS).

EDIT 3: Argus creates a somewhat insane number of threads. From what I can tell, simply initializing Argus creates 25 threads, and an additional ~15 are created per streamed camera. Most aren’t using much (or any) CPU and seem to be tied to different parts of the image pipeline: DeFogStage, GpuBlitStage, SharpenStage, VideoStabilization, HdfxStage, AoltmStage, etc. I’m really curious now exactly what the SCF_Execution and CaptureSchedule threads are doing to eat up so much CPU, what processing happens on the ISP vs. the CPU or GPU, and why so many threads are required.

Updated test results are below:

| Number of cameras | CPU usage (Irix mode off) |
|-------------------|---------------------------|
| 1                 | 1.9 %                     |
| 2                 | 3.8 %                     |
| 3                 | 5.9 %                     |
| 4                 | 7.4 %                     |
| 5                 | 9.3 %                     |
| 6                 | 11.5 %                    |

Test command:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ ./argus_camera -d 0 &   (repeated with -d 1, 2, 3, 4, 5 for the other cameras)

Hi ShaneCCC,

For your test, what version of JetPack did you use, and what was the camera FPS and resolution?

When I test using the argus_camera app, I don’t see any significant difference from my own app in terms of Argus’ CPU usage. The results closely match what I stated in my original post.

Also, could you answer my other questions? What are the SCF_Execution and CaptureSchedule threads doing to use so much CPU? For reference, I have a thread that performs a remap operation (to undistort and rectify an image) on 120 small (320x270) images per second, and it uses only ~20% of a core. Argus uses roughly the same amount of CPU just to stream one camera at 30fps! It is either doing a significant amount of processing or a lot of busy waiting.
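That comparison workload is essentially the following (a sketch, assuming OpenCV; `mapX`/`mapY` are precomputed once at startup with cv::initUndistortRectifyMap):

```cpp
#include <opencv2/imgproc.hpp>

// Undistort + rectify one image using precomputed remap tables. Running this
// on ~120 images/s (320x270) costs us roughly 20% of a core.
void rectify(const cv::Mat &src, cv::Mat &dst,
             const cv::Mat &mapX, const cv::Mat &mapY)
{
    cv::remap(src, dst, mapX, mapY, cv::INTER_LINEAR);
}
```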

The result is from r32.3.1 and the resolution is 2592x1944 @ 30fps.
I don’t think there is much difference with r32.4.x.

Hi ShaneCCC,

Do you have any idea why I see higher usage than you do? Your reported usage is roughly 11.4% per camera; mine is 18% per camera.

As another reference point, I created a program that gets the raw BG12 data using V4L2 and does a linear demosaic on the CPU. With the cameras running at 1824x940 @ 30fps, this app uses ~7% per camera. That’s less than half of what Argus uses per camera.
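That comparison program is essentially this (a sketch; the device node, buffer setup, and the demosaic itself are condensed, and the SBGGR12 fourcc is an assumption for our sensor):

```cpp
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

void demosaicLinear(const void *bayer); // our CPU bilinear debayer (not shown)

void captureRaw(void *mmappedBuffers[], volatile bool &running)
{
    int fd = open("/dev/video0", O_RDWR);

    // Ask the VI for raw 12-bit Bayer frames; no ISP involved on this path.
    v4l2_format fmt = {};
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 1824;
    fmt.fmt.pix.height = 940;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_SBGGR12;
    ioctl(fd, VIDIOC_S_FMT, &fmt);

    // ... VIDIOC_REQBUFS / mmap / VIDIOC_QBUF / VIDIOC_STREAMON as usual ...

    while (running)
    {
        v4l2_buffer buf = {};
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_DQBUF, &buf);             // raw Bayer frame
        demosaicLinear(mmappedBuffers[buf.index]); // ~7% of a core per camera
        ioctl(fd, VIDIOC_QBUF, &buf);
    }
}
```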

I’m still very curious what the SCF_Execution and CaptureSchedule threads are doing to cause so much CPU load. Has Nvidia spent much time optimizing or profiling that code? If the code is publicly available I can look into this myself, but I’m guessing it isn’t.

I found several other posts (linked at the bottom of this post) on these forums about the unexpectedly high CPU usage of Argus. They seem to match my measured CPU loads: ~15-20% of a core per camera at 30fps and ~30-40% of a core per camera at 60fps. Unfortunately, no answers are provided as to why Argus uses so much CPU. It would be awesome if someone from Nvidia could answer these questions:

  1. What is Argus doing to cause such high CPU load since the heavy computations should be offloaded to separate hardware?
  2. Can Argus’ CPU usage be improved or is there a technical reason it cannot be?
  3. If Argus’ CPU usage can be improved, does Nvidia plan on addressing it? If so, what’s the timeline for that?

For a single-camera system the current usage might not be problematic, but it adds up very quickly for multi-camera systems. It definitely has an impact on how we proceed algorithm-wise.


Argus runs a lot of ISP features and algorithms.
I think ~11% CPU usage for 6 cameras at 2592x1944 @ 30fps should be fine.
We may have a plan to reduce the pipeline for performance-focused use cases, but I can’t give a timeline now.

Isn’t the algorithm processing done on the ISP, though, with Argus just coordinating buffers between the VI, ISP, and client app and configuring the ISP? It doesn’t seem like that should be so CPU heavy. I’m still confused how the path that has access to the ISP manages to consume twice as much CPU as the path that debayers 1824x940 images on the CPU, or why Argus needs to spawn 40 threads to stream one camera.

It would be great if Nvidia provided a more lightweight way to access the ISP. I saw on the Jetson roadmap that JetPack 4.5 lists support for CSI cameras using the ISP via V4L2. Is that going to be more CPU friendly?


Do you mean nvv4l2camerasrc?
I think it would be similar to Argus if it includes the ISP pipeline.

In case it’s useful for somebody else, I was able to save some CPU in our threads that interact with Argus by using an IBufferOutputStream (acquireBuffer) instead of an IFrameConsumer (acquireFrame) to get frames. With IBufferOutputStream, calling copyToNvBuffer can be avoided. I’m not sure why, but copyToNvBuffer is rather slow (about 4x as slow as calling NvBufferTransform, based on my timings). Doing this saved ~3% of a core per stream, so ~18-20% of a core total in my case with 6 cameras at 30fps. CaptureSchedule and SCF_Execution still eat up a lot, but any improvement helps. A rough sketch of the new path is below.
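The acquire/transform/release loop ends up roughly like this (a sketch; the NvBuffer creation, EGLImage wrapping, and Buffer registration on the stream are condensed into comments, and `bufferToFd` is just how I map each Argus::Buffer back to its dmabuf fd):

```cpp
#include <Argus/Argus.h>
#include <nvbuf_utils.h>
#include <map>

// Sketch of the acquireBuffer() + NvBufferTransform() path (JetPack 4.x
// nvbuf_utils). Setup (condensed): NvBuffers created with NvBufferCreateEx(),
// wrapped with NvEGLImageFromFd(), registered on the BUFFER-type stream via
// createBuffer(), and primed with releaseBuffer().
void bufferLoop(Argus::IBufferOutputStream *iBufferOutputStream,
                std::map<Argus::Buffer*, int> &bufferToFd,
                int scaledFd /* smaller NvBuffer to scale into */)
{
    while (true)
    {
        Argus::Status status;
        Argus::Buffer *buffer =
            iBufferOutputStream->acquireBuffer(Argus::TIMEOUT_INFINITE, &status);
        if (status != Argus::STATUS_OK)
            break;

        // Scale with the VIC. In my timings this costs noticeably less CPU
        // than IImageNativeBuffer::copyToNvBuffer() for the same work.
        NvBufferTransformParams params = {};
        params.transform_flag = NVBUFFER_TRANSFORM_FILTER;
        params.transform_filter = NvBufferTransform_Filter_Smart;
        NvBufferTransform(bufferToFd[buffer], scaledFd, &params);

        // Return the buffer so Argus can fill it with a new capture.
        iBufferOutputStream->releaseBuffer(buffer);
    }
}
```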


Hi,
The issue is probably specific to the camera module. Could you share which camera module you use? Our camera partners are listed at
https://developer.nvidia.com/embedded/community/ecosystem?specialty=camera

If you got the module from one of our partners, we can check with them to clarify the CPU usage deviation.

Hi,
The result we see is from running 6 ov5693 cameras:
Argus high cpu usage streaming cameras - #3 by ShaneCCC
It is 2592x1944p30 in that case. The resolution and framerate are probably different, which may explain the deviation. It would be great if we could have more detail about the camera settings where you see higher CPU usage.

The camera setup I was using is 6 ov10650 cameras at 1824x940p30. We’re working with D3 and have brought this to their attention. However, I don’t think they have any more insight into Argus’ processing time than we do, since they don’t have access to the source code or builds with symbols either.

I still think even 11% per camera is high, considering I get ~7% grabbing the raw data with V4L2 and doing a 1824x940p30 debayer on the CPU. Unless some serious processing is being offloaded to the CPU, I have a hard time understanding what could be causing Argus’ high usage. I realize NVIDIA doesn’t want to release the source code for Argus, but I doubt that providing a rough breakdown of what the CaptureSchedule and SCF_Execution threads are doing would reveal sensitive information.

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
So you run 6 ov10650 cameras at 1824x940p30 and see CPU usage similar to the 6 ov5693 cameras at 2592x1944p30 reported above.

Is this correct?