Jetson Nano ISP functionality performance

Hi all,

I’ve been trying to get multi-camera capture working using the Nano’s ISP functionality. Our test system comprises four 1 MP RGGB Bayer cameras in 10-bit mode, streaming at 30 fps.

Until now I have been using the plain V4L2 driver, capturing RAW10 images and processing them with custom-written CUDA kernels. This setup achieves good performance (low CPU usage).
Note: to keep this post uncluttered, I have picked one representative tegrastats output that closely matches the mean usage.

RAM 1354/3956MB (lfb 426x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU[15%@1479,5%@1479,3%@1479,6%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 61%@537 VIC_FREQ 0%@140 APE 25 PLL@26.5C CPU@28.5C PMIC@100C GPU@25C AO@35C thermal@27C POM_5V_IN 2938/3146 POM_5V_GPU 442/498 POM_5V_CPU 442/546

For functionality like auto exposure, debayering, denoising, etc., we were planning to use Argus.
Running the argus-camera sample application and selecting multiSession with 4 cameras uses half of all CPU resources:

RAM 1608/3956MB (lfb 418x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [49%@1479,70%@1479,48%@1479,50%@1479] EMC_FREQ 17%@1600 GR3D_FREQ 32%@230 VIC_FREQ 31%@140 APE 25 PLL@30.5C CPU@32.5C PMIC@100C GPU@28.5C AO@38.5C thermal@30.25C POM_5V_IN 4340/4313 POM_5V_GPU 200/186 POM_5V_CPU 1446/1392

I have tried everything I could think of:

  • Using a cuEGLStream consumer
  • Using an EGL renderer consumer
  • Using Buffer streams instead of EGL streams
  • Using a single session with multiple streams
  • Using multi-session with a single stream

All of them end up using about 50% of all 4 CPUs.

Because our original bare V4L2 path was working well, I decided to try the new V4L2 Argus path exposed in JetPack 4.5, but with limited success and the following issues:

  • The YUV capture format is only available when opening the device through v4l2_open calls, not the regular fd open calls. This is not a problem.
  • However, opening a second camera via v4l2_open does not work; it looks like something is hogging resources when the first camera is opened. Even when the O_NONBLOCK flag is passed to v4l2_open, the following error is returned:

v4l2_open(/dev/video1, 00000002)
(NvCameraUtils) Error InvalidState: Mutex already initialized (in Mutex.cpp, function initialize(), line 41)
(Argus) Error InvalidState: (propagating from src/rpc/socket/client/ClientSocketManager.cpp, function open(), line 54)
(Argus) Error InvalidState: (propagating from src/rpc/socket/client/SocketClientDispatch.cpp, function openSocketConnection(), line 258)
(Argus) Error InvalidState: Cannot create camera provider (in src/rpc/socket/client/SocketClientDispatch.cpp, function createCameraProvider(), line 102)
ArgusV4L2_Open failed: Invalid argument
1614068772:210:540 Error : Device not support V4L2_CAP_VIDEO_CAPTURE_MPLANE

  • CPU usage for just a single camera is already twice that of our original v4l app using 4 cameras
    RAM 1565/3956MB (lfb 385x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [20%@1479,17%@1479,20%@1479,20%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 31%@153 VIC_FREQ 0%@140 APE 25 PLL@28.5C CPU@30C PMIC@100C GPU@27C AO@37C thermal@28.75C POM_5V_IN 3019/3099 POM_5V_GPU 120/133 POM_5V_CPU 724/77

  • Another strange thing: with the Argus V4L2 pipeline, the timestamps in the buffers are in milliseconds and start at 0 for the first frame (so not system time, which makes it hard to sync with other types of sensors), and the sequence member of the v4l2 buffer does not increment.

One more thing in general: I have noticed that in a single-session capture with multiple streams, the metadata returned for all streams is a duplicate of the first stream’s. Even when the output stream is set to TYPE_BUFFER and buffers are captured at the consumers, the EGL image data inside the buffers differs per stream, but when getting the IBuffer interface and querying its metadata, you always get the metadata of the first stream, even though the buffer belongs to another stream. The only workaround I have found is using multiSession capture.

There have been posts about high CPU usage on the ISP for the Jetson TX2/Xavier. But since the TX1/Nano have a different ISP pipeline, I posted this in the Jetson Nano category.

Can somebody please verify that 50% usage on all four CPU cores for capturing four 1 MP streams is what can be expected from the Nano’s ISP performance?

Best regards,

hello Nico,

it’s correct that a single session duplicates the first stream’s settings to the others. that is usually the use-case of frame synchronization with a single session per multi-stream;
hence, you should use multi-session if you’re going to execute sensor operations individually.

regarding CPU usage for the multiple-camera use-case: because the Argus camera uses EGL streams for rendering the frames to the display, it’ll consume more CPU than the standard v4l2 utility.
tegrastats shows the instantaneous CPU usage; you may boost the CPU clocks to maximum to collect correct usage reports.

besides tegrastats,
you may use the top command. by default it is in Irix mode, which expresses usage as a percentage of a single CPU; you may toggle Irix/Solaris modes with the ‘I’ interactive command. Solaris mode divides by the number of CPUs and shows the average CPU usage across all cores.

Hi @JerryChang,

Thanks for the follow-up. If using multiSession is required to get distinct timestamps in a frame-synchronized camera setup, that’s not a problem for me, as long as the performance is good.

I have adapted the syncSensor Argus sample to perform multiSession capturing and removed all processing (main.cpp (16.6 KB)). It just acquires frames and releases them again: no creation of CUDA surfaces, no histogram computation, no rendering. Although this reduced the CPU usage to some degree, the difference between our original RAW10 + CUDA processing pipeline and the Argus pipeline remains huge.

Here are my timing results with Irix mode ON/OFF. They show that, regardless of Irix mode, the Argus pipeline consumes 6 to 7 times the amount of CPU, even with NV Power Mode: MAXN.

Custom V4L pipeline Irix mode ON

12982 nico 20 0 8585216 101352 91064 S 12.5 2.5 0:16.47 camera_recorder

Custom V4L pipeline Irix mode OFF

12982 nico 20 0 8585216 101352 91064 S 3.1 2.5 0:28.57 camera_recorder

Argus pipeline Irix mode ON

4779 root 20 0 10.697g 485940 38508 S 78.3 12.0 3:53.09 nvargus-daemon
13235 nico 20 0 9007432 40564 25064 S 10.3 1.0 0:01.60 argus_syncsenso

Argus pipeline Irix mode OFF

4779 root 20 0 10.698g 484688 38824 S 19.7 12.0 3:45.42 nvargus-daemon
13235 nico 20 0 9007432 40564 25064 S 2.6 1.0 0:00.59 argus_syncsenso

Best regards,

hello Nico,

please configure Irix mode OFF to determine the actual CPU usage.
Argus does consume more CPU than the standard v4l2 utility, as I mentioned previously.

may I know…
how many cameras are you going to enable for your use-case?
what are your criteria for CPU usage? are there any issues with your use-case?

Hi Jerry,

I assume you’ve already reviewed [Table 8-3. CSI Configuration] from the Jetson Nano Product Design Guide.
could you please also check whether you see the same CPU usage reports with Applications Using V4L2 IOCTL Directly.

Yes I did; all CSI functionality is working fine and optimally. Our custom v4l2 implementation, taking in RG10 and doing our own debayering and filtering in custom CUDA kernels, works very well, but it takes GPU time away from other algorithms.
The ISP is supposed to relieve the CPU, not do the opposite, as is the case now.

Running the following command:

v4l2-ctl --set-fmt-video=width=1280,height=800,pixelformat=RG10 --stream-mmap --stream-count=1000 -d /dev/video0

barely takes up any CPU: usage is about 1-3% of a single CPU core.
This corresponds to our current custom v4l2 pipeline, which also uses V4L2 ioctl directly with the ISP disabled.

It’s from the moment we activate the ISP, either through libargus or the extended V4L2 Argus pipeline from JetPack 4.5, that the CPU usage increases to no less than 6 to 7 times the usage without the ISP!

You can test it yourself by comparing the CPU usage from the above command with the code in jetson_multimedia_api/samples/unittest_samples/camera_unit_sample

And yes, I have tried removing the display creation and GL rendering, and doing nothing at all with the acquired frames, so that the only difference from the above command is the ISP processing being enabled.

Best regards,

hello Nico,

we have also evaluated the CPU usage with JetPack-4.5 / Jetson Nano / 2x IMX219.
it’s running sensor mode 3264x2464@21fps for camera preview; application: argus_camera or 13_multi_camera.
we’re using top and switching Irix mode off with “I” (shift+i) to check the average CPU usage.
the results show ~4% CPU usage for a single camera, and ~8% for dual cameras.

may I know your sensor resolution and frame rates?

Hi Jerry,

Today I double checked it on a freshly installed board. Here are the steps I followed:

  • unbox a brand-new Nano developer kit and insert a blank SD card
  • attach one IMX219 RPi camera (I do not have a second one available)
  • download the latest SDK Manager and install everything
  • log in to the board and issue the following commands to upgrade and switch to headless mode:

sudo apt update
sudo apt upgrade
sudo apt install v4l-utils htop
sudo systemctl disable gdm.service
sudo systemctl set-default multi-user.target
sudo reboot

  • log in remotely and run htop

The sum of CPU usage on all cores is about 20% in idle mode (equal to 5% in Solaris/non-Irix mode, because the Nano has 4 CPU cores).

  • run the following command

v4l2-ctl --set-fmt-video=width=3264,height=2464,pixelformat=RG10 --stream-mmap --stream-count=1000 -d /dev/video0

  • the v4l2-ctl process barely shows up in htop while executing: the sum of all 4 CPU usages remains about 20%

  • Now run the following command from camera_unit_sample with rendering disabled because of headless mode:

./camera_sample -nf 1000 -nr

  • The sum of all CPU usages now equals about 40% with nvargus-daemon showing up in the process list taking up 20%+ in htop (equal to an extra 5% in Solaris/Non-Irix mode)

So the measurements look correct and increase linearly with camera count, as you have also found yourself.

So, to come back to my original use case with 4 cameras at 1280x800, RG10, 30 fps:

1 camera == 5% Solaris/Non-Irix mode == 20% of 1 CPU
2 cameras == 10% Solaris/Non-Irix mode == 40% of 1 CPU
3 cameras == 15% Solaris/Non-Irix mode == 60% of 1 CPU
4 cameras == 20% Solaris/Non-Irix mode == 80% of 1 CPU

This means the equivalent of almost an entire CPU is being used just to take in the images in the 4-camera case.
From all my experiments I have found that, regardless of how you measure performance, be it with htop, tegrastats, or top in Irix or non-Irix mode: if you use the same measurement tool for both the v4l2 ioctl pipeline and the Argus one, Argus takes up a whole lot more CPU.

I can take in the 4 raw camera streams using regular v4l ioctl calls without any CPU breaking a sweat, but when enabling the ISP, it looks like I have to give up the equivalent of 80% of a CPU in processing power.

Best regards,

hello Nico,

we got identical results: Argus (i.e. an ISP-involved process) takes more CPU than the standard v4l2 I/O controls.
you should check the average CPU usage for analysis;
it takes ~20% CPU usage for your use-case, since there are four CPUs on the Jetson Nano.

please consider other Jetson products, such as the Jetson TX2 series or the Jetson Xavier series, if you’re looking for a more powerful platform.

Hi Jerry,

Thanks a lot for verifying; at least both our results seem to indicate the same thing.
If NVIDIA can sell the TX2 NX at the same price as the Nano, I’m sure our client wouldn’t mind us testing that one :)
I’ll add these final results to our target platform selection spreadsheet.

Thanks a lot for your support.
Wishing you all the best,