Jetson Nano CPU load average for CSI video recording unusually high

juan.tettamanti · July 19, 2020, 11:15pm

Hello everyone,
I have recently noticed that the system load average while capturing frames on the Jetson Nano over CSI is unusually high (individual CPU load tends to remain low).
In order to rule out if this was something specific to my design, I ran a few tests on a Jetson Nano Devkit rev a02 with a single imx219 and Jetpack 4.4 (r32.4.2).

My first test was to capture video and send it nowhere using v4l2-ctl.
The commands for this test are the following:
v4l2-ctl -d /dev/video0 -c sensor_mode=2 --set-fmt-video=width=1920,height=1080,pixelformat="RG10"
v4l2-ctl -d /dev/video0 -c gain=170 -c override_enable=1 -c bypass_mode=0 -c exposure=33333 -c frame_rate=30000000
v4l2-ctl -d /dev/video0 --stream-skip=100 --stream-count=120 --stream-mmap

For this test, the system load stabilizes around 1 after a while (more than 15 minutes).

For the second test, I decided to use gstreamer with nvarguscamerasrc.
The command for this one is the following:
gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), format=NV12, width=1920, height=1080, framerate=30/1' ! fakesink

For this test, the system load is a more conservative 0.45, with some spikes.

As a third test, we made v4l2-ctl capture a single video stream on our custom designed board and the load once again stabilized around 1.
As a fourth and final test, we made v4l2-ctl capture two video streams on two different cameras using the same custom board, with the load climbing to 2.
From the last two tests, it looks like the load scales linearly with the number of cameras used.

The problem is, this product is supposed to support four CSI cameras and still be able to do something with them.
At best, this looks like a bug (at least on v4l2) and I’m really surprised no one seems to be bothered.
At worst, this means I might have to throw away all the time spent working on the Jetson Nano and lose a lot of potential clients.

I hope there is some kind of solution for this issue.

Best Regards,
Juan Pablo.

JerryChang · July 20, 2020, 6:57am

hello juan.tettamanti,

It looks like you’re using the top command to gather the usage.
Please note that top runs in Irix mode by default, where a task’s usage is expressed as a percentage of total CPU time.

Please refer to https://linux.die.net/man/1/top. You can switch Irix mode off (Solaris mode) by pressing “I” (Shift + i), so that a task’s usage is divided by the total number of CPUs; the resulting average CPU usage should be more reasonable.
thanks
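
To make the difference between the two top modes concrete, here is a minimal Python sketch of the rescaling the man page describes. The function name and the sample reading are made up for illustration; only the divide-by-CPU-count rule comes from the man page.

```python
def solaris_mode_percent(irix_percent: float, n_cpus: int) -> float:
    """Convert a per-task %CPU value from Irix mode to Solaris mode.

    In Solaris mode, top divides a task's usage by the total number of CPUs.
    """
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return irix_percent / n_cpus


if __name__ == "__main__":
    # Hypothetical reading: a task shown at 100% in Irix mode on a 4-core Nano.
    print(solaris_mode_percent(100.0, 4))
```

This only rescales the per-task %CPU column; the load average line that top prints is not affected by the mode.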

juan.tettamanti · July 27, 2020, 8:42pm

Hello @JerryChang,
So, I spent quite a while looking at the forum, the kernel code, using ftrace and a bunch of other things.
After quite some time, I remembered that processes in the “uninterruptible sleep” state are counted in the load average (or so I had heard, and I confirmed that with a few different sources, including the original commit message for that).
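
For anyone who wants to check this on their own board, the following is a minimal Python sketch that reads the load average and lists every task currently in the D (uninterruptible sleep) state. It assumes a Linux /proc filesystem; the function names are illustrative and not part of any NVIDIA tooling.

```python
import glob
import os


def parse_loadavg(text: str) -> tuple:
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def parse_stat_state(stat_line: str) -> tuple:
    """Return (comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace.
    """
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    comm = stat_line[start + 1:end]
    state = stat_line[end + 1:].split()[0]
    return comm, state


def tasks_in_state(wanted: str = "D") -> list:
    """List the names of all tasks (threads included) currently in `wanted` state."""
    names = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were scanning
        if state == wanted:
            names.append(comm)
    return names


if __name__ == "__main__":
    if os.path.exists("/proc/loadavg"):
        with open("/proc/loadavg") as f:
            print("load average:", parse_loadavg(f.read()))
        print("uninterruptible tasks:", tasks_in_state("D"))
```

Running it while a capture is in progress should show whether a capture kthread sits in D state for most samples, which is what keeps the load average near 1 per stream even when the CPUs are idle.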

All my tracing had already told me that there was a really nice function in vi2_fops.c called tegra_channel_kthread_capture_start running a while loop.
This function was called every time my userspace program went to sleep (select) waiting for a new frame.
Inside this function there was a call to nvhost_syncpt_wait_timeout_ext, which is located in nvhost_syncpt.c.
At the end of all these function calls, you can find a wait_event_timeout which changes the kthread state to “uninterruptible sleep”.

Since ftrace already confirmed almost all the time spent in select() closely matched this wait_event_timeout() / nvhost_syncpt_wait_timeout_ext() and I had also confirmed with top that this kthread was in DW (D = uninterruptible sleep) state, I decided to make a very simple change.
I modified the code so that it would enter the interruptible sleep mode and it worked the same, except now the average system load was almost zero.

This issue has given me quite the headache and I think an ISR might have been a better choice.

Since I cannot change the way hardware communicates with the kernel, I’d like to know if, according to Nvidia’s experience, I should expect any problems derived from modifying nvhost_syncpt_wait_timeout_ext() so that it calls nvhost_syncpt_wait_timeout() with “interruptible = true”.

Best Regards,
Juan Pablo Tettamanti.

Here you can see the graph for nvhost_syncpt_wait_timeout()

This one is for the high load average with the vi-output kthread in uninterruptible sleep (DW) state

This one is for the negligible load average with the vi-output kthread in interruptible sleep (SW) state

JerryChang · July 28, 2020, 3:47am

hello juan.tettamanti,

The time spent in nvhost_syncpt_wait_timeout_ext() depends on your CSI sensor frame rate; you should expect the capture thread to issue a single frame capture request about every (1/frame-rate) second.
Moreover, it’s the VI2 driver that handles the Nano’s CSI cameras. By default it uses a single-thread approach, which waits for the start-of-frame signal and then writes the frame into the memory buffer. You may enable the low_latency flag if you would like to reduce the capture latency.
Please note that with the low_latency flag enabled, the driver checks the end-of-frame signal for buffer writing, so the latency of each capture request should still be close to (1/frame-rate) second.
For example,

static int tegra_channel_capture_frame()
{
        ...
        if (chan->low_latency)
                ret = tegra_channel_capture_frame_multi_thread(chan, buf);
        else
                ret = tegra_channel_capture_frame_single_thread(chan, buf);
        ...
}

juan.tettamanti · July 29, 2020, 3:23pm

Hello @JerryChang,
My question was whether changing interruptible from false to true would lead to unexpected issues (like the wait returning before a frame was available) or not.

Best Regards,
Juan Pablo.

juan.tettamanti · July 29, 2020, 5:00pm

Hello @JerryChang,
Unfortunately my previous change didn’t work as intended and it’s still a huge problem to use multiple cameras simultaneously.

For the time being, my proposed change enabled us to run v4l2-ctl for a single camera for several hours without any apparent impact on the average load.
However, when you try to do the same with two cameras, the load will once again go straight to one and the system will start missing deadlines on other critical tasks.

Strangely, this second issue seems to be dependent on how you time the v4l commands and sometimes it won’t happen at all.

Best Regards,
Juan Pablo.

JerryChang · July 30, 2020, 7:39am

hello juan.tettamanti,

Please check the Camera Design Guide; for multi-camera use cases, the Nano can support four 2-lane cameras.
May I know your actual use case, and how much bandwidth you expect to reserve? Have you tried switching to Solaris mode, as suggested in post #3, to check the usage?

BTW,
it’s expected that gstreamer with nvarguscamerasrc takes more resources.
Please refer to the Camera Architecture Stack: there is post-processing involved, so nvarguscamerasrc is expected to take more resources than v4l2 standard controls.
thanks

juan.tettamanti · July 30, 2020, 10:06pm

Hi @JerryChang,
I have checked your documentation several times since we started our hardware design around October 2019 (and also since you originally released the Nano around April 2019).
I have to say that I’ve found your documents to be unclear and slightly misleading on occasions.
They have also changed a lot, making things more complicated.

Right now I’m at a point where I’m seriously considering canceling the plans for 10k units based on Jetson products since I keep running into software related issues.

I plan to use 3 4-lane cameras, one connected to each 4-lane port since there are no plans to support virtual channels.
Each camera has been configured to use 800Mbps per lane, with a resolution of 1280x1080 pixels at 30fps and 12bpp.
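
As a rough sanity check on these numbers, here is a small Python sketch that compares the pixel payload of one stream with the raw capacity of its 4-lane link. It deliberately ignores blanking and CSI protocol overhead, so treat the result as a lower bound on what the link has to carry; the function names are illustrative.

```python
def payload_mbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Active-pixel payload of one stream in Mbps (no blanking, no protocol overhead)."""
    return width * height * fps * bits_per_pixel / 1e6


def link_mbps(lanes: int, mbps_per_lane: float) -> float:
    """Raw capacity of one CSI link in Mbps."""
    return lanes * mbps_per_lane


if __name__ == "__main__":
    per_camera = payload_mbps(1280, 1080, 30, 12)
    capacity = link_mbps(4, 800)
    print(f"payload per camera: {per_camera:.1f} Mbps")
    print(f"link capacity:      {capacity:.1f} Mbps")
    print(f"three cameras:      {3 * per_camera:.1f} Mbps total payload")
```

With the figures above, each camera carries roughly 498 Mbps of pixel data over a link configured for 3200 Mbps.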

On the software side of things, my use case would involve using simple v4l2 operations to capture the video stream from these cameras simultaneously.
So far, I’ve managed to correct a software (kernel) issue where the driver remained in the uninterruptible sleep state while waiting for a frame.
However, when using two cameras simultaneously, it seems like there is some interaction where the kernel waits on resources.
Even if userspace is not using the cpu at all, having a high load average usually suggests there is a bug of some sort.

Please note that I always refer to “load average” as opposed to “cpu usage” since the cpu is relatively free.

I don’t have any interest in using libargus (either directly or with gstreamer), since it requires adding X11, mesa-3d, opengl and some other heavy files / libraries to my (smaller than 300MB) custom image which I have to update using a cell connection in the middle of nowhere.

Best Regards,
Juan Pablo.

juan.tettamanti · July 30, 2020, 10:18pm

Hi @JerryChang,
I decided to attach a relatively simple test program I made recently.

This one will succeed when using one device.
It will also succeed with multiple USB devices on my computer.

However, when using two CSI cameras on the Nano, the second one will return with "video4linux video1: frame start syncpt timeout!"
The error clearly comes from vi2_fops.c yet I’m not sure how to prevent this.

You can compile this with g++ v4l2-test.cpp -o v4l2-test -lv4l2 -lv4lconvert -lpthread

Best Regards,
Juan Pablo.

v4l2-test.cpp (7.8 KB)

juan.tettamanti · July 31, 2020, 2:23am

Hi @JerryChang,
The main reason for mentioning gstreamer was to point out that I found its behavior more reasonable on the nano and there could be a clue there regarding how to fix my current issue.

Best Regards,
Juan Pablo.

JerryChang · July 31, 2020, 2:54am

hello juan.tettamanti

Thanks for sharing the test app to reproduce the issue; we’ll set up an environment on the Nano for investigation.

BTW,
I think you’re on JetPack 4.4-DP, since your l4t release is l4t-r32.4.2; please refer to the JetPack Archive for details.
Could you please confirm your JetPack release version?
thanks

juan.tettamanti · July 31, 2020, 12:33pm

Hello @JerryChang ,
For the time being, this has been tested on Jetpack 4.4-DP r32.4.2.
I’d expect this to happen on other versions but haven’t checked yet.

Best Regards,
Juan Pablo.

JerryChang · August 3, 2020, 2:06am

hello juan.tettamanti

FYI,
we’re able to reproduce the issue locally with two IMX219 cameras on Nano / l4t-r32.4.3.
Access to the second CSI camera fails in the dual-camera use case with v4l2 standard controls;
however, dual-camera preview works normally with argus_camera/multi-session mode.

we’re tracking this in the internal bug system, will update the conclusion later.
thanks

JerryChang · August 3, 2020, 4:45am

hello juan.tettamanti

I’ve tested locally and am able to enable dual-camera streaming with v4l2 standard controls.
Could you please try the commands below?
For example,

$ v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000
$ v4l2-ctl -d /dev/video1 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000

Here’s a reference showing that dual-camera streaming works.

juan.tettamanti · August 6, 2020, 2:56pm

Hi @JerryChang,
After some tests and a suggestion from a coworker, things are working better now.

Changing the driver code to use interruptible sleep fixed the initial load issue with a single camera.
Fixing a seemingly unrelated issue where the frame rate from the dts was not matching the real frame rate from the device removed the load issue with multiple cameras.

The multi-threaded test code still won’t work, but that should be easier to fix ourselves.

Best Regards,
Juan Pablo.

JerryChang · August 7, 2020, 7:20am

hello juan.tettamanti,

FYI,
you might review Sensor Pixel Clock property settings for your frame-rate issue.
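
As a hedged illustration of why the pixel clock matters for frame rate, here is a short Python sketch. It assumes two relationships that, as far as I understand the Sensor Pixel Clock documentation and the reference sensor drivers, are the ones in use: the pixel clock can be estimated as lane rate × lane count / bits per pixel, and the frame rate is the pixel clock divided by (line_length × frame_length). The timing values in the example are hypothetical and must be replaced with the ones from your sensor mode table.

```python
def pixel_clock_hz(lane_rate_mbps: float, lanes: int, bits_per_pixel: int) -> float:
    """Pixel clock estimated from the CSI lane output rate."""
    return lane_rate_mbps * 1e6 * lanes / bits_per_pixel


def frame_rate(pix_clk_hz: float, line_length: int, frame_length: int) -> float:
    """Frames per second for a given pixel clock and total frame size.

    line_length is in pixels and frame_length in lines, blanking included.
    """
    return pix_clk_hz / (line_length * frame_length)


if __name__ == "__main__":
    # Lane rate, lane count and bit depth quoted earlier in this thread.
    clk = pixel_clock_hz(800, 4, 12)
    print(f"pixel clock: {clk / 1e6:.2f} MHz")

    # HYPOTHETICAL timing values, chosen only to illustrate the arithmetic.
    line_length, frame_length = 4000, 2222
    print(f"frame rate:  {frame_rate(clk, line_length, frame_length):.2f} fps")
```

Comparing the computed value with the frame rate the sensor actually delivers is one way to spot a mismatch between the device tree and the real device.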

bmsp · April 23, 2021, 12:13am

We are also observing very high CPU usage when simply streaming from two CSI2 cameras.

Background

This forum has previously discussed unexplainable and concerningly high CPU loads when using the Nvidia Argus CSI2 Deepstream plugin (nvarguscamerasrc).

This reply aims to provide a concrete example that anyone with a Jetson device (we’re testing on Jetson NX) can run to see the high CPU usage for themselves.

Goal

This post seeks to understand why the overhead is so high, whether it can be reduced, and if so, how. We would also like everyone to be able to quickly run a couple of the same experiments we have and observe the overhead for themselves.

Observing the Overhead

For these experiments we will use up to two IMX219-160 cameras, connected to the Jetson NX developer kit’s CSI2 ribbon cable connectors. You should configure the resolution and framerate in the experiments to values supported by your own cameras.

For a single CSI2 camera, you can run the following:

gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink

To terminate this pipeline you can use CTRL-C in your terminal at any time to send SIGINT.

While this is running, open a new terminal and observe the CPU usage of /usr/sbin/nvargus-daemon. One way to do this is to run htop (installed via sudo apt install htop) and click on the CPU column to sort processes by CPU load. The nvargus-daemon process should appear at or near the top with around 18% CPU load.

An alternative means of viewing this is to run:

top -p `pgrep "nvargus"`
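
If you would rather log the number than watch it, the following Python sketch samples a process's CPU time from /proc twice and turns the difference into a percentage of one CPU, which is the same scale top uses in Irix mode. It assumes a Linux /proc filesystem; the function names are illustrative.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The fields after the ')' that closes comm start at field 3 (state),
    # so utime (field 14) and stime (field 15) sit at indexes 11 and 12.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, ticks_per_s: int) -> float:
    """CPU usage over the interval, as a percentage of one CPU."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_pid(pid: int, interval_s: float = 1.0) -> float:
    """Measure the CPU usage of `pid` (all its threads) over `interval_s` seconds."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)


if __name__ == "__main__":
    # Samples this script's own process so the example runs anywhere with /proc.
    if os.path.exists("/proc/self/stat"):
        print(f"{sample_pid(os.getpid(), 0.2):.1f}% of one CPU")
```

To measure nvargus-daemon instead, pass the PID reported by pgrep to sample_pid().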

For two cameras you can run:

gst-launch-1.0 nvarguscamerasrc sensor-id=1 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink

Observations

For MODE_10W_4CORE, with the clocks set as follows:

Online CPUs: 0-3
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
GPU MinFreq=114750000 MaxFreq=803250000 CurrentFreq=114750000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: PWM=130
NV Power Mode: MODE_10W_4CORE

We observed ~18% CPU utilization for a single camera and up to ~37% with two cameras.

Then we re-ran with the max clock frequency (using sudo jetson_clocks) and got:

SOC family:tegra194  Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-3
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
GPU MinFreq=803250000 MaxFreq=803250000 CurrentFreq=803250000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: PWM=130
NV Power Mode: MODE_10W_4CORE

We saw similar values with the other clock frequency settings.

bmsp · April 26, 2021, 10:06pm

@JerryChang, why do you think the overhead is so high and how can we reduce it?

JerryChang · April 27, 2021, 2:19am

hello bmsp,

Since this topic was filed for the Jetson Nano platform,
I would like to ask you to start a new forum discussion thread for better support.
You may leave the topic-id here for tracking. thanks

bmsp · May 3, 2021, 7:08pm

Thanks Jerry, I’ve continued this topic over here.
