Jetson Nano CPU load average for CSI video recording unusually high

hello juan.tettamanti,

looks like you’re using the top command to gather the usage.
you should also note that, by default, top runs in Irix mode, where a task’s usage is expressed as a percentage of a single CPU’s time.

please refer to https://linux.die.net/man/1/top; you may switch Irix mode off (Solaris mode) by hitting “I” (Shift + i), so that a task’s usage is divided by the total number of CPUs. that average CPU usage should give more reasonable results,
thanks

Hello @JerryChang,
So, I spent quite a while looking at the forum, the kernel code, using ftrace and a bunch other things.
After quite some time, I remembered that processes in the “uninterruptible sleep” state are counted in the load-average metric (something I had heard before, and which I confirmed from several sources, including the original commit message for that behavior).

All my tracing had already told me that there was a really nice function in vi2_fops.c called tegra_channel_kthread_capture_start running a while loop.
This function was called every time my userspace program went to sleep (select) waiting for a new frame.
Inside this function there was a call to nvhost_syncpt_wait_timeout_ext, which is located in nvhost_syncpt.c.
At the end of all these function calls, you can find a wait_event_timeout which puts the kthread into the “uninterruptible sleep” state.

Since ftrace had already confirmed that almost all the time spent in select() closely matched this wait_event_timeout() / nvhost_syncpt_wait_timeout_ext(), and top confirmed that this kthread was in the DW (D = uninterruptible sleep) state, I decided to make a very simple change.
I modified the code so that it enters interruptible sleep instead, and it worked the same, except the system load average is now almost zero.

This issue has given me quite the headache, and I think an ISR might have been a better choice.

Since I cannot change the way the hardware communicates with the kernel, I’d like to know whether, in Nvidia’s experience, I should expect any problems from modifying nvhost_syncpt_wait_timeout_ext() so that it calls nvhost_syncpt_wait_timeout() with “interruptible = true”.

Best Regards,
Juan Pablo Tettamanti.

Here you can see the graph for nvhost_syncpt_wait_timeout()

This one is for the high load average with the vi-output kthread in uninterruptible sleep (DW) state

This one is for the negligible load average with the vi-output kthread in interruptible sleep (SW) state

hello juan.tettamanti,

nvhost_syncpt_wait_timeout_ext() depends on your CSI sensor frame rate; you should expect the capture thread to issue a single frame-capture request roughly every (1/frame-rate) seconds.
moreover,
it’s the VI2 driver that processes the Nano’s CSI cameras. by default it uses a single-threaded approach, which waits for start-of-frame signaling and writes the frame into a memory buffer. you may enable the low_latency flag if you would like to reduce the capture latency.
please note that,
even with the low_latency flag enabled, which checks end-of-frame signaling for buffer writing, the latency of each capture request should still be close to (1/frame-rate) seconds.
for example,

static int tegra_channel_capture_frame(struct tegra_channel *chan,
                                       struct tegra_channel_buffer *buf)
{
        ...
        if (chan->low_latency)
                ret = tegra_channel_capture_frame_multi_thread(chan, buf);
        else
                ret = tegra_channel_capture_frame_single_thread(chan, buf);

        return ret;
}

Hello @JerryChang,
My question was whether changing interruptible from false to true would lead to unexpected issues (like the wait returning before a frame was available) or not.

Best Regards,
Juan Pablo.

Hello @JerryChang,
Unfortunately my previous change didn’t work as intended and it’s still a huge problem to use multiple cameras simultaneously.

For the time being, my proposed change enabled us to run v4l2-ctl for a single camera for several hours without any apparent impact on the average load.
However, when you try to do the same with two cameras, the load will once again go straight to one and the system will start missing deadlines on other critical tasks.

Strangely, this second issue seems to be dependent on how you time the v4l commands and sometimes it won’t happen at all.

Best Regards,
Juan Pablo.

hello juan.tettamanti,

please check the Camera Design Guide; the multiple-camera use case on Nano supports four 2-lane cameras.
may I know your actual use case, and the bandwidth you expect to reserve? also, had you referred to post #3 to switch to Solaris mode when checking the usage?

BTW,
it’s expected that gstreamer with nvarguscamerasrc takes more resources;
you may refer to the Camera Architecture Stack: there is post-processing involved, hence it is expected that nvarguscamerasrc takes more resources than standard v4l2 controls.
thanks

Hi @JerryChang,
I have checked your documentation several times since we started our hardware design around October 2019 (and also since you originally released the Nano around April 2019).
I have to say that I’ve found your documents to be unclear and slightly misleading on occasions.
They have also changed a lot, making things more complicated.

Right now I’m at a point where I’m seriously considering canceling the plans for 10k units based on Jetson products since I keep running into software related issues.

I plan to use 3 4-lane cameras, one connected to each 4-lane port since there are no plans to support virtual channels.
Each camera has been configured to use 800Mbps per lane, with a resolution of 1280x1080 pixels at 30fps and 12bpp.

On the software side of things, my use case would involve using simple v4l2 operations to capture the video stream from these cameras simultaneously.
So far, I’ve managed to correct a software (kernel) issue where the driver remained in the uninterruptible sleep state while waiting for a frame.
However, when using two cameras simultaneously, there seems to be some interaction where the kernel waits on resources.
Even though userspace is hardly using the CPU at all, a high load average usually suggests there is a bug of some sort.

Please note that I always refer to “load average” as opposed to “cpu usage” since the cpu is relatively free.

I don’t have any interest in using libargus (either directly or with gstreamer), since it requires adding X11, mesa-3d, opengl and some other heavy files / libraries to my (smaller than 300MB) custom image which I have to update using a cell connection in the middle of nowhere.

Best Regards,
Juan Pablo.

Hi @JerryChang,
I decided to attach a relatively simple test program I made recently.

This one will succeed when using one device.
It will also succeed with multiple usb devices on my computer.

However, when using two csi cameras on the Nano, the second one will return with “video4linux video1: frame start syncpt timeout!”
The error clearly comes from vi2_fops.c yet I’m not sure how to prevent this.

You can compile this with g++ v4l2-test.cpp -o v4l2-test -lv4l2 -lv4lconvert -lpthread

Best Regards,
Juan Pablo.

v4l2-test.cpp (7.8 KB)

Hi @JerryChang,
The main reason for mentioning gstreamer was to point out that I found its behavior more reasonable on the nano and there could be a clue there regarding how to fix my current issue.

Best Regards,
Juan Pablo.

hello juan.tettamanti

thanks for sharing the test app to reproduce the issue; we’ll set up an environment on a Nano for investigation.

BTW,
I think you’re based on JetPack 4.4-DP, since your L4T release is l4t-r32.4.2; please refer to the JetPack Archive for details.
may I double-confirm your JetPack release version?
thanks

Hello @JerryChang ,
For the time being, this has been tested on JetPack 4.4-DP (r32.4.2).
I’d expect this to happen on other versions as well, but I haven’t checked yet.

Best Regards,
Juan Pablo.

hello juan.tettamanti

FYI,
we’re able to reproduce the issue locally with two IMX219 cameras on Nano / l4t-r32.4.3.
it fails to access the second CSI camera in the dual-camera use case with v4l2 standard controls;
however, dual-camera preview works normally with argus_camera in multi-session mode.

we’re tracking this in the internal bug system, and will update with the conclusion later.
thanks

hello juan.tettamanti

I’ve tested locally and am able to enable dual-camera streaming with v4l2 standard controls.
could you please have a try with the commands below,
for example,

$ v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000
$ v4l2-ctl -d /dev/video1 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000

here’s a reference showing that dual-camera streaming worked.

Hi @JerryChang,
After some tests and a suggestion from a coworker, things are working better now.

Changing the driver code to use interruptible sleep fixed the initial load issue with a single camera.
Fixing a seemingly unrelated issue, where the frame rate in the dts did not match the real frame rate of the device, removed the load issue with multiple cameras.

The multi-threaded test code still won’t work, but that should be easier to fix ourselves.

Best Regards,
Juan Pablo.

hello juan.tettamanti,

FYI,
you might review the Sensor Pixel Clock property settings for your frame-rate issue.

We are also observing very high CPU usage when simply streaming two CSI2 cameras.

Background

This forum has previously discussed unexplained and concerningly high CPU loads when using the Nvidia Argus CSI2 Deepstream plugin (nvarguscamerasrc).

This reply aims to provide a concrete example that anyone with a Jetson device (we’re testing on Jetson NX) can run to see the high CPU usage for themselves.

Goal

This post seeks to understand why the overhead is so high, whether it can be reduced, and if so, how. We would also like everyone to be able to quickly run a couple of the same experiments we did and observe the overhead for themselves.

Observing the Overhead

For these experiments we will use up to two IMX219-160 cameras, connected to the Jetson NX developer kit’s CSI2 ribbon cable connectors. You should configure the resolution and framerate in the experiments to values supported by your own cameras.

For a single CSI2 camera, you can run the following:

gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink

To terminate this pipeline you can use CTRL-C in your terminal at any time to send SIGINT.

While this is running, open a new terminal and observe the CPU usage of /usr/sbin/nvargus-daemon. One way to do this is to run htop (installed via sudo apt install htop) and click on the CPU column to sort processes by CPU load. The nvargus-daemon process should appear at or near the top with around 18% CPU load.

An alternative means of viewing this is to run:

top -p `pgrep "nvargus"`

For two cameras you can run:

gst-launch-1.0 nvarguscamerasrc sensor-id=1 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)3280, height=(int)2464, format=(string)NV12, framerate=(fraction)21/1' ! nvvidconv ! queue ! fakesink

Observations

For the MODE_10W_4CORE power mode, with the clocks set like so:

Online CPUs: 0-3
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=1 c6=1
GPU MinFreq=114750000 MaxFreq=803250000 CurrentFreq=114750000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: PWM=130
NV Power Mode: MODE_10W_4CORE

We observed ~18% CPU utilization for a single camera and up to ~37% with two cameras.

Then we re-ran with the max clock frequency (using sudo jetson_clocks) and got:

SOC family:tegra194  Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-3
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1190400 CurrentFreq=1190400 IdleStates: C1=0 c6=0
GPU MinFreq=803250000 MaxFreq=803250000 CurrentFreq=803250000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: PWM=130
NV Power Mode: MODE_10W_4CORE

We saw similar values with the other clock frequency settings.

@JerryChang, why do you think the overhead is so high and how can we reduce it?

hello bmsp,

since this topic was filed for the Jetson Nano platform,
I would like to ask you to start a new forum discussion thread for better support.
you may leave the topic ID here for tracking. thanks

Thanks Jerry, I’ve continued this topic over here.