Jetson Nano CPU load average for CSI video recording unusually high

Hello everyone,
I have recently noticed that the system load average while capturing frames on the Jetson Nano over CSI is unusually high (individual CPU load tends to remain low).
To rule out something specific to my design, I ran a few tests on a Jetson Nano Devkit rev a02 with a single imx219 and JetPack 4.4 (r32.4.2).

My first test was to capture video and send it nowhere using v4l2-ctl.
The commands for this test are the following:
v4l2-ctl -d /dev/video0 -c sensor_mode=2 --set-fmt-video=width=1920,height=1080,pixelformat="RG10"
v4l2-ctl -d /dev/video0 -c gain=170 -c override_enable=1 -c bypass_mode=0 -c exposure=33333 -c frame_rate=30000000
v4l2-ctl -d /dev/video0 --stream-skip=100 --stream-count=120 --stream-mmap

For this test, the system load stabilizes around 1 after a while (more than 15 minutes).

For the second test, I decided to use gstreamer with nvarguscamerasrc.
The command for this test is the following:
gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), format=NV12, width=1920, height=1080, framerate=30/1' ! fakesink

For this test, the system load is a more moderate 0.45, with some spikes.

As a third test, we made v4l2-ctl capture a single video stream on our custom designed board and the load once again stabilized around 1.
As a fourth and final test, we made v4l2-ctl capture two video streams on two different cameras using the same custom board, with the load climbing to 2.
From the last two tests, it looks like the load scales linearly with the number of cameras used.

The problem is, this product is supposed to support four CSI cameras and still be able to do something with them.
At best, this looks like a bug (at least on v4l2) and I’m really surprised no one seems to be bothered.
At worst, this means I might have to throw away all the time spent working on the Jetson Nano and lose a lot of potential clients.

I hope there is some kind of solution for this issue.

Best Regards,
Juan Pablo.

hello juan.tettamanti,

it looks like you’re using the top command to gather the usage.
please note that top runs in Irix mode by default, where a task’s usage is expressed as a percentage of total CPU time.

please refer to https://linux.die.net/man/1/top; you may switch Irix mode off (Solaris mode) by pressing “I” (shift + i), in which case a task’s usage is divided by the total number of CPUs. the resulting average CPU usage should look more reasonable.
thanks
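
A minimal sketch of the arithmetic behind the two top modes, assuming the four CPU cores of the Jetson Nano. The function name is made up for illustration and is not part of top.

```python
def solaris_mode_percent(irix_percent, cpu_count):
    """Convert a per-task %CPU value from Irix mode to Solaris mode.

    In Solaris mode top divides the task's usage by the number of CPUs.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return irix_percent / cpu_count


if __name__ == "__main__":
    # Hypothetical task that keeps one of four cores fully busy.
    print(solaris_mode_percent(100.0, 4))
```

A task shown at 100% in Irix mode would be shown at 25% in Solaris mode on a four-core system.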

Hello @JerryChang,
So, I spent quite a while looking at the forum and the kernel code, using ftrace and a bunch of other things.
After quite some time, I remembered that processes in the “uninterruptible sleep” state are counted in the load average (or so I had heard, and I confirmed that from a few different sources, including the original commit message for that).

All my tracing had already told me that there was a really nice function in vi2_fops.c called tegra_channel_kthread_capture_start running a while loop.
This function was called every time my userspace program went to sleep (select) waiting for a new frame.
Inside this function there was a call to nvhost_syncpt_wait_timeout_ext, which is located in nvhost_syncpt.c.
At the end of all these function calls, you can find a wait_event_timeout, which changes the kthread state to “uninterruptible sleep”.

Since ftrace had already confirmed that almost all the time spent in select() closely matched this wait_event_timeout() / nvhost_syncpt_wait_timeout_ext(), and I had also confirmed with top that this kthread was in the DW (D = uninterruptible sleep) state, I decided to make a very simple change.
I modified the code so that it would enter the interruptible sleep mode and it worked the same, except now the average system load was almost zero.
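
As a rough illustration of the check described above (a sketch, not the procedure used in this thread, which relied on top and ftrace): on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a single kernel thread parked in D state can hold the load near 1 while the CPUs stay idle. The Python sketch below reads /proc/loadavg and lists the tasks currently in state D. The helper names are made up for this example; only the /proc file layout is standard.

```python
import os


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def task_state(stat_text):
    """Return the one-letter state from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, name) for tasks currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat_text = f.read()
        except OSError:
            continue  # the task exited while we were scanning
        if task_state(stat_text) == "D":
            name = stat_text[stat_text.index("(") + 1:stat_text.rindex(")")]
            found.append((int(entry), name))
    return found


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print("load average:", parse_loadavg(f.read()))
    print("tasks in D state:", uninterruptible_tasks())
```

Running this while a capture is in progress should show whether a capture kthread is sitting in D state at the moments the load average is high.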

This issue has given me quite the headache and I think an ISR might have been a better choice.

Since I cannot change the way hardware communicates with the kernel, I’d like to know if, according to Nvidia’s experience, I should expect any problems derived from modifying nvhost_syncpt_wait_timeout_ext() so that it calls nvhost_syncpt_wait_timeout() with “interruptible = true”.

Best Regards,
Juan Pablo Tettamanti.

Here you can see the graph for nvhost_syncpt_wait_timeout()

This one is for the high load average with the vi-output kthread in uninterruptible sleep (DW) state

This one is for the negligible load average with the vi-output kthread in interruptible sleep (SW) state

hello juan.tettamanti,

nvhost_syncpt_wait_timeout_ext() depends on your CSI sensor frame-rate; you should expect the capture thread to issue a single frame capture request roughly every (1/frame-rate) seconds.
moreover,
it’s the VI2 driver that handles the Nano’s CSI cameras. by default it uses a single-thread approach, which waits for the start-of-frame signal and writes the frame into the memory buffer. you may enable the low_latency flag if you would like to reduce the capture latency.
please note that,
with the low_latency flag enabled, the driver checks the end-of-frame signal for buffer writing; the latency of each capture request should still be close to (1/frame-rate) seconds.
for example,

static int tegra_channel_capture_frame()
{
    ...
    /* low_latency selects the multi-thread capture path */
    if (chan->low_latency)
        ret = tegra_channel_capture_frame_multi_thread(chan, buf);
    else
        ret = tegra_channel_capture_frame_single_thread(chan, buf);
    ...
}

Hello @JerryChang,
My question was whether changing interruptible from false to true would lead to unexpected issues (like the wait returning before a frame was available) or not.

Best Regards,
Juan Pablo.

Hello @JerryChang,
Unfortunately my previous change didn’t work as intended and it’s still a huge problem to use multiple cameras simultaneously.

For the time being, my proposed change enabled us to run v4l2-ctl for a single camera for several hours without any apparent impact on the average load.
However, when you try to do the same with two cameras, the load will once again go straight to one and the system will start missing deadlines on other critical tasks.

Strangely, this second issue seems to be dependent on how you time the v4l commands and sometimes it won’t happen at all.

Best Regards,
Juan Pablo.

hello juan.tettamanti,

please check the Camera Design Guide; for the multiple camera use-case, the Nano can support four 2-lane cameras.
may I know what your actual use-case is, and how much bandwidth you expect to reserve? have you referred to post #3 to switch to Solaris mode for checking the usage?

BTW,
it’s expected that gstreamer with nvarguscamerasrc takes more resources.
you may refer to the Camera Architecture Stack; there’s post-processing involved, hence nvarguscamerasrc is expected to take more resources than v4l2 standard controls.
thanks

Hi @JerryChang,
I have checked your documentation several times since we started our hardware design around October 2019 (and also since you originally released the Nano around April 2019).
I have to say that I’ve found your documents to be unclear and slightly misleading on occasions.
They have also changed a lot, making things more complicated.

Right now I’m at a point where I’m seriously considering canceling the plans for 10k units based on Jetson products since I keep running into software related issues.

I plan to use three 4-lane cameras, one connected to each 4-lane port since there are no plans to support virtual channels.
Each camera has been configured to use 800Mbps per lane, with a resolution of 1280x1080 pixels at 30fps and 12bpp.
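
For reference, a back-of-the-envelope check of these numbers. It counts active pixels only and ignores blanking and CSI-2 protocol overhead (both simplifying assumptions), and the function names are made up for this sketch.

```python
def payload_mbps(width, height, fps, bits_per_pixel):
    """Active-pixel data rate of one camera, in megabits per second."""
    return width * height * fps * bits_per_pixel / 1e6


def link_mbps(lanes, mbps_per_lane):
    """Nominal raw capacity of a CSI link, in megabits per second."""
    return lanes * mbps_per_lane


if __name__ == "__main__":
    per_camera = payload_mbps(1280, 1080, 30, 12)
    capacity = link_mbps(4, 800)
    print(f"payload per camera: {per_camera:.3f} Mbps")
    print(f"4-lane link capacity: {capacity} Mbps")
    print(f"three cameras total: {3 * per_camera:.3f} Mbps")
```

Under these assumptions each camera carries roughly 498 Mbps of pixel data, against a nominal 3200 Mbps for a 4-lane link at 800 Mbps per lane.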

On the software side of things, my use case would involve using simple v4l2 operations to capture the video stream from these cameras simultaneously.
So far, I’ve managed to correct a software (kernel) issue where the driver remained in the uninterruptible sleep state while waiting for a frame.
However, when using two cameras simultaneously, there seems to be some interaction where the kernel waits on resources.
Even if userspace is not using the cpu at all, having a high load average usually suggests there is a bug of some sort.

Please note that I always refer to “load average” as opposed to “cpu usage” since the cpu is relatively free.

I don’t have any interest in using libargus (either directly or with gstreamer), since it requires adding X11, mesa-3d, opengl and some other heavy files / libraries to my (smaller than 300MB) custom image which I have to update using a cell connection in the middle of nowhere.

Best Regards,
Juan Pablo.

Hi @JerryChang,
I decided to attach a relatively simple test program I made recently.

This one will succeed when using one device.
It will also succeed with multiple usb devices on my computer.

However, when using two csi cameras on the Nano, the second one will return with " video4linux video1: frame start syncpt timeout!"
The error clearly comes from vi2_fops.c yet I’m not sure how to prevent this.

You can compile this with g++ v4l2-test.cpp -o v4l2-test -lv4l2 -lv4lconvert -lpthread

Best Regards,
Juan Pablo.

v4l2-test.cpp (7.8 KB)

Hi @JerryChang,
The main reason for mentioning gstreamer was to point out that I found its behavior more reasonable on the nano and there could be a clue there regarding how to fix my current issue.

Best Regards,
Juan Pablo.

hello juan.tettamanti

thanks for sharing the test app to reproduce the issue, we’ll set up an environment on Nano for investigation.

BTW,
I think you’re on Jetpack 4.4-DP since your l4t release was l4t-r32.4.2; please refer to the JetPack Archive for details.
could you please confirm your JetPack release version?
thanks

Hello @JerryChang ,
For the time being, this has been tested on Jetpack 4.4-DP r32.4.2.
I’d expect this to happen on other versions, but I haven’t checked yet.

Best Regards,
Juan Pablo.

hello juan.tettamanti

FYI,
we’re able to reproduce the issue locally with two IMX219 on Nano / l4t-r32.4.3.
it failed to access the second CSI camera on Nano for the dual camera use-case with v4l2 standard controls;
however, dual-camera preview works normally with argus_camera/multi-session mode.

we’re tracking this in the internal bug system, will update the conclusion later.
thanks

hello juan.tettamanti

I’ve tested locally and am able to enable dual camera streaming with v4l2 standard controls.
could you please try the commands below?
for example,

$ v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000
$ v4l2-ctl -d /dev/video1 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl=sensor_mode=0 --stream-mmap --stream-count=1000

here’s a reference showing that dual camera streaming works.

Hi @JerryChang,
After some tests and a suggestion from a coworker, things are working better now.

Changing the driver code to use interruptible sleep fixed the initial load issue with a single camera.
Fixing a seemingly unrelated issue where the frame rate from the dts was not matching the real frame rate from the device removed the load issue with multiple cameras.
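
One way to spot this kind of mismatch is to compare the frame rate implied by the capture timestamps with the one declared in the device tree. The sketch below assumes the timestamps (in seconds) have already been collected, for example from the timestamp field of each dequeued v4l2_buffer. The function names and the 2% tolerance are arbitrary choices for illustration, not values from this thread.

```python
def measured_fps(timestamps_s):
    """Average frame rate over a list of increasing frame timestamps (seconds)."""
    if len(timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps_s[-1] - timestamps_s[0]
    if span <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps delimit N - 1 frame intervals.
    return (len(timestamps_s) - 1) / span


def fps_mismatch(timestamps_s, configured_fps, tolerance=0.02):
    """True if the measured rate deviates from the configured one by more than tolerance."""
    fps = measured_fps(timestamps_s)
    return abs(fps - configured_fps) / configured_fps > tolerance


if __name__ == "__main__":
    # Hypothetical sensor that really delivers 25 fps while 30 fps is configured.
    stamps = [i / 25 for i in range(26)]
    print(measured_fps(stamps), fps_mismatch(stamps, 30))
```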

The multi-threaded test code still won’t work, but that should be easier to fix ourselves.

Best Regards,
Juan Pablo.

hello juan.tettamanti,

FYI,
you might review the Sensor Pixel Clock property settings for your frame-rate issue.