GStreamer lag increases with frame rate when recording from Raspberry Pi camera (nvarguscamerasrc and nvivafilter)

I have noticed a massive lag when processing Raspberry Pi camera frames with CUDA. This makes it almost impossible to control my drone.

This is the gstreamer pipeline I use in my program:

nvarguscamerasrc exposurecompensation=1 gainrange='8 16' ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720, format=(string)NV12, framerate=(fraction)30/1 ! nvivafilter cuda-process=true customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=(string)NV12 ! omxh264enc ! qtmux ! filesink location=video.mov

Basically it is a linear pipeline that passes the video through nvivafilter so that I can analyse it in CUDA, and finally it gets saved to disk as H264 video.

I have a test that flashes a light and writes some simple graphics to the current video frame. I am then able to look at the saved video and measure the frame delay between the graphics appearing and the flash.
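
For context, the flash trigger itself is just a GPIO write; a minimal sketch, assuming sysfs GPIO (the pin number 216 here is hypothetical):

$ echo 216 > /sys/class/gpio/export              # run as root
$ echo out > /sys/class/gpio/gpio216/direction
$ echo 1 > /sys/class/gpio/gpio216/value         # LED on; record the timestamp now
$ echo 0 > /sys/class/gpio/gpio216/value         # LED off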

This is the delay I see at different frame rates:

120ms @ 73fps (target frame rate was 120fps)
50ms @ 60fps
33ms @ 30fps

There is clearly some buffering happening somewhere. It seems perverse that I am having to run my algorithm at the lowest frame rate to get the lowest latency. :-S

Please could someone tell me how to reduce the latency at the highest frame rate?

Thanks!

Is it the same with the following?

nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=30/1 ! nvivafilter cuda-process=true pre-process=false post-process=false customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=NV12 ! nvv4l2h264enc ! h264parse ! qtmux ! filesink location=video.mov

Hi, thanks for the reply and apologies for my delay in replying.

I see that your pipeline changes two things: it uses nvv4l2h264enc instead of omxh264enc, and it disables pre/post-processing.

I tested these changes separately and together, measured the lag across 5 separate runs, and averaged the results.

The results for the delay in milliseconds between reality and the frame becoming available in the nvivafilter CUDA code are in this table:

It seems that neither the encoder type nor disabling pre/post-processing made any difference. I’d like to add that there was some variation in the delays between runs; I would say about 25% variation around the mean.

The table makes it very clear that increasing FPS increases the lag.

I welcome any other suggestions to remedy this lag. nvivafilter does not seem to have any “buffered-frames”-type property to set.
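
(To confirm this, the element’s properties can be listed with gst-inspect:

$ gst-inspect-1.0 nvivafilter

and nothing buffer-related appears there, as far as I can tell.)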

Thanks.

I noticed that my CPU core utilisation is as follows:

16% 100% 14% 13%
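
For reference, per-core load like this can be watched with tegrastats, which prints one percentage per CPU core (the interval is in milliseconds):

$ sudo tegrastats --interval 1000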

What if the nvarguscamerasrc element takes more CPU the higher the frame rate? Could this increase the latency?

AFAIK it is common for Argus to take some CPU time (at 120 fps it may take one core to 100%; maybe more on a Nano).

Did you boost the clocks with the jetson_clocks script?
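
For example (jetson_clocks ships with L4T; --show only prints the current state without changing anything):

$ sudo jetson_clocks --show
$ sudo jetson_clocks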

Also try boosting the clocks for VI/VIC/NVENC, such as here for VIC:


File writes could be the bottleneck hurting performance.
I suggest checking the performance via logs.

qtmux uses a lot of memory building an index table by default. I would try with matroskamux and see if there is any difference.
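
That would be something along these lines (only the muxer and file extension change; an untested sketch):

nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=30/1 ! nvivafilter cuda-process=true customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=NV12 ! nvv4l2h264enc ! h264parse ! matroskamux ! filesink location=video.mkv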


You may rule out container and filesystem with:

gst-launch-1.0 -v nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=30/1 ! nvivafilter cuda-process=true pre-process=false post-process=false customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=NV12 ! nvv4l2h264enc ! h264parse ! fpsdisplaysink video-sink=fakesink text-overlay=0

For measurements, I’d suggest discarding the first 20 frames and then averaging the next 100 frames.
You may also try adding a queue before filesink and see if it helps.
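
For example, only the tail of the pipeline changes (a sketch):

... ! nvv4l2h264enc ! h264parse ! qtmux ! queue ! filesink location=video.mov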


If I run jetson_clocks first, the improvement in latency and FPS is significant. Thank you so much for this. Here is a new table averaged over 5 runs when I have run jetson_clocks (no pre/post-process is being specified):

I am now always getting at least 104 FPS, usually >113 FPS when I specify “120/1” in the pipeline (it used to max out at 79 FPS). The test is done as before by flashing an LED using GPIO and measuring the time it takes for the flash to appear in the CUDA process. I process 100 frames before flashing the LED.

Some comments on the latencies when encoding/saving H264 vs not encoding/saving H264:

  • At 120 FPS and encoding/saving H264 the latency varies much more than when not encoding/saving H264. e.g. one run can show 27ms latency and the next shows 95ms.
  • This is concerning but maybe I have to put up with this?
  • Compare this to 60 FPS, where the latency is always 22ms and does not vary at all when saving/encoding H264.

As you can see in the table, I did a test with NFS unmounted (I usually have a share mounted remotely on the Jetson), but this did not improve the FPS.

N.B. In the table, encoding H264 implicitly means to also save the video to disk. When I am not encoding, my pipeline terminates in the “fakesink” element.

Why does this script make such an amazing difference?

BTW, I would like to be clear that my main problem has always been latency. I am happy that the FPS has been boosted to almost 120 FPS however the latency is much more important to me in this application.

I don’t think that boosting the clocks for VI/VIC/NVENC applies to Jetson Nano because I do not have the relevant sysfs nodes:

$ sudo find /sys/kernel/debug/ -name "*vic*"                 
/sys/kernel/debug/ieee80211/phy0/netdev:p2p-dev-wlan0/iwlmvm/os_device_timediff
/sys/kernel/debug/ieee80211/phy0/netdev:wlan0/iwlmvm/os_device_timediff
/sys/kernel/debug/pg_domains/vic
/sys/kernel/debug/clk/vic03
/sys/kernel/debug/clk/vic03.cbus
/sys/kernel/debug/clk/vic.floor.cbus
/sys/kernel/debug/pcie/list_devices
/sys/kernel/debug/vic
/sys/kernel/debug/tracing/events/cfg80211/rdev_start_p2p_device
/sys/kernel/debug/tracing/events/cfg80211/rdev_stop_p2p_device
/sys/kernel/debug/tracing/events/iommu/add_device_to_group
/sys/kernel/debug/tracing/events/iommu/remove_device_from_group
/sys/kernel/debug/tracing/events/iommu/attach_device_to_domain
/sys/kernel/debug/tracing/events/iommu/detach_device_from_domain
/sys/kernel/debug/tracing/events/random/add_device_randomness
/sys/kernel/debug/tracing/events/nvhost/nvhost_vm_init_device
/sys/kernel/debug/tracing/events/ext4/ext4_evict_inode
/sys/kernel/debug/tracing/events/power/device_pm_callback_start
/sys/kernel/debug/tracing/events/power/device_pm_callback_end
/sys/kernel/debug/usb/devices
/sys/kernel/debug/70019000.iommu/as000/54340000.vic
/sys/kernel/debug/70019000.iommu/masters/54340000.vic
/sys/kernel/debug/pinctrl/pinctrl-devices

Hi,
Please execute the steps to run the system at maximum performance mode, and check if there is improvement:

  1. Run $ sudo nvpmodel -m 0 and $ sudo jetson_clocks
  2. Set this property on the hardware encoder (see the example after this list):
  maxperf-enable      : Enable or Disable Max Performance mode
                        flags: readable, writable, changeable only in NULL or READY state
                        Boolean. Default: false
  3. Enable the VIC engine at maximum clock:
    Nvvideoconvert issue, nvvideoconvert in DS4 is better than Ds5? - #3 by DaneLLL
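
For reference, enabling that property in a launch string would look like this (a fragment only, not the full pipeline):

... ! nvv4l2h264enc maxperf-enable=1 ! h264parse ! qtmux ! filesink location=video.mov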

You can check CPU/GPU/NVENC status by executing sudo tegrastats

  1. I already have nvpmodel set to 0. jetson_clocks has been run and has given a fairly big improvement.

  2. I see that there is no such property on omxh264enc, so I assume you meant nvv4l2h264enc.

  3. I followed the link, and as the commands don’t translate exactly to the Jetson Nano, I used the commands below, but they don’t work either:

$ cat /sys/devices/50000000.host1x/54340000.vic/power/control
auto
$ sudo echo on > /sys/devices/50000000.host1x/54340000.vic/power/control
-bash: /sys/devices/50000000.host1x/54340000.vic/power/control: Permission denied

$ cat /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/governor
wmark_active
$ sudo echo userspace > /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/governor
-bash: /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/governor: Permission denied

$ cat /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/available_frequencies
192000000 307200000 345600000 409600000 486400000 524800000 550400000 576000000 588800000 614400000 614400000 627200000
$ sudo echo 627200000 > /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/max_freq
-bash: /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/max_freq: Permission denied

$ ls /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/
available_frequencies  available_governors  cur_freq  device  governor  max_freq  min_freq  polling_interval  power  subsystem  target_freq  trans_stat  uevent
$ sudo echo 627200000 > /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/target_freq
-bash: /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/target_freq: Permission denied

Thus, here are the results so far (first 2 columns are just for control):

Hi,
Looks like you failed to set the VIC to maximum clock. Please run $ sudo su to become superuser and try the commands again. See if this works.
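
Alternatively, without a root shell, such writes can go through tee, since with plain sudo the redirection is performed by your non-root shell (which is why you saw Permission denied). For example:

$ echo on | sudo tee /sys/devices/50000000.host1x/54340000.vic/power/control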

And please boost the clock of NVCSI and ISP engines:
Jetson/l4t/Camera BringUp - eLinux.org
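
From that page, the boost is along these lines (run as root; these debugfs paths assume a BPMP-based Jetson, so they may not exist on every platform):

# echo 1 > /sys/kernel/debug/bpmp/debug/clk/vi/mrq_rate_locked
# echo 1 > /sys/kernel/debug/bpmp/debug/clk/isp/mrq_rate_locked
# echo 1 > /sys/kernel/debug/bpmp/debug/clk/nvcsi/mrq_rate_locked
# cat /sys/kernel/debug/bpmp/debug/clk/vi/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/vi/rate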


Firstly, I’d like to clarify that at 60fps the latency is always 1 frame or less, so all tests from now on relate to 120fps, where the latency is rarely 1 frame and always between 1 and 10 frames (it varies between separate runs, but I do not present that information here).

Running as su (rather than via sudo) mostly worked; however, this failed:

$ echo 627200000 > /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/target_freq 
bash: /sys/devices/50000000.host1x/54340000.vic/devfreq/54340000.vic/target_freq: Permission denied

On the Jetson Nano it seems that the NVCSI/ISP engines cannot be boosted, as the directory “/sys/kernel/debug/bpmp/” does not exist; there are very few sysfs filenames containing “bpmp” and none in debugfs. All I found were these:

$ ls /sys/kernel/debug/clk/isp/
clk_accuracy  clk_enable_count  clk_flags  clk_notifier_count  clk_parent  clk_phase  clk_possible_parents  clk_prepare_count  clk_rate  clk_state  clk_update_rate  dvfs_freq_offs  frequency_stats_table
$ ls /sys/kernel/debug/clk/csi
clk_accuracy  clk_enable_count  clk_flags  clk_notifier_count  clk_parent  clk_phase  clk_prepare_count  clk_rate  clk_state  clk_update_rate  dvfs_freq_offs  frequency_stats_table
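
For what it’s worth, the current rate can at least be read from the clk_rate node shown above:

$ sudo cat /sys/kernel/debug/clk/isp/clk_rate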

But I don’t see how these would relate to the link you gave.

Given this, these are the tests over averages of 10 runs:

Ultimately, a variable latency of 1-10 frames at 120fps, from reality to the CUDA algorithm, makes control difficult. At the moment I have 2 choices:

  1. Run at 60fps, where the latency is always 22ms and I can save H264 for the debugging purposes I need it for;
  2. Run at 120fps without saving H264, and accept a variation in latency of 20-40ms. (If saving H264, the variation in latency is much greater and the average latency is about 50% higher.)

To be clear, I should now be okay with my algorithm but any further reductions of latency (and reductions in variation of latency) are welcome. :-)

I realise I never replied to this.

The answer is: with H264 encoding and a fakesink there is no improvement in FPS or latency. I also tried writing the encoded file to a ramdisk, but there was no improvement there either.
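
For the ramdisk test I mean something along these lines (the mount point is arbitrary):

$ sudo mount -t tmpfs -o size=512M tmpfs /mnt/ramdisk

with the pipeline then ending in filesink location=/mnt/ramdisk/video.mov.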

Hi,
Do you use Jetpack 4.6.2 or 4.6.3? I am not sure if you use the latest release.


I think I am using 4.6.3 because I have L4T version 32.6.1.
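
For reference, the exact L4T version can be read from /etc/nv_tegra_release:

$ head -n 1 /etc/nv_tegra_release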

Hi,
r32.6.1 is Jetpack 4.6. Is it possible to upgrade to a later release and try?

Ah, I beg your pardon.

The latest Jetpack release is 4.6.3/L4T 32.7.3, but the only changes there are security fixes, which would not improve performance.

Let me ask you and Nvidia a question:

Please could Nvidia provide nvarguscamerasrc with a reduced or customisable buffer queue size?

According to this post, Nvidia has done that in the past with a similar plugin:

It would greatly improve my product if you could do this, because at present the nvarguscamerasrc gstreamer plugin is poorly optimised for low latency.

Hi,
There are buffers in the Argus stack for capturing Bayer frames, which are then queued in the ISP engine to output YUV frames. The current implementation uses the minimum buffer number, which is tested and verified in SQA tests. Reducing the number may impact system stability. It is a fixed value and cannot be customized.

Please share the gstreamer commands and the steps for checking latency, so that we can set up and try to replicate the issue on a Jetson Nano + Raspberry Pi camera V2, and then check with our teams.
