Latency issues when launching kernel in a gstreamer plugin

I’m working on an gstreamer plugin that launch 3 algorithms on each frame received. Those algorithms are executed on different core using POSIX threads and only one of those launch kernel.
When I launch that plugin whith a fakesink it works fine. But when I add an h264 encoder in the GStreamer pipeline the performances of the algorithm using CUDA are greatly impacted (Instead of requiring around 15ms the execution time randomly varies between 15 and 30ms).

I profiled the code and I saw an unexpected lattency between the kernel launches and the actual execution of those kernels. Those lattency can be up to 6ms and seems to happend randomly on certain kernel launch. When those latencies appear the profiler show that the GPU is iddling.

Moreover, when I deactivate the algoritms that doesn’t require CUDA computation, I don’t get those latencies.

So if anyone has ideas on what the problem could be, or of tools that I can use to get more information.
I’m executing the code on a TK1 using CUDA-6.5 and GStreamer-1.0

Hi florian, can you share the pipeline you are running?

The problematic pipeline is :

gst-launch-1.0 videosdisrc ! cudaimageprocessing ! omxh264enc control-rate=constant bitrate=2150000 vbv-size=2 profile=high iframeinterval=180 num-B-Frames=0 low-latency=false SliceIntraRefreshInterval=0 SliceIntraRefreshEnable=false slice-header-spacing=19200 bit-packetization=true ! video/x-h264, stream-format=byte-stream, alignment=au ! udpsink sync=false async=true

videosdisrc grabs frames from a HDSDI source through a PCIe Express FPGA connection.
cudaimageprocessing is a home made CUDA processing element.
Other elements are those of the last L4T.

Next pipeline doesn’t show latencies problems :

gst-launch-1.0 videosdisrc ! cudaimageprocessing ! fakesink sync=false async=true

It seems using H264 encoder slows down or hands GPU for some time ?

Hi florian,
There should be memcpy() between ‘cudaimageprocessing ! omxh264enc’. It has to copy video/x-raw buffers into video/x-raw(memory:NVMM) buffers. This is an issue and we have developed tegra_multimedia_api in later release for TX1/TX2. However, it is still a bottleneck on TK1.

One thing you can try is to run in max performance:
Or try SW encoder x264enc.

There is already a memcpy as a last step in the cudaimageprocessing plugin.

I tried running in max performance and it didn’t solve the issue.

While running this pipeline, I observe that the TK1 power consumption is far from constant. So I’m wondering if for some reason the GPU and/or CPU frequencies are modified.

Do you know a way to be assured that the GPU and CPU frequencies are always setted to the maximum without regards to the current charge?

Hi florian,
You can check the frequency of HW blocks via tegrastats:

ubuntu@tegra-ubuntu:~$ sudo ./tegrastats
[sudo] password for ubuntu:
RAM 457/1892MB (lfb 267x4MB) cpu [0%,off,off,off]@1734 EMC 5%@600 AVP 0%@204 VDE 120 GR3D 0%@72 EDP limit 0

I ran the same pipeline on a TX2 and the problem doesnt seem to occur. IS there a fundamental memory, Gpu or encoder architecture difference between Tk1 and Tx2 to explain that ?

Hi florian,
Please compare tegrastats of TX2 and TK1. The GPU ad CPU are different on the two chips. We should get more information via tegrastats.

tegrastats didnt really help.

we investigated further and changed some cudaMallocManaged allocation to cudaMalloc. calculation are now almost constant. it is better but we thought cudaMallocManaged was optimized for Tk1 and TX2. What is the point ? is there any difference between tk1 and tx2 about this kind of allocation ?

Hi florian,
In hardware, GPU architecture of TK1 and TX2 is different.
In software, CUDA revision of TK1 and TX2 is different.

The implementation of cudaMallocManaged() is different.

Can you be more precise ?

Hi florian,
It is Kepler GPU on TK1 and Pascal GPU on TX2.

I am not able to reveal how cudaMallocManaged() is implemented. If you need more information about how CUDA APIs are implemented, please contact NVIDIA salesperson. See if we can have further cooperation.