I’m working on a GStreamer plugin that launches 3 algorithms on each frame received. These algorithms are executed on different cores using POSIX threads, and only one of them launches CUDA kernels.
When I run that plugin with a fakesink it works fine. But when I add an h264 encoder to the GStreamer pipeline, the performance of the algorithm using CUDA is greatly impacted (instead of taking around 15 ms, the execution time varies randomly between 15 and 30 ms).
I profiled the code and saw an unexpected latency between the kernel launches and the actual execution of those kernels. That latency can be up to 6 ms and seems to occur randomly on certain kernel launches. When those latencies appear, the profiler shows that the GPU is idling.
Moreover, when I deactivate the algorithms that don’t require CUDA computation, I don’t get those latencies.
So if anyone has ideas about what the problem could be, or about tools I could use to get more information, I would appreciate it.
I’m executing the code on a TK1 using CUDA 6.5 and GStreamer 1.0.
videosdisrc grabs frames from an HD-SDI source through a PCI Express FPGA connection.
cudaimageprocessing is a homemade CUDA processing element.
The other elements are those of the latest L4T release.
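In case it helps, a minimal, self-contained way to reproduce this kind of measurement outside the profiler is to bracket a launch with CUDA events plus a host timer. This is only a sketch with a placeholder kernel and buffer size, not the actual plugin code:

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real image-processing work.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Monotonic host clock, in milliseconds.
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

int main(void)
{
    const int n = 1 << 20;
    float *d = NULL;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 100; ++iter) {
        double before = now_ms();
        cudaEventRecord(start);                      // queued right before the kernel
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);                       // queued right after the kernel
        cudaEventSynchronize(stop);
        double after = now_ms();

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, start, stop);  // pure GPU execution time
        // Anything in (after - before) beyond gpu_ms is launch/scheduling latency.
        printf("iter %3d: wall %7.3f ms, gpu %7.3f ms, overhead %7.3f ms\n",
               iter, after - before, gpu_ms, (after - before) - gpu_ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}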
Hi florian,
There should be a memcpy() between ‘cudaimageprocessing ! omxh264enc’: it has to copy video/x-raw buffers into video/x-raw(memory:NVMM) buffers. This is a known issue, and we have developed tegra_multimedia_api in later releases for TX1/TX2. However, it is still a bottleneck on TK1.
There is already a memcpy as the last step in the cudaimageprocessing plugin.
I tried running in max performance mode and it didn’t solve the issue.
While running this pipeline, I observe that the TK1’s power consumption is far from constant, so I’m wondering whether the GPU and/or CPU frequencies are being changed for some reason.
Do you know a way to make sure the GPU and CPU frequencies are always set to the maximum, regardless of the current load?
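For reference, the commonly documented way to force maximum clocks on the TK1 goes through sysfs/debugfs, run as root. Below is a minimal sketch of that approach; the exact paths and the 852 MHz GPU rate are assumptions taken from public Jetson performance notes and may differ between L4T releases, so please verify them on your board:

#include <cstdio>

// Write a single value to a sysfs/debugfs file; returns true on success.
static bool write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return false;
    bool ok = fputs(value, f) >= 0;
    fclose(f);
    return ok;
}

int main(void)
{
    char path[128];

    // Keep all CPU cores online instead of letting cpuquiet hotplug them.
    write_sysfs("/sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable", "0");

    // Pin every core's cpufreq governor to "performance".
    for (int cpu = 0; cpu < 4; ++cpu) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        write_sysfs(path, "performance");
    }

    // Force the GPU clock (gbus) to its assumed maximum and lock the override.
    write_sysfs("/sys/kernel/debug/clock/override.gbus/rate", "852000000");
    write_sysfs("/sys/kernel/debug/clock/override.gbus/state", "1");
    return 0;
}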
I ran the same pipeline on a TX2 and the problem doesn’t seem to occur. Is there a fundamental memory, GPU, or encoder architecture difference between the TK1 and the TX2 that could explain that?
We investigated further and changed some cudaMallocManaged() allocations to cudaMalloc(). The computation times are now almost constant. That is better, but we thought cudaMallocManaged() was optimized for the TK1 and TX2. What is the point of it, then? Is there any difference between the TK1 and the TX2 regarding this kind of allocation?
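To illustrate the kind of change we made, here is a simplified sketch (illustrative only, not the actual plugin code; the frame size and kernel are placeholders). The managed buffer is replaced by a plain device buffer plus an explicit copy, so the driver has nothing to migrate at kernel-launch time:

#include <cstring>
#include <cuda_runtime.h>

// Hypothetical frame size, for illustration only.
static const size_t kFrameBytes = 1920 * 1080 * 2;

// Placeholder kernel standing in for the real processing.
__global__ void process(unsigned char *frame, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        frame[i] = 255 - frame[i];
}

// Before: managed allocation. On Kepler the driver may migrate the whole
// allocation around each kernel launch, and the CPU must synchronize
// before touching it again.
void run_managed(const unsigned char *host_frame)
{
    unsigned char *frame = NULL;
    cudaMallocManaged(&frame, kFrameBytes);
    memcpy(frame, host_frame, kFrameBytes);  // CPU writes to managed pages
    process<<<(unsigned)((kFrameBytes + 255) / 256), 256>>>(frame, kFrameBytes);
    cudaDeviceSynchronize();
    cudaFree(frame);
}

// After: plain device allocation plus an explicit host-to-device copy;
// nothing is left for the driver to migrate at launch time.
void run_device(const unsigned char *host_frame)
{
    unsigned char *frame = NULL;
    cudaMalloc(&frame, kFrameBytes);
    cudaMemcpy(frame, host_frame, kFrameBytes, cudaMemcpyHostToDevice);
    process<<<(unsigned)((kFrameBytes + 255) / 256), 256>>>(frame, kFrameBytes);
    cudaDeviceSynchronize();
    cudaFree(frame);
}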
Hi florian,
It is a Kepler GPU on the TK1 and a Pascal GPU on the TX2.
I am not able to reveal how cudaMallocManaged() is implemented. What the public CUDA documentation does say is that pre-Pascal GPUs such as the TK1’s Kepler do not support demand paging for managed memory, so the driver migrates managed allocations in bulk around kernel launches, whereas Pascal can fault pages in on demand. If you need more information about how the CUDA APIs are implemented, please contact an NVIDIA sales representative and see if we can set up further cooperation.