Expected performance gain

Hi Devs))

I’m trying to compare the performance of FFmpeg compiled with NVDEC support against NVDEC used directly with CUDA.
One difference I’ve noticed is that using NVDEC directly eliminates the device-to-host data transfer, which is costly.
Another difference is that with the FFmpeg approach, ‘cuvidParseVideoData’ and ‘cuvidMapVideoFrame’ are executed in the same thread (contrary to what the documentation recommends, since ‘cuvidMapVideoFrame’ blocks execution).
So I was expecting a much higher performance gain from the direct CUDA approach.
What I actually see is rendering that is only approx. 2 times faster.
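To make the threading point concrete, here is a minimal sketch of the two-thread layout I have in mind: one thread feeds the parser while another maps frames, with a small bounded queue in between. This is not FFmpeg's code or the SDK samples' code. `FrameQueue` and `run_pipeline` are hypothetical names, and the NVDEC calls are only marked as comments so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical bounded queue that hands picture indices from the parser
// thread to the mapping thread.
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);  // a zero-capacity queue would block forever
    }

    // Blocks while the queue is full.
    void push(int picture_index) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(picture_index);
        not_empty_.notify_one();
    }

    // Blocks while the queue is empty. Returns false once the queue has
    // been closed and fully drained.
    bool pop(int& picture_index) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        picture_index = items_.front();
        items_.pop();
        not_full_.notify_one();
        return true;
    }

    // Signals that no more pictures will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::size_t capacity_;
    bool closed_ = false;
    std::queue<int> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs the two-thread pipeline and returns the picture indices in the
// order the mapping thread consumed them.
std::vector<int> run_pipeline(int frame_count, std::size_t queue_capacity) {
    FrameQueue queue(queue_capacity);
    std::vector<int> mapped;  // written only by the mapper thread

    std::thread parser([&] {
        for (int i = 0; i < frame_count; ++i) {
            // Real code would call cuvidParseVideoData here, and the
            // display callback would push the picture index.
            queue.push(i);
        }
        queue.close();
    });

    std::thread mapper([&] {
        int picture_index = 0;
        while (queue.pop(picture_index)) {
            // Real code would call cuvidMapVideoFrame here (the blocking
            // call), post-process the frame, then unmap it.
            mapped.push_back(picture_index);
        }
    });

    parser.join();
    mapper.join();
    return mapped;
}
```

Calling `run_pipeline(100, 4)` pushes 100 indices through a queue of depth 4. The point of the layout is that a blocking map call stalls only the mapper thread, while the parser keeps going until the queue fills up.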

The differences I observe:

  1. GPU frame duration with the CUDA approach is ~4 times shorter.
  2. The latency of CUDA API calls (such as the call to ConvertNV12BLtoNV12) is an order of magnitude higher with the CUDA approach. Why is that?
  3. cuStreamSynchronize takes much longer with the CUDA approach.

I’ve attached two Nsight reports to demonstrate the case:

compare_reports.zip (908.4 KB)