Fastest solution to present NVDEC-decoded frames with OpenGL

What would be the fastest solution to get a decoded picture (from NVDEC) into a texture that can be rendered with OpenGL?

The FramePresenterGL.h example from the video-sdk-samples uses a PBO for the OpenGL/CUDA interop. It first copies the CUDA device buffer into the PBO (first copy), then uses glTexSubImage2D() to copy the data from the PBO into the texture (second copy).
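For context, here is roughly what that path boils down to (a condensed sketch, not the literal sample code; error checking omitted, and I'm assuming the frame was already color-converted to RGBA8 on the device):

```cpp
#include <GL/glew.h>   // any loader exposing GL 2.1+ works
#include <cuda.h>
#include <cudaGL.h>

// pbo was created with glBufferData(GL_PIXEL_UNPACK_BUFFER, width*height*4, ...)
// and registered once via cuGraphicsGLRegisterBuffer(&cuPboRes, pbo,
//                             CU_GRAPHICS_REGISTER_FLAGS_WRITE_DISCARD);
void Present(CUdeviceptr dpRgbaFrame, size_t framePitch,
             CUgraphicsResource cuPboRes, GLuint pbo, GLuint tex,
             int width, int height)
{
    // Copy 1: CUDA device buffer -> GL-owned PBO (device-to-device)
    CUdeviceptr dpPbo;
    size_t nPboSize;
    cuGraphicsMapResources(1, &cuPboRes, 0);
    cuGraphicsResourceGetMappedPointer(&dpPbo, &nPboSize, cuPboRes);

    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice     = dpRgbaFrame;
    m.srcPitch      = framePitch;
    m.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    m.dstDevice     = dpPbo;
    m.dstPitch      = (size_t)width * 4;
    m.WidthInBytes  = (size_t)width * 4;
    m.Height        = height;
    cuMemcpy2D(&m);
    cuGraphicsUnmapResources(1, &cuPboRes, 0);

    // Copy 2: PBO -> texture (the last argument is an offset into the bound PBO)
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```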

Is this the best we can do, or can we skip one of the copies? Maybe there are platform-specific solutions?
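The only copy-saving variant I can think of is to register the texture itself with cuGraphicsGLRegisterImage() and copy the device frame straight into its backing array, so the PBO and the glTexSubImage2D() step disappear. Untested sketch, same RGBA8 assumption as above:

```cpp
// One-copy variant: the texture is registered with CUDA once at startup,
// then each frame is copied device buffer -> texture array directly.
CUgraphicsResource cuTexRes;
// cuGraphicsGLRegisterImage(&cuTexRes, tex, GL_TEXTURE_2D,
//                           CU_GRAPHICS_REGISTER_FLAGS_WRITE_DISCARD);

void CopyFrameToTexture(CUdeviceptr dpRgbaFrame, size_t framePitch,
                        int width, int height)
{
    cuGraphicsMapResources(1, &cuTexRes, 0);
    CUarray texArray;
    cuGraphicsSubResourceGetMappedArray(&texArray, cuTexRes, 0, 0);

    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice     = dpRgbaFrame;       // RGBA8 frame from a color-convert kernel
    m.srcPitch      = framePitch;
    m.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    m.dstArray      = texArray;
    m.WidthInBytes  = (size_t)width * 4;
    m.Height        = height;
    cuMemcpy2D(&m);                      // single copy: device buffer -> texture

    cuGraphicsUnmapResources(1, &cuTexRes, 0);
}
```

Whether the per-frame map/unmap overhead eats the saving is something I would have to measure.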


I am interested in this too. Zero-copy access to the decoded surface should be physically possible, since the resource lives in global memory, right? Of course, the user would have to ensure that the frames in the DPB are not overwritten by newly decoded ones. We would need something similar to cuGraphicsResourceGetMappedPointer, but working the other way around: creating a CUgraphicsResource from a CUdeviceptr.
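To make the asymmetry concrete: NVDEC only ever hands the decoded surface to us as a bare CUdeviceptr, roughly like this (sketch, error handling omitted):

```cpp
#include <nvcuvid.h>

// Inside the parser's display callback: map the decoded picture to get its pointer.
void OnPictureDisplay(CUvideodecoder hDecoder, CUVIDPARSERDISPINFO *pDispInfo)
{
    CUVIDPROCPARAMS vpp = {};
    vpp.progressive_frame = pDispInfo->progressive_frame;
    vpp.top_field_first   = pDispInfo->top_field_first;

    CUdeviceptr dpFrame = 0;    // NV12 surface in global memory
    unsigned int pitch  = 0;
    cuvidMapVideoFrame(hDecoder, pDispInfo->picture_index, &dpFrame, &pitch, &vpp);

    // dpFrame is a plain device pointer. The GL interop API only goes the
    // other way (GL object -> CUgraphicsResource -> pointer/array), so there
    // is nothing to hand this pointer to without an intermediate copy.

    cuvidUnmapVideoFrame(hDecoder, dpFrame);
}
```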

I would also love to know whether the mapped decoded frame can be used directly in kernels (can a simple cast be used, as here?) or as the input for the FRUC library (possible duplicate).
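For the kernel case, what I have in mind is just casting the mapped pointer and indexing with the returned pitch. Untested sketch, assuming the NV12 layout NVDEC produces (luma plane on top, interleaved chroma below it):

```cpp
// Read the luma plane of the mapped NV12 frame directly in a kernel.
// 'pitch' is the value returned by cuvidMapVideoFrame.
__global__ void ExtractLuma(const unsigned char *nv12, unsigned int pitch,
                            unsigned char *dstY, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        dstY[y * width + x] = nv12[y * pitch + x];
}

// Launch by casting the CUdeviceptr from cuvidMapVideoFrame:
// dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
// ExtractLuma<<<grid, block>>>((const unsigned char *)dpFrame, pitch,
//                              dY, width, height);
```

This would have to run between cuvidMapVideoFrame() and cuvidUnmapVideoFrame(), while the decoder still holds that output surface.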

For reference: I’ve been experimenting with FFmpeg and GPU-accelerated decoding with VDPAU, where I decoded the frame but had to convert it with the VDP Video Mixer and pass it to GL using the VDPAU interop extension. I believe that at least one copy happened in the mixer.

Zero-copy access should be possible in Vulkan.

There is also a good article here.