NVDEC: batch video decoder kernels

Hello,

When running a video model training pipeline, I use GPU video decoding (decoding ~16 frames per clip at a constant sampling FPS). Decoding is performed with torchcodec, which calls cuvidMapVideoFrame internally. This results in thousands of ConvertNV12BLtoNV12 kernels being launched in a separate stream, in parallel with the compiled forward/backward model passes. The overall performance is similar to or worse than with CPU video decoding. It improves when I reduce the number of decoder threads and the prefetch factor, which suggests SM contention.
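For context, the frame-sampling pattern is roughly the following (a minimal sketch; the helper name and the rounding policy are mine, not torchcodec's — in the real pipeline the resulting indices are passed to the torchcodec decoder running on the GPU):

```python
def sample_frame_indices(video_fps, num_video_frames, sample_fps, num_samples=16):
    """Pick num_samples frame indices spaced at a constant sampling FPS.

    video_fps: native frame rate of the video
    num_video_frames: total frames in the video
    sample_fps: desired sampling rate (<= video_fps)
    """
    step = video_fps / sample_fps  # native frames between consecutive samples
    indices = [round(i * step) for i in range(num_samples)]
    # Clamp so we never index past the end of short videos.
    return [min(i, num_video_frames - 1) for i in indices]


# e.g. a 30 fps video sampled at 2 fps:
# sample_frame_indices(30.0, 300, 2.0, num_samples=4) -> [0, 15, 30, 45]
```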

I’d like to know whether there is any way to batch the ConvertNV12BLtoNV12 kernels to reduce the kernel launch overhead. I don’t see such a possibility at the API level.

Hi @vkhalidov1, welcome to the NVIDIA developer forums.

I moved your post to CUDA programming for now; I think that is best suited to your specific question. If not, then Video Processing & Optical Flow - NVIDIA Developer Forums would be my next suggestion.

Thanks!