Hello,
When running a video model training pipeline, I use GPU video decoding (decoding ~16 frames at a constant FPS). Decoding is performed with torchcodec, which calls cuvidMapVideoFrame internally. This results in thousands of ConvertNV12BLtoNV12 kernels being launched on a separate stream, in parallel with the compiled forward/backward passes of the model. Overall performance is similar to, or worse than, CPU video decoding, and it improves when I reduce the number of decoder threads and the prefetch factor, which points to SM contention.
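For context, the two knobs I mean are the size of the decode thread pool and how many decoded clips are allowed in flight ahead of the training step. A minimal, generic sketch of that bound (this is a stand-in for the pipeline, not torchcodec's actual API; `decode_clip`, `num_threads`, and `prefetch_factor` are illustrative names):

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

_SENTINEL = object()

def bounded_map(fn, items, num_threads=2, prefetch_factor=2):
    """Apply fn over items with at most num_threads workers and at most
    num_threads * prefetch_factor tasks in flight. These are the two
    knobs (decoder threads, prefetch factor) that reduce SM contention
    between decode kernels and the model's compute kernels."""
    it = iter(items)
    max_inflight = num_threads * prefetch_factor
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Prime the queue up to the in-flight limit.
        pending = [pool.submit(fn, x) for x in itertools.islice(it, max_inflight)]
        idx = 0
        while idx < len(pending):
            yield pending[idx].result()  # preserve input order
            idx += 1
            nxt = next(it, _SENTINEL)
            if nxt is not _SENTINEL:
                pending.append(pool.submit(fn, nxt))

def decode_clip(i):
    # Stand-in for a GPU decode of one clip; in the real pipeline each
    # call ends up launching many ConvertNV12BLtoNV12 kernels.
    return i * 2

decoded = list(bounded_map(decode_clip, range(10), num_threads=2, prefetch_factor=2))
```

Lowering either knob caps how many decode kernels can contend with the model's kernels at any moment, but it trades away decode throughput, which is why I'd prefer batching the kernels instead.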
I’d like to know whether there is a way to batch the ConvertNV12BLtoNV12 kernels to eliminate the per-kernel launch overhead. I don’t see such an option at the API level.