Sample AppDecMultiFiles in Video Codec SDK does not improve performance

There is a sample at Video_Codec_SDK_9.1.23/Samples/AppDecode/AppDecMultiFiles.

In the sample, each decoder runs on its own thread and its own CUDA stream.
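A simplified sketch of the structure I mean (the DecodeOneFile placeholder and the file names are mine; the actual sample uses its NvDecoder and demuxer helper classes):

```cpp
// One worker thread and one CUDA stream per input file, sharing one context.
#include <cuda.h>
#include <string>
#include <thread>
#include <vector>

void DecodeOneFile(CUcontext ctx, CUstream stream, const std::string &path) {
    cuCtxSetCurrent(ctx);
    // ... create the decoder, demux 'path', decode frames, map them on 'stream' ...
}

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    std::vector<std::string> files = {"a.mp4", "b.mp4", "c.mp4"};
    std::vector<CUstream> streams(files.size());
    std::vector<std::thread> workers;

    for (size_t i = 0; i < files.size(); ++i) {
        cuStreamCreate(&streams[i], CU_STREAM_NON_BLOCKING);
        workers.emplace_back(DecodeOneFile, ctx, streams[i], files[i]);
    }
    for (auto &w : workers) w.join();
    for (auto &s : streams) cuStreamDestroy(s);
    cuCtxDestroy(ctx);
    return 0;
}
```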

To the best of my knowledge, different CUDA streams execute asynchronously with respect to each other.

Since each decoder runs on a different thread, I expected the execution time with multiple decoders to be close to that of a single decoder.

But in my experiment, I found that the execution time grows linearly with the number of decoders.

For example, running a single decoder on a video takes about one second, but running 8 decoders on the same video takes about 10 seconds.

I am not sure whether the CUDA streams are really running asynchronously.

Any ideas?

Decoding is not running on the compute engines (“CUDA engines”) / shader cores to start with. Your performance is primarily constrained by the performance of the video decode engine, which by itself does not interfere with compute workloads, other than through potential memory-bandwidth bottlenecks.

Where CUDA does get involved is in the “cuvidMapVideoFrame64” call. That call uses a CUDA kernel to copy from the cuvid-internal decoded surfaces (which are in nv24 format, at full stream resolution and not cropped, though that is not relevant since you can’t access them unless you use e.g. DX12 instead) into the target surface (which is in nv12 format), and while doing so performs the cropping, scaling and de-interlacing defined by the decoder configuration and the parameters passed to “cuvidMapVideoFrame64”.
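Roughly what that looks like in code (a sketch only; the surrounding parser/decoder setup is assumed and error handling is omitted):

```cpp
// Inside the parser's display callback: map the decoded picture into an
// NV12 surface. The copy/crop/scale/de-interlace work is done by a CUDA
// kernel that the driver enqueues on vpp.output_stream.
#include "nvcuvid.h"

void MapDecodedFrame(CUvideodecoder hDecoder, CUVIDPARSERDISPINFO *pDispInfo,
                     CUstream decodeStream)
{
    CUVIDPROCPARAMS vpp = {};
    vpp.progressive_frame = pDispInfo->progressive_frame;
    vpp.second_field      = pDispInfo->repeat_first_field + 1;
    vpp.top_field_first   = pDispInfo->top_field_first;
    vpp.unpaired_field    = pDispInfo->repeat_first_field < 0;
    vpp.output_stream     = decodeStream;   // CUDA stream the copy kernel runs on

    unsigned long long devPtr = 0;
    unsigned int pitch = 0;
    cuvidMapVideoFrame64(hDecoder, pDispInfo->picture_index, &devPtr, &pitch, &vpp);

    // devPtr/pitch now describe the NV12 frame; consume it (copy it out,
    // feed an encoder, run further kernels on decodeStream), then release it:
    cuvidUnmapVideoFrame64(hDecoder, devPtr);
}
```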

That is the point where multiple CUDA streams can give you efficiency gains: if you have further CUDA kernels to launch, you can enqueue them on the same CUDA stream on which you performed “cuvidMapVideoFrame64”, without further explicit synchronization. The gains there are modest: kernel launches, CUDA memcpys etc. recorded to different CUDA streams may execute with partial overlap, keeping the device fully utilized, whereas work items on the same CUDA stream must each finish completely before the next may start, which can leave the GPU half-idle much of the time despite enqueued work.
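A sketch of that idea (the PostProcessNV12 kernel is purely illustrative, not part of the SDK):

```cpp
// Hypothetical post-processing of the mapped NV12 frame. Because it is
// launched on the same stream that cuvidMapVideoFrame64 used, ordering
// within the stream guarantees the copy has finished before the kernel
// runs; no explicit synchronization is needed. Kernels launched on the
// other decoders' streams may overlap with this one on the GPU.
#include <cuda.h>

__global__ void PostProcessNV12(const unsigned char *nv12, int pitch,
                                int width, int height, float *outLuma)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        outLuma[y * width + x] = nv12[y * pitch + x] / 255.0f;   // luma plane only
}

void ConsumeFrame(unsigned long long devPtr, unsigned int pitch,
                  int width, int height, float *dOutLuma, CUstream decodeStream)
{
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    PostProcessNV12<<<grid, block, 0, decodeStream>>>(
        reinterpret_cast<const unsigned char *>(devPtr), (int)pitch,
        width, height, dOutLuma);
    // Only synchronize (or record an event) when the result is actually
    // needed on the host; until then the other streams keep the GPU busy.
}
```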

Then there is also something the sample doesn’t state explicitly, which is an implementation detail of the Turing GPUs T4, RTX4000 and RTX5000: they have two decode engines instead of just one. Due to the typically sequential data dependency between frames of the same video stream, it’s not possible to load-distribute the decoding of a single stream across multiple engines (well, it actually is for some sequence formats, but NVIDIA’s implementation isn’t sophisticated enough to exploit that). With a second video stream, these GPUs therefore achieve 200% scaling in decode throughput. Just 200% though, not more than that.

Aside from these 3 special GPUs, all other GPUs of the same generation, regardless of the number of shader cores etc., will achieve only 100% scaling in throughput, and (hard memory-bandwidth constraints aside; a GT1030, for example, does not have enough memory bandwidth to keep up with the decode engine!) will all have pretty much the same peak decode throughput.