Decoding does not run on the compute engines (“CUDA engines”) / shader cores in the first place. Your performance is primarily constrained by the performance of the video decode engine, which by itself doesn’t interfere with compute workloads, other than through potential memory bandwidth bottlenecks.
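For reference, the decode engine is a fixed-function unit with its own capability limits, which you can query through the cuvid API regardless of how many shader cores the GPU has. A minimal sketch (assuming the driver API and the nvcuvid headers are available; link with -lcuda -lnvcuvid):

```cpp
#include <stdio.h>
#include <cuda.h>
#include <nvcuvid.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // cuvid calls need a current CUDA context

    // Query the fixed-function decode engine's limits for one codec/format combo.
    CUVIDDECODECAPS caps = {0};
    caps.eCodecType      = cudaVideoCodec_H264;
    caps.eChromaFormat   = cudaVideoChromaFormat_420;
    caps.nBitDepthMinus8 = 0;    // 8 bit
    cuvidGetDecoderCaps(&caps);

    printf("H.264 8-bit 4:2:0 supported: %d, max %ux%u\n",
           caps.bIsSupported, caps.nMaxWidth, caps.nMaxHeight);

    cuCtxDestroy(ctx);
    return 0;
}
```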
Where CUDA then becomes involved is in the “cuvidMapVideoFrame64” call. That call uses a CUDA kernel to copy from the cuvid-internal decoded surfaces (which are in NV24 format, at full stream resolution and uncropped; not that it’s relevant, since you can’t access them unless you use e.g. DX12 instead) into the target surface (which is in NV12 format), and while at it performs the cropping, scaling and de-interlacing defined by the decoder configuration and the parameters passed to “cuvidMapVideoFrame64”.
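For illustration, this is roughly how the call sits inside the parser’s display callback; “DecodeContext” is a hypothetical per-session struct, and error handling is omitted:

```cpp
#include <cuda.h>
#include <nvcuvid.h>

typedef struct {
    CUvideodecoder hDecoder;
    CUstream       stream;   // stream the map's copy kernel gets enqueued on
} DecodeContext;

static int CUDAAPI HandlePictureDisplay(void *userData, CUVIDPARSERDISPINFO *dispInfo)
{
    DecodeContext *ctx = (DecodeContext *)userData;

    CUVIDPROCPARAMS vpp = {0};
    vpp.progressive_frame = dispInfo->progressive_frame;
    vpp.top_field_first   = dispInfo->top_field_first;
    vpp.second_field      = dispInfo->repeat_first_field + 1;
    vpp.unpaired_field    = dispInfo->repeat_first_field < 0;
    vpp.output_stream     = ctx->stream;

    unsigned long long devPtr = 0;
    unsigned int pitch = 0;
    // Enqueues the internal CUDA kernel that copies/crops/scales/de-interlaces the
    // decoded surface into an NV12 surface at the decoder's configured target size.
    cuvidMapVideoFrame64(ctx->hDecoder, dispInfo->picture_index, &devPtr, &pitch, &vpp);

    /* ... consume the NV12 surface at devPtr (pitch bytes per row) ... */

    cuStreamSynchronize(ctx->stream);               // finish before recycling the surface
    cuvidUnmapVideoFrame64(ctx->hDecoder, devPtr);
    return 1;                                       // tell the parser to continue
}
```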
That is the point where multiple CUDA streams can bring efficiency gains: if you have further CUDA kernels to launch, you can enqueue them on the same CUDA stream on which you performed “cuvidMapVideoFrame64”, without further explicit synchronization. Some minor efficiency gains are possible there, as kernel launches, CUDA memcpys etc. recorded to different CUDA streams may execute with partial overlap, keeping the device permanently fully utilized, whereas kernel launches on the same CUDA stream must always completely finish before the next may start, leaving the GPU half-idle most of the time despite enqueued work.
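As a sketch picking up from the callback above: a hypothetical “ExtractLuma” kernel enqueued in place of the “consume the NV12 surface” step (“d_luma”, “width” and “height” are assumed to be set up elsewhere). Since it goes to the same stream as vpp.output_stream, stream ordering alone guarantees it runs after the internal copy kernel:

```cuda
#include <cuda.h>
#include <stdint.h>

// Hypothetical post-processing kernel: copies out the luma (Y) plane of the
// mapped NV12 surface; the Y plane occupies the first `height` rows, `pitch`
// bytes apart.
__global__ void ExtractLuma(const unsigned char *nv12, unsigned int pitch,
                            unsigned char *luma, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        luma[(size_t)y * width + x] = nv12[(size_t)y * pitch + x];
}

// Called right after cuvidMapVideoFrame64 succeeded: no cuStreamSynchronize is
// needed between the map and this launch, same-stream ordering is enough.
static void PostProcess(unsigned long long devPtr, unsigned int pitch,
                        unsigned char *d_luma, int width, int height, CUstream stream)
{
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    ExtractLuma<<<grid, block, 0, stream>>>(
        (const unsigned char *)(uintptr_t)devPtr, pitch, d_luma, width, height);
}
```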
Then there is also something the sample doesn’t state explicitly, which is an implementation detail of the Turing GPUs T4, RTX 4000 and RTX 5000: they have two decode engines instead of just one. Due to the typically sequential data dependency between frames of the same video stream, it’s not possible to load-distribute decoding of a single stream across multiple engines (well, it actually is for some sequence formats, but NVIDIA’s implementation isn’t sophisticated enough to utilize that). With a second video stream, these GPUs achieve 200% scaling in decode throughput. Just 200% scaling though, not more than that.
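As a rough sketch of what running a second stream looks like: two fully independent decode sessions, e.g. one per thread, each with its own parser, decoder and CUDA stream. “DecodeFile” here is just a placeholder for a complete session (e.g. built on the NvDecoder helper class from the Video Codec SDK samples):

```cpp
#include <thread>

// Placeholder for one complete cuvid decode session (context, parser, decoder,
// frame loop); see the SDK decode samples for the actual session setup.
static void DecodeFile(const char *path)
{
    (void)path;
}

int main()
{
    // Two independent sessions can be scheduled onto different decode engines
    // by the driver, roughly doubling aggregate throughput on the dual-engine
    // Turing parts; a third stream brings no further scaling there.
    std::thread t1(DecodeFile, "stream0.h264");
    std::thread t2(DecodeFile, "stream1.h264");
    t1.join();
    t2.join();
    return 0;
}
```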
Aside from these 3 special GPUs, all other GPUs of the same generation, regardless of the number of shader cores etc., will only achieve 100% scaling in throughput, and will also (hard memory bandwidth constraints aside, e.g. a GT 1030 doesn’t have enough memory bandwidth to keep up with the decode engine!) all have pretty much the same peak decode throughput.