It seems there’s a limit of 2 simultaneous encoding sessions on a card, but I can’t find any mention of how many simultaneous decoders are possible on a single GPU. I have an application where I am receiving encoded frames, over a network connection, from possibly hundreds of cameras (this is a surveillance app). With that in mind, how could I leverage NVIDIA’s GPU decoding capability to address this problem?
Thanks in advance for any suggestions.
Hi
There is no hard limit on the number of decoders you can run in parallel; the practical limit is the availability of system resources.
Thanks
Thank you, Vignesh.
I’ve been using a single decoder for more than a year now with good success. I’m trying to scale up to multiple decoders.
- I use a single CUDA context with all of the decoders instantiated inside this context. I do this assuming that only a single context is active at a time, so in order to have any hope of simultaneous execution of kernels or video decoders, they all must be in the same context. Please correct me if I’m wrong. (A rough sketch of this setup follows this list.)
- I have a separate CPU thread feeding and operating each decoder.
- I have a separate CUDA stream created for each decoder, although these streams are only used for my CUDA-based image processing on the output (decoded) images (e.g. NV12 to ARGB).
- I’m also displaying the images on a WPF D3DImage…but that’s a whole other set of issues. It’s working, but performance is worrisome.
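To make this concrete, the structure is roughly like the sketch below (heavily simplified, error checking omitted; the H.264 codec, 1080p resolution, surface counts, and camera count are just placeholders for my real per-camera parameters):

#include <cuda.h>
#include <nvcuvid.h>
#include <vector>

// One shared CUDA context + one cuvid context lock for all decoders,
// plus a per-decoder CUDA stream used only for my post-processing kernels.
struct DecoderSlot {
    CUvideodecoder decoder = nullptr;
    CUstream       stream  = nullptr;   // NV12 -> ARGB conversion, etc.
};

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);          // the single shared context

    CUvideoctxlock lock;
    cuvidCtxLockCreate(&lock, ctx);     // handed to every decoder

    const int numCameras = 8;           // placeholder; the real app has far more
    std::vector<DecoderSlot> slots(numCameras);

    for (auto& s : slots) {
        CUVIDDECODECREATEINFO ci = {};
        ci.CodecType           = cudaVideoCodec_H264;          // placeholder codec
        ci.ChromaFormat        = cudaVideoChromaFormat_420;
        ci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
        ci.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
        ci.ulWidth             = 1920;                          // placeholder size
        ci.ulHeight            = 1080;
        ci.ulTargetWidth       = 1920;
        ci.ulTargetHeight      = 1080;
        ci.ulNumDecodeSurfaces = 8;
        ci.ulNumOutputSurfaces = 2;
        ci.vidLock             = lock;                          // shared context lock

        cuvidCreateDecoder(&s.decoder, &ci);
        cuStreamCreate(&s.stream, CU_STREAM_NON_BLOCKING);
        // A dedicated CPU thread then feeds this decoder via cuvidDecodePicture().
    }

    // ... per-thread feeding, mapping and post-processing happens here ...

    for (auto& s : slots) {
        cuStreamDestroy(s.stream);
        cuvidDestroyDecoder(s.decoder);
    }
    cuvidCtxLockDestroy(lock);
    cuCtxDestroy(ctx);
    return 0;
}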
Anyway, all that said, I have a couple of questions:
1 - There doesn’t appear to be any use of streams with the CUvideodecoder, so I’m assuming that getting multiple decoders on the GPU to decode simultaneously is not possible? My hope is that simultaneous decoding is possible, but since the CUvideodecoder also takes a context lock argument at creation, I wonder if each video decoder locks the entire context (preventing any other decoders from doing anything) while it performs some operations.
2 - Is there any way to predict the amount of GPU memory/resources required to run a CUvideodecoder? My intuition tells me it must be related to the size of the decoded image, but it’s hard to tell.
Sorry for the long question.
Bryan
I use a single CUDA context with all of the decoders instantiated inside this context
From the above statement, I am assuming a single CUDA context for all decoders inside this “process”. If my understanding is correct, your assumption that you must have only a single context for parallel decoder execution is incorrect. You can have multiple contexts (one context per thread). A “context per thread” model can be used to saturate the decode engine. Note again that the video decode engine is completely independent and separate from the graphics engine on the GPU, and hence the optimization principles for CUDA do not necessarily apply directly to video decoding.
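As an illustration, a “context per thread” version of the same setup might look roughly like the sketch below (error checking omitted; the H.264 codec, 1080p resolution, and surface counts are placeholder values rather than recommendations):

#include <cuda.h>
#include <nvcuvid.h>
#include <thread>
#include <vector>

// Each worker thread owns its own CUDA context and its own decoder, so no
// shared context lock is needed and no thread can block another on it.
static void DecodeWorker(CUdevice dev, int cameraIndex) {
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);           // context private to this thread

    CUVIDDECODECREATEINFO ci = {};
    ci.CodecType           = cudaVideoCodec_H264;   // placeholder codec
    ci.ChromaFormat        = cudaVideoChromaFormat_420;
    ci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
    ci.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
    ci.ulWidth  = ci.ulTargetWidth  = 1920;          // placeholder size
    ci.ulHeight = ci.ulTargetHeight = 1080;
    ci.ulNumDecodeSurfaces = 8;
    ci.ulNumOutputSurfaces = 2;
    // No vidLock set: this context is not shared with any other thread.

    CUvideodecoder dec;
    cuvidCreateDecoder(&dec, &ci);

    (void)cameraIndex;  // would select this camera's bitstream and feed
                        // cuvidDecodePicture() in a loop here

    cuvidDestroyDecoder(dec);
    cuCtxDestroy(ctx);
}

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)          // one thread (and context) per stream
        workers.emplace_back(DecodeWorker, dev, i);
    for (auto& t : workers)
        t.join();
    return 0;
}

Whether this or a shared-context model works better for hundreds of cameras is worth measuring on your target hardware, since each context carries some memory overhead.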
There doesn’t appear to be any use of streams with the CUvideodecoder, so I’m assuming that getting multiple decoders on the GPU to decode simultaneously is not possible? My hope is that simultaneous decoding is possible, but since the CUvideodecoder also takes a context lock argument at creation, I wonder if each video decoder locks the entire context (preventing any other decoders from doing anything) while it performs some operations.
As I said above, video decode is different from kernel execution. The concept of “multiple streams to saturate the graphics engine” does not apply to the decode engine per se. However, you can have multiple threads feeding the decoder. For example, if the decoder can decode a single 4K video at 60 fps, then we expect it to decode four full-HD (1080p) streams at 60 fps.
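As a back-of-envelope illustration of that scaling (the 4K-at-60-fps budget below is just the example figure above, not a measured number for any particular GPU):

#include <cstdio>

// Estimate how many streams of a given resolution fit into a known
// decode-engine pixel budget, assuming decode cost scales with pixel rate.
int main() {
    const double budgetPixelsPerSec = 3840.0 * 2160.0 * 60.0;  // one 4K stream at 60 fps
    const double perStream1080p60   = 1920.0 * 1080.0 * 60.0;  // one 1080p stream at 60 fps

    const int streams = static_cast<int>(budgetPixelsPerSec / perStream1080p60);
    std::printf("Approx. 1080p60 streams per 4K60 of decode budget: %d\n", streams);  // prints 4
    return 0;
}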
Is there any way to predict the amount of GPU memory/resources required to run a CUvideodecoder? My intuition tells me it must be related to the size of the decoded image, but it’s hard to tell.
GPU memory utilization depends on the resolution of the video to be decoded, among several other factors. You can use the NVML APIs to query the current GPU memory utilization.
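For example, a minimal sketch using the NVML C API to read the current memory usage on device 0 could look like this (link against the NVML library; error handling kept to a minimum):

#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlMemory_t mem;
        if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
            std::printf("GPU memory: %llu MiB used / %llu MiB total\n",
                        (unsigned long long)(mem.used  >> 20),
                        (unsigned long long)(mem.total >> 20));
        }
    }

    nvmlShutdown();
    return 0;
}

Sampling this before and after creating a decoder gives you an empirical per-decoder memory cost for a given resolution.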