Sharing the same CUDA context for encoding (NVENC) and decoding (NVDEC)

We have a program with both encoding and decoding capabilities. When it was initially implemented we looked at the ffmpeg implementation and took the same approach: a separate CUDA context for each decoding and encoding session. This works fine for ffmpeg because users typically run a separate ffmpeg process per stream, and the NVIDIA driver apparently handles multiple contexts created from different processes well. In our case everything runs in a single process, and there we found that context creation becomes very slow: we were not able to create more than about 20 contexts in our application, even on a powerful GPU such as a Tesla P100.
I looked at other applications and Stack Overflow posts and found that some developers use a single context for both encoding and decoding. I tried this approach and it works fine for me (a short code sketch follows the list below):

  • create the context with cuCtxCreate
  • detach the context from the creating thread with cuCtxPopCurrent
  • make the context current in every thread before using it, via cuCtxSetCurrent
  • use this one context in all encoding and decoding threads
  • use cuvidCreateDecoder to create a decoder, or
  • set NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS.device = the_context and call nvEncOpenEncodeSessionEx to create an encoder.
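
To make these steps concrete, here is a minimal sketch (names like g_ctx and the helper functions are just illustrative, error checking is omitted, and I assume the NVENC function list has already been filled by NvEncodeAPICreateInstance):

    #include <cuda.h>          // CUDA driver API
    #include "nvcuvid.h"       // NVDEC (cuvid*) API
    #include "nvEncodeAPI.h"   // NVENC API

    // One context shared by all encode/decode sessions on this GPU.
    CUcontext g_ctx = nullptr;

    void CreateSharedContext(int gpuOrdinal)
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, gpuOrdinal);
        cuCtxCreate(&g_ctx, 0, dev);   // context becomes current on this thread
        cuCtxPopCurrent(nullptr);      // detach it so worker threads can attach it themselves
    }

    // Called at the start of every encoding/decoding thread (and before any
    // CUDA call made on behalf of a session).
    void AttachSharedContext()
    {
        cuCtxSetCurrent(g_ctx);
    }

    // Decoder creation: createInfo is a fully filled CUVIDDECODECREATEINFO.
    CUvideodecoder CreateDecoder(CUVIDDECODECREATEINFO* createInfo)
    {
        AttachSharedContext();
        CUvideodecoder dec = nullptr;
        cuvidCreateDecoder(&dec, createInfo);
        return dec;
    }

    // Encoder creation: nvenc is the NV_ENCODE_API_FUNCTION_LIST previously
    // filled by NvEncodeAPICreateInstance().
    void* OpenEncoderSession(NV_ENCODE_API_FUNCTION_LIST* nvenc)
    {
        AttachSharedContext();
        NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS params = {};
        params.version    = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
        params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
        params.device     = g_ctx;      // the shared context
        params.apiVersion = NVENCAPI_VERSION;
        void* hEncoder = nullptr;
        nvenc->nvEncOpenEncodeSessionEx(&params, &hEncoder);
        return hEncoder;
    }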

Everything works fine and this approach looks brilliant. I just need confirmation from NVIDIA experts that it is a legitimate approach and not a side effect. Please note that without the cuCtxSetCurrent call the encoder fails on the second encoding session, so this call is required even though the context is also passed in the "device" parameter of NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS when the session is created with nvEncOpenEncodeSessionEx. My guess is that cuCtxSetCurrent effectively marks the context as shareable, because in the one-context-per-encoding-session case this call is not required. But I cannot find any confirmation in the docs.

There are also the cuvidCtxLockCreate, cuvidCtxLock and cuvidCtxUnlock functions in the SDK, described as "Context-locking: to facilitate multi-threaded implementations". These functions are not deprecated, but I don't use them in my current implementation and everything is fine. Should I use them, or is cuCtxSetCurrent enough for both the encoding and decoding threads?
Thanks a lot!

NVIDIA support/engineering team,
could you please review the description above and reply?
I have done all the research I could and read all the available docs, but was not able to find an answer.

Any news on that?
Do you still get the same performance when sharing the same context over the whole application?

Hi @dumbdog,
Yes, I still get good performance when I share one context for the whole application; if I create a separate context for each encoding session, performance is bad. Everything is just as I described above.

One thing I should add: in the decoder I have to call cuMemcpy2D between cuvidCtxLock and cuvidCtxUnlock calls. After my initial post I ran into a case where the decoder hangs with a large number of decoding sessions if I don't create a lock with cuvidCtxLockCreate and use it to guard the cuMemcpy2D calls. Since I started using the lock in the decoder, everything works fine and is stable: only one context (per GPU) for both encoding and decoding. But again, there are still no official comments from NVIDIA. I understand that I'm not a huge game developer or a particularly valuable NVIDIA partner, but we are not a tiny company either and we ship a good commercial encoder with NVIDIA GPU support, so I think it would be cool if NVIDIA looked at my questions and confirmed or corrected the proposed approach. But nothing yet.
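
Roughly, the guarded copy looks like this (illustrative names, error checking omitted; the source pointer/pitch are whatever cuvidMapVideoFrame returned and the destination is our own device buffer):

    #include <cuda.h>
    #include "nvcuvid.h"

    // Created once per GPU, right after the shared context.
    CUvideoctxlock g_ctxLock = nullptr;

    void CreateCtxLock(CUcontext sharedCtx)
    {
        cuvidCtxLockCreate(&g_ctxLock, sharedCtx);
    }

    // Copy a mapped decoded surface into a caller-owned device buffer while
    // holding the cuvid context lock. src* come from cuvidMapVideoFrame,
    // dst* describe the destination buffer.
    void CopyDecodedFrame(CUdeviceptr srcDevPtr, unsigned int srcPitch,
                          CUdeviceptr dstDevPtr, unsigned int dstPitch,
                          unsigned int widthInBytes, unsigned int height)
    {
        cuvidCtxLock(g_ctxLock, 0);
        CUDA_MEMCPY2D copy = {};
        copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
        copy.srcDevice     = srcDevPtr;
        copy.srcPitch      = srcPitch;
        copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
        copy.dstDevice     = dstDevPtr;
        copy.dstPitch      = dstPitch;
        copy.WidthInBytes  = widthInBytes;
        copy.Height        = height;
        cuMemcpy2D(&copy);
        cuvidCtxUnlock(g_ctxLock, 0);
    }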

@gjakop: Is your software the “Transcoder for Nimble Streamer”?

With the latest ffmpeg changes, there is performance degradation when using multiple transcodes (so multiple contexts) with certain features enabled. Check my ffmpeg bug report here:

http://trac.ffmpeg.org/ticket/7582

But if I use nvidia-cuda-mps, then there is no performance hit.

With your software, since you use one shared context per GPU, is the VRAM allocation lower? Take a look at my report here https://devtalk.nvidia.com/default/topic/1039540/video-codec-sdk/nvdec-nvenc-vram-allocation-differences-between-different-gpus/ where there is higher memory allocation on certain GPUs compared to the Quadro P2000.

Yes, I represent the Nimble Streamer Transcoder team. How did you figure that out?

ffmpeg has to use MPS to share the same context, because multiple encoding/decoding sessions usually mean multiple ffmpeg processes. We have a single process (with all the pros and cons of that approach) and can share one context without MPS.

I've checked your report. I would say VRAM has never been a problem for us, because the encoding/decoding GPU resources get exhausted before VRAM does, so I can't say for sure how VRAM utilization differs between sharing a context and using a separate context for each encoding session.

Google-fu, I guess :-)
I searched for your nickname plus "video encoder" and landed on what is probably your Twitter account.
I have sent you a PM about your software.

gjakop - do you set the vidLock param when creating the decoder? I'm using a very similar approach to yours.

Thanks,

Erin

Hi Erin,
yes, I set vidLock in CUVIDDECODECREATEINFO. I don't recall for sure, but something probably failed without this step. Maybe not. If you can get NVIDIA to review my approach and comment on it, that would be great. It still works stably, but NVIDIA engineers have not confirmed that it's valid/"legal".

Hi all,
well, I am using many tens of NVDEC decoders in a multi-threaded scenario in one CUDA context without any issues. What kind of problems do you have without cuvidCtxLock? This is what cuviddec.h says about the lock:

“If non-NULL, context lock used for synchronizing ownership of the cuda context. Needed for cudaVideoCreate_PreferCUDA decode.”

Since I was initially using cudaVideoCreate_PreferCUVID, I just didn't implement this lock. But I still have some CUDA work involved in the decoder threads…

Are we supposed to create one lock per decoder for its internal synchronization or should one lock be shared among all decoder instances?

I remember reading somewhere that cuvidMapVideoFrame acquires the lock automatically, is that true?

Alex

Hi Alex,
we use cudaVideoCreate_PreferCUVID too, but without the cuvidCtxLock call around cuMemcpy2D we hit the problem I described above in my reply to @dumbdog. Again, there have been no comments from the NVIDIA team, so I don't know why this magic works. But the problem reproduced at least once a day when the number of decoding sessions was around 100. Maybe it's not necessary anymore with the latest NVIDIA GPUs/drivers, but I don't know.

Hi gjakop,

that's interesting. Do you create one lock per decoder, or one lock shared across all decoders? I have seen both approaches, and both were written by NVIDIA developers… I don't even understand why a lock should be necessary, because if you are not using CUDA streams, all CUDA work is automatically enqueued (= serialized) on the default CUDA stream of the active context.

I would recommend investigating the AppDecPerf sample from the latest SDK. If you look closely at what it does, you will find that it only runs the decoders in parallel when each decoder lives in a separate CUDA context. Once you use the "-single" option to run everything in just one context, it creates a std::mutex and runs the decode action for only one decoder at a time (!), as if they were unsure whether the decoder supports running multiple instances in parallel. So they apply an even stronger locking mechanism that doesn't allow the decoders to run in parallel at all…

Honestly, I don't understand why this is not documented and why NVIDIA does not comment on this (and similar) topics; it would be very helpful.

Alex

I create one lock for all decoders and set it in the vidLock field of the CUVIDDECODECREATEINFO structure before each decoder creation. Then I take this single lock before the cuMemcpy2D calls and release it afterwards.
There is an option, disabled by default, that activates this behavior, and we recommend enabling it if the user uses a single NVIDIA context. Since I'm not certain about this approach, I've added some flexibility there.
Again, everything I've stated is no more than my own observations, forum searches, sample reviews, etc. It would be great to get some comments from the NVIDIA engineering team.
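
In code it is roughly this (illustrative names, error checking omitted); g_ctxLock is the same lock that guards the cuMemcpy2D calls in my earlier sketch:

    #include "nvcuvid.h"

    // The same lock instance that guards the cuMemcpy2D calls above.
    extern CUvideoctxlock g_ctxLock;

    CUvideodecoder CreateDecoderWithSharedLock(CUVIDDECODECREATEINFO* createInfo)
    {
        createInfo->vidLock = g_ctxLock;   // one lock shared by all decoder instances
        CUvideodecoder dec = nullptr;
        cuvidCreateDecoder(&dec, createInfo);
        return dec;
    }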

Pretty similar to the approach I mentioned above, right? :) So as you can see, some locking is added even in the NVIDIA SDK samples, regardless of the tiny description in cuviddec.h.

Yep, it would be cool to get some expert insight from the manufacturer.

Hi,
thanks for your comments.

Almost ;), but there are two differences here:

  1. They create just one CUvideoctxlock per decoder and never use it…
  2. They explicitly serialize ALL decoding work inside one context by using a mutex. In contrast, your approach may still allow some (desired) parallel processing (a rough sketch of the difference is below).
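
Just to illustrate the difference in granularity (placeholder types, not SDK code):

    #include <mutex>

    // Placeholder types, not SDK types.
    struct Packet  { /* compressed bitstream data */ };
    struct Decoder { void Decode(const Packet&) { /* feed the cuvid parser/decoder */ } };

    // (a) Sample-style: one mutex serializes the whole decode step of every
    //     decoder, so decoders sharing one context never run concurrently.
    std::mutex g_decodeMutex;

    void DecodePacketSerialized(Decoder& dec, const Packet& pkt)
    {
        std::lock_guard<std::mutex> guard(g_decodeMutex);
        dec.Decode(pkt);
    }

    // (b) The approach described in this thread: decode calls from different
    //     threads run concurrently, and only the cuMemcpy2D of each mapped
    //     frame is taken under the cuvid context lock (CUvideoctxlock).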