Sharing the same CUDA context for encoding (NVENC) and decoding (NVDEC)

We have a program with both encoding and decoding capabilities. When it was initially implemented we looked at the ffmpeg implementation and took the same approach: a separate CUDA context for each decoding and encoding session. This works fine for ffmpeg because users typically run a separate ffmpeg process per stream, and the NVIDIA driver apparently handles multiple contexts created from different processes well. In our case everything runs in a single process, and there we found that context creation becomes very slow: we were not able to create more than about 20 contexts in our application, even on a powerful GPU such as a Tesla P100.
I looked at other applications and Stack Overflow posts and found that some developers use a single context for both encoding and decoding. I tried this approach and it works fine for me (a short code sketch follows the list below):

  • create the context with cuCtxCreate
  • detach the context from the creating thread with cuCtxPopCurrent
  • make the context current in every thread before using it, via cuCtxSetCurrent
  • use this one context in all encoding and decoding threads
  • use cuvidCreateDecoder to create a decoder, or
  • set NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS.device = the_context and call nvEncOpenEncodeSessionEx to create an encoder.
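
To make these steps concrete, here is a minimal sketch (names like g_ctx and the helper functions are just illustrative, error checking is omitted, and I assume the NVENC function list has already been filled by NvEncodeAPICreateInstance):

    #include <cuda.h>          // CUDA driver API
    #include "nvcuvid.h"       // NVDEC (cuvid*) API
    #include "nvEncodeAPI.h"   // NVENC API

    // One context shared by all encode/decode sessions on this GPU.
    CUcontext g_ctx = nullptr;

    void CreateSharedContext(int gpuOrdinal)
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, gpuOrdinal);
        cuCtxCreate(&g_ctx, 0, dev);   // context becomes current on this thread
        cuCtxPopCurrent(nullptr);      // detach it so worker threads can attach it themselves
    }

    // Called at the start of every encoding/decoding thread (and before any
    // CUDA call made on behalf of a session).
    void AttachSharedContext()
    {
        cuCtxSetCurrent(g_ctx);
    }

    // Decoder creation: createInfo is a fully filled CUVIDDECODECREATEINFO.
    CUvideodecoder CreateDecoder(CUVIDDECODECREATEINFO* createInfo)
    {
        AttachSharedContext();
        CUvideodecoder dec = nullptr;
        cuvidCreateDecoder(&dec, createInfo);
        return dec;
    }

    // Encoder creation: nvenc is the NV_ENCODE_API_FUNCTION_LIST previously
    // filled by NvEncodeAPICreateInstance().
    void* OpenEncoderSession(NV_ENCODE_API_FUNCTION_LIST* nvenc)
    {
        AttachSharedContext();
        NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS params = {};
        params.version    = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
        params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
        params.device     = g_ctx;      // the shared context
        params.apiVersion = NVENCAPI_VERSION;
        void* hEncoder = nullptr;
        nvenc->nvEncOpenEncodeSessionEx(&params, &hEncoder);
        return hEncoder;
    }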

Everything works fine and this approach looks brilliant. I just need confirmation from NVIDIA experts that it is a legitimate approach and not a side effect. Please note that without the cuCtxSetCurrent call the encoder fails on the second encoding session, so this call is required even though the context is also passed in the "device" parameter of NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS when the session is created with nvEncOpenEncodeSessionEx. My guess is that cuCtxSetCurrent effectively marks the context as shareable, because in the one-context-per-encoding-session case this call is not required. But I cannot find any confirmation in the docs.

There are also the cuvidCtxLockCreate, cuvidCtxLock and cuvidCtxUnlock functions in the SDK, described as "Context-locking: to facilitate multi-threaded implementations". These functions are not deprecated, but I don't use them in my current implementation and everything is fine. Should I use them, or is cuCtxSetCurrent enough for both the encoding and decoding threads?
Thanks a lot!

NVIDIA support/engineering team,
could you please review the description above and reply?
I have done all the research I could and read all the available docs, but was not able to find an answer.

Any news on that?
Do you still get the same performance when sharing the same context over the whole application?

Hi @dumbdog,
Yes, I still get good performance when I share one context for the whole application; if I create a separate context for each encoding session, performance is bad. Everything is just as I described above.

One thing I should add: in the decoder I have to call cuMemcpy2D between cuvidCtxLock and cuvidCtxUnlock calls. After my initial post I ran into a case where the decoder hangs with a large number of decoding sessions if I don't create a lock with cuvidCtxLockCreate and use it to guard the cuMemcpy2D calls. Since I started using the lock in the decoder, everything works fine and is stable: only one context (per GPU) for both encoding and decoding. But again, there are still no official comments from NVIDIA. I understand that I'm not a huge game developer or a particularly valuable NVIDIA partner, but we are not a tiny company either and we ship a good commercial encoder with NVIDIA GPU support, so I think it would be cool if NVIDIA looked at my questions and confirmed or corrected the proposed approach. But nothing yet.
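
Roughly, the guarded copy looks like this (illustrative names, error checking omitted; the source pointer/pitch are whatever cuvidMapVideoFrame returned and the destination is our own device buffer):

    #include <cuda.h>
    #include "nvcuvid.h"

    // Created once per GPU, right after the shared context.
    CUvideoctxlock g_ctxLock = nullptr;

    void CreateCtxLock(CUcontext sharedCtx)
    {
        cuvidCtxLockCreate(&g_ctxLock, sharedCtx);
    }

    // Copy a mapped decoded surface into a caller-owned device buffer while
    // holding the cuvid context lock. src* come from cuvidMapVideoFrame,
    // dst* describe the destination buffer.
    void CopyDecodedFrame(CUdeviceptr srcDevPtr, unsigned int srcPitch,
                          CUdeviceptr dstDevPtr, unsigned int dstPitch,
                          unsigned int widthInBytes, unsigned int height)
    {
        cuvidCtxLock(g_ctxLock, 0);
        CUDA_MEMCPY2D copy = {};
        copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
        copy.srcDevice     = srcDevPtr;
        copy.srcPitch      = srcPitch;
        copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
        copy.dstDevice     = dstDevPtr;
        copy.dstPitch      = dstPitch;
        copy.WidthInBytes  = widthInBytes;
        copy.Height        = height;
        cuMemcpy2D(&copy);
        cuvidCtxUnlock(g_ctxLock, 0);
    }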

@gjakop: Is your software the “Transcoder for Nimble Streamer”?

With the latest ffmpeg changes, there is performance degradation when using multiple transcodes (so multiple contexts) with certain features enabled. Check my ffmpeg bug report here:

http://trac.ffmpeg.org/ticket/7582

But if I use nvidia-cuda-mps, then there is no performance hit.

With your software, since you use one shared context per GPU, is the VRAM allocation lower? Take a look at my report here https://devtalk.nvidia.com/default/topic/1039540/video-codec-sdk/nvdec-nvenc-vram-allocation-differences-between-different-gpus/ where there is higher memory allocation on certain GPUs compared to the Quadro P2000.

Yes, I represent the Nimble Streamer Transcoder team. How did you figure that out?

ffmpeg has to use MPS to share the same context, because multiple encoding/decoding sessions usually mean multiple ffmpeg processes. We have a single process (with all the pros and cons of that approach) and can share one context without MPS.

I've checked your report. I would say VRAM has never been a problem for us, because the encoding/decoding GPU resources get exhausted before VRAM does, so I can't say for sure how VRAM utilization differs between sharing a context and using a separate context for each encoding session.

Google-fu, I guess :-)
I searched for your nickname plus "video encoder" and landed on what is probably your Twitter account.
I have sent you a PM about your software.

gjakop - do you set the vidLock param when creating the decoder? I'm using a very similar approach to yours.

Thanks,

Erin

Hi Erin,
yes, I set vidLock in CUVIDDECODECREATEINFO. I don't recall for sure, but something probably failed without this step. Maybe not. If you can get NVIDIA to review my approach and comment on it, that would be great. It still works stably, but NVIDIA engineers have not confirmed that it's valid/"legal".

Hi all,
well, I am using many tens of NVDEC decoders in a multi-threaded scenario in one CUDA context without any issues. What kind of problems do you have without cuvidCtxLock? This is what cuviddec.h says about the lock:

“If non-NULL, context lock used for synchronizing ownership of the cuda context. Needed for cudaVideoCreate_PreferCUDA decode.”

Since I was initially using cudaVideoCreate_PreferCUVID, I just didn't implement this lock. But I still have some CUDA work involved in the decoder threads…

Are we supposed to create one lock per decoder for its internal synchronization or should one lock be shared among all decoder instances?

I remember reading somewhere that cuvidMapVideoFrame acquires the lock automatically, is that true?

Alex

Hi Alex,
we use cudaVideoCreate_PreferCUVID too, but without the cuvidCtxLock call around cuMemcpy2D we hit the problem I described above in my reply to @dumbdog. Again, there have been no comments from the NVIDIA team, so I don't know why this magic works. But the problem reproduced at least once a day when the number of decoding sessions was around 100. Maybe it's not necessary anymore with the latest NVIDIA GPUs/drivers, but I don't know.

Hi gjakop,

that's interesting. Do you create one lock per decoder, or one lock shared across all decoders? I have seen both approaches, and both were written by NVIDIA developers… I don't even understand why a lock should be necessary, because if you are not using CUDA streams, all CUDA work is automatically enqueued (= serialized) on the default CUDA stream of the active context.

I would recommend investigating the AppDecPerf sample from the latest SDK. If you look closely at what it does, you will find that it only runs the decoders in parallel when each decoder lives in a separate CUDA context. Once you use the "-single" option to run everything in just one context, it creates a std::mutex and runs the decode action for only one decoder at a time (!), as if they were unsure whether the decoder supports running multiple instances in parallel. So they apply an even stronger locking mechanism that doesn't allow the decoders to run in parallel at all…

Honestly, I don't understand why this is not documented and why NVIDIA does not comment on this (and similar) topics; it would be very helpful.

Alex

I create one lock for all decoders and set it in the vidLock field of the CUVIDDECODECREATEINFO structure before each decoder creation. Then I take this single lock before the cuMemcpy2D calls and release it afterwards.
There is an option, disabled by default, that activates this behavior, and we recommend enabling it if the user uses a single NVIDIA context. Since I'm not certain about this approach, I've added some flexibility there.
Again, everything I've stated is no more than my own observations, forum searches, sample reviews, etc. It would be great to get some comments from the NVIDIA engineering team.
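
In code it is roughly this (illustrative names, error checking omitted); g_ctxLock is the same lock that guards the cuMemcpy2D calls in my earlier sketch:

    #include "nvcuvid.h"

    // The same lock instance that guards the cuMemcpy2D calls above.
    extern CUvideoctxlock g_ctxLock;

    CUvideodecoder CreateDecoderWithSharedLock(CUVIDDECODECREATEINFO* createInfo)
    {
        createInfo->vidLock = g_ctxLock;   // one lock shared by all decoder instances
        CUvideodecoder dec = nullptr;
        cuvidCreateDecoder(&dec, createInfo);
        return dec;
    }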

Pretty similar to the approach I mentioned above, right? :) So as you can see, some locking is added even in the NVIDIA SDK samples, regardless of the tiny description in cuviddec.h.

Yep, it would be cool to get some expert insight from the manufacturer.

Hi,
thanks for your comments.

Almost ;), but there are two differences here:

  1. They create just one CUvideoctxlock per decoder and never use it…
  2. They explicitly serialize ALL decoding work inside one context by using a mutex. In contrast, your approach may still allow some (desired) parallel processing (a rough sketch of the difference is below).
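
Just to illustrate the difference in granularity (placeholder types, not SDK code):

    #include <mutex>

    // Placeholder types, not SDK types.
    struct Packet  { /* compressed bitstream data */ };
    struct Decoder { void Decode(const Packet&) { /* feed the cuvid parser/decoder */ } };

    // (a) Sample-style: one mutex serializes the whole decode step of every
    //     decoder, so decoders sharing one context never run concurrently.
    std::mutex g_decodeMutex;

    void DecodePacketSerialized(Decoder& dec, const Packet& pkt)
    {
        std::lock_guard<std::mutex> guard(g_decodeMutex);
        dec.Decode(pkt);
    }

    // (b) The approach described in this thread: decode calls from different
    //     threads run concurrently, and only the cuMemcpy2D of each mapped
    //     frame is taken under the cuvid context lock (CUvideoctxlock).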