CUDA multi-threaded programming

Dr_Asik · February 11, 2013, 9:00pm

Hello,

I am using NVCUVID to decode multiple video streams in parallel. My code works fine for a single stream, but I get run-time errors when trying to have more than one decoder in parallel, and I believe this is due to incorrect context management.

The most relevant comment I could find about this on the internet was from the designer of the API himself, in an online discussion, where he says:

[i]You have to start everything [the decoder and parser] in the same thread that pushes the NALUs.

(…) A restriction of cuda is that the context is always associated with a single thread (though there are ways to get around that).

Because we use D3D for decode that restriction doesn’t apply to cuvidDecodePicture, but it does apply to every other cuvidXXX function. There are two ways to solve this:

1. Create the CUDA context (cuD3D9CtxCreate) in the same thread that creates/destroys the decoder
2. Use floating contexts (a bit more complicated, but you never have to worry about threading):

After calling cuD3D9CtxCreate, do the following:

CUcontext myContext;
CUvideoctxlock myLock;

cuCtxPopCurrent(&myContext);
cuvidCtxLockCreate(&myLock, myContext);

When creating the decoder, set CUVIDDECODECREATEINFO.vidLock = myLock. Then, whenever you want to make any cuda calls (such as cuMemcpyDtoH), do this:

cuvidCtxLock(myLock, 0);   // the lock handle is passed by value; the second argument is reserved
… // cuMemcpy, cuvidMap/Unmap etc…
cuvidCtxUnlock(myLock, 0);

This will attach the context to the current thread, and automatically synchronize multiple threads that are competing for the same CUDA context.

(…)

I had some heated arguments with the CUDA designers about this, because I thought it was a ridiculous restriction in this day and age, but their answer was that it’s similar to the way OpenGL works blah blah blah (what’s worse was that there was no way to synchronize access to the context).

To get around it, I added the cuvidCtxLock objects so that multiple clients can have a common way to synchronize.[/i]
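The behaviour the quote attributes to cuvidCtxLock/cuvidCtxUnlock (attach the context to the calling thread and serialize competing threads) can be illustrated with a small model. This is only a sketch in standard C++ that makes no CUDA calls: `FloatingContext`, `ScopedContext`, `g_current` and `run_shared_context_demo` are hypothetical names invented for illustration, and a `std::mutex` stands in for the CUvideoctxlock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a CUDA context plus its CUvideoctxlock.
// No CUDA calls are made; this only models the locking behaviour.
struct FloatingContext {
    std::mutex lock;        // plays the role of the CUvideoctxlock
    long device_state = 0;  // stands in for state owned by the context
};

// The context "attached" to the calling thread, or nullptr if none.
thread_local FloatingContext* g_current = nullptr;

// RAII guard: the constructor models cuvidCtxLock (take the lock, attach the
// context), the destructor models cuvidCtxUnlock (detach, release the lock).
class ScopedContext {
public:
    explicit ScopedContext(FloatingContext& ctx) : ctx_(ctx), held_(ctx.lock) {
        g_current = &ctx_;
    }
    ~ScopedContext() { g_current = nullptr; }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;

private:
    FloatingContext& ctx_;
    std::lock_guard<std::mutex> held_;
};

// Each of `threads` workers performs `iterations` guarded updates and the
// function returns the final value of the shared state.
long run_shared_context_demo(int threads, int iterations) {
    FloatingContext ctx;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&ctx, iterations] {
            for (int i = 0; i < iterations; ++i) {
                ScopedContext attached(ctx);  // lock + attach
                assert(g_current == &ctx);
                ++g_current->device_state;    // the "CUDA call"
            }                                 // detach + unlock
        });
    }
    for (auto& th : pool) th.join();
    return ctx.device_state;
}
```

Calling `run_shared_context_demo(4, 1000)` runs four threads against one shared context. Only the guarded region is serialized in this model; any work a thread does outside the guard still runs in parallel, so keeping the guarded region short is what preserves concurrency.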

The rest of the dialogue seems to suggest that approach #2 is generally preferable. If I understand it correctly, it means a single context and lock are shared among all threads, and the lock must be acquired before any call to CUDA. Therefore, all threads are effectively serialized, i.e. only one can work at a time! Doesn’t this nullify any performance benefits to multi-threading?

I could also go for approach #1 but I’m sure it entails properly pushing/popping the context and I don’t understand how that works. If a single context is shared between multiple threads, and both threads try to execute the following:

push context
call CUDA
pop context

The following can happen:

thread 1 : push context #1
thread 2 : push context #2
thread 1 : call cuda
thread 1 : pop context
thread 2 : call cuda
thread 2 : pop context

… and I’m not sure what’s supposed to happen in this case. All I know is I’ve gotten ERROR_INVALID_RESOURCE_HANDLE on properly registered resources, which apparently means the resource is being used in a context different than the one it’s been created on.
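As far as I understand the driver API, the stack that cuCtxPushCurrent and cuCtxPopCurrent operate on belongs to the calling CPU thread, so a push on thread 2 does not change what thread 1 sees as current. The sketch below models only that per-thread behaviour in standard C++, with no CUDA calls. The integer ids and the names `push_context`, `pop_context`, `current_context` and `run_interleaving_demo` are hypothetical, and the model says nothing about why the error above occurred.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical model: an int id stands in for a CUcontext handle.
// Every thread gets its own stack, mirroring the per-thread stack assumed above.
thread_local std::vector<int> g_context_stack;

// Models cuCtxPushCurrent: affects the calling thread only.
void push_context(int id) { g_context_stack.push_back(id); }

// Returns the context on top of the calling thread's stack, or -1 if none.
int current_context() {
    return g_context_stack.empty() ? -1 : g_context_stack.back();
}

// Models cuCtxPopCurrent: removes and returns the calling thread's top entry.
int pop_context() {
    assert(!g_context_stack.empty());
    int id = g_context_stack.back();
    g_context_stack.pop_back();
    return id;
}

// Pushes `id`, "calls CUDA" `calls` times, then pops. Returns true if the
// thread saw its own context as current throughout.
bool worker_sees_own_context(int id, int calls) {
    push_context(id);
    bool ok = true;
    for (int i = 0; i < calls; ++i) {
        ok = ok && (current_context() == id);
        std::this_thread::yield();  // encourage interleaving with the other thread
    }
    ok = ok && (pop_context() == id) && (current_context() == -1);
    return ok;
}

// Runs two threads that push different contexts at the same time.
bool run_interleaving_demo() {
    bool ok1 = false, ok2 = false;
    std::thread t1([&ok1] { ok1 = worker_sees_own_context(1, 1000); });
    std::thread t2([&ok2] { ok2 = worker_sees_own_context(2, 1000); });
    t1.join();
    t2.join();
    return ok1 && ok2;
}
```

In this model the interleaving listed above is harmless, because neither thread ever observes the other's context. It does not cover the case where a resource registered under one context is used while a different context is current.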

Also I’ve read that creating multiple contexts incurs overhead and isn’t the recommended approach since CUDA 4.

So, which approach to go for? If #1, how to correctly manage several threads that all have their own contexts and avoid the scenario described above? If #2, how to get a satisfactory level of parallelism?

Thanks for your guidance.

benoit.lagadec · October 21, 2017, 2:47pm

Hi,
I have exactly the same problem.

It works fine in a single thread, but when I increase the number of decoding threads, errors appear…

How did you solve the problem?
Do you share a single context among all the threads that call CUDA, or do you use one context per decoding thread?

The NVIDIA documentation is not very clear on how to use the popContext function. In one place, it says that NULL must be passed to the function… In another, it says that a context must be passed to it.

Could you please give some advice on how to use contexts?

Thanks.

Ben

electrodynamics · October 24, 2017, 7:54pm

Guys, you need one thread per context.

If you want to keep the same context (generally using push/pop), then you will want your single thread to implement a loop with a controlling event.

I can give you sample code if needed.
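One possible reading of "a single thread with a loop and a controlling event" is a dedicated worker that owns the context and executes jobs handed to it by other threads. The following is a sketch of that pattern in standard C++ only, not the sample code offered above. `ContextThread` and `run_context_thread_demo` are hypothetical names, a condition variable plays the role of the controlling event, and the place where a real program would create or push its CUDA context is marked with a comment.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one thread owns the context and runs every job.
class ContextThread {
public:
    ContextThread() : worker_([this] { loop(); }) {}

    ~ContextThread() {
        {
            std::lock_guard<std::mutex> g(m_);
            stop_ = true;
        }
        wake_.notify_one();
        worker_.join();  // returns once the queue has been drained
    }

    // Called from any thread: queue a job and signal the worker.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> g(m_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

    std::thread::id id() const { return worker_.get_id(); }

private:
    void loop() {
        // A real version would create or push its CUDA context here.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                wake_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and nothing left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs outside the queue lock, always on this thread
        }
    }

    std::mutex m_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Submits `n` jobs from the calling thread; returns how many ran on the worker.
int run_context_thread_demo(int n) {
    int ran_on_worker = 0;  // only touched by the worker until it is joined
    {
        ContextThread ctx_thread;
        const std::thread::id worker = ctx_thread.id();
        for (int i = 0; i < n; ++i) {
            ctx_thread.submit([&ran_on_worker, worker] {
                if (std::this_thread::get_id() == worker) ++ran_on_worker;
            });
        }
    }  // destructor drains the queue and joins the worker
    return ran_on_worker;
}
```

With this design the context never has to move between threads, at the cost of funnelling every context-bound call through one queue.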

BTW, the new link for the cited discussion is:

http://rationalqm.us/dgdecnv/cuda/cuda.html

657564573 · December 22, 2023, 10:27am

With one thread per context, can common resources be shared between threads?
