CUDA multi-threaded programming

Hello,

I am using NVCUVID to decode multiple video streams in parallel. My code works fine for a single stream, but I get run-time errors when trying to have more than one decoder in parallel, and I believe this is due to incorrect context management.

The most relevant comment I could find about this on the internet was from the designer of the API himself on http://neuron2.net/dgdecnv/cuda/cuda.html, where he says:

[i]You have to start everything [the decoder and parser] in the same thread that pushes the NALUs.

(…) A restriction of cuda is that the context is always associated with a single thread (though there are ways to get around that).

Because we use D3D for decode that restriction doesn’t apply to cuvidDecodePicture, but it does apply to every other cuvidXXX function. There are two ways to solve this:

  1. Create the CUDA context (cuD3D9CtxCreate) in the same thread that creates/destroys the decoder

  2. Use floating contexts (a bit more complicated, but you never have to worry about threading):

After calling cuD3D9CtxCreate, do the following:

CUcontext myContext;
CUvideoctxlock myLock;

cuCtxPopCurrent(&myContext);
cuvidCtxLockCreate(&myLock, myContext);

When creating the decoder, set CUVIDDECODECREATEINFO.vidLock = myLock. Then, whenever you want to make any cuda calls (such as cuMemcpyDtoH), do this:

cuvidCtxLock(myLock, 0);   // the lock handle is passed by value; the second argument is reserved
… // cuMemcpy, cuvidMap/Unmap etc…
cuvidCtxUnlock(myLock, 0);

This will attach the context to the current thread, and automatically synchronize multiple threads that are competing for the same CUDA context.

(…)

I had some heated arguments with the CUDA designers about this, because I thought it was a ridiculous restriction in this day and age, but their answer was that it’s similar to the way OpenGL works blah blah blah (what’s worse was that there was no way to synchronize access to the context).

To get around it, I added the cuvidCtxLock objects so that multiple clients can have a common way to synchronize.[/i]

The rest of the dialogue seems to suggest that approach #2 is generally preferable. If I understand it correctly, it means a single context and lock are shared among all threads, and the lock must be acquired before any call to CUDA. Therefore, all threads are effectively serialized, i.e. only one can work at a time! Doesn’t this nullify any performance benefit of multi-threading?
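One way to limit that serialization, sketched below, is to hold the lock only around the CUDA calls themselves and keep the CPU-side work (demuxing, parsing) outside it. This is a model, not NVCUVID code: a std::mutex stands in for the shared CUvideoctxlock, and decodeStream, runStreams and g_gpuCalls are hypothetical names used for illustration. Whether this recovers enough parallelism depends on how much per-frame time is spent outside CUDA calls, which the thread does not say.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the shared CUvideoctxlock (assumption: a plain mutex is
// used so that the sketch compiles without the CUDA headers).
static std::mutex g_ctxLock;
static int g_gpuCalls = 0;  // shared state, guarded by g_ctxLock

// One decoding thread. CPU-side work runs unlocked; only the short
// section that would contain CUDA calls is serialized.
void decodeStream(int frames, int& parsed) {
    for (int i = 0; i < frames; ++i) {
        ++parsed;  // placeholder for demux/parse work, runs in parallel
        {
            std::lock_guard<std::mutex> guard(g_ctxLock);
            ++g_gpuCalls;  // placeholder for the map/copy/unmap calls
        }
    }
}

// Runs `streams` decoding threads and returns the number of locked sections executed.
int runStreams(int streams, int frames) {
    assert(streams > 0 && frames >= 0);
    g_gpuCalls = 0;
    std::vector<int> parsed(streams, 0);  // one counter per thread
    std::vector<std::thread> threads;
    for (int s = 0; s < streams; ++s)
        threads.emplace_back([&parsed, s, frames] { decodeStream(frames, parsed[s]); });
    for (auto& t : threads) t.join();
    return g_gpuCalls;
}
```

Calling runStreams(4, 100) runs four threads that contend for the lock only during the placeholder GPU section.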

I could also go for approach #1, but I’m sure it entails properly pushing/popping the context and I don’t understand how that works. If a single context is shared between two threads, and both try to execute the following:

push context
call CUDA
pop context

The following can happen:

thread 1 : push context #1
thread 2 : push context #2
thread 1 : call cuda
thread 1 : pop context
thread 2 : call cuda
thread 2 : pop context

… and I’m not sure what’s supposed to happen in this case. All I know is I’ve gotten ERROR_INVALID_RESOURCE_HANDLE on properly registered resources, which apparently means the resource is being used in a context different from the one it was created in.
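For what it is worth, the driver API documentation describes cuCtxPushCurrent and cuCtxPopCurrent as operating on the calling CPU thread's own stack of current contexts, so a push in thread 2 should not change what is current in thread 1. The toy model below illustrates that behaviour only: an int stands in for a CUcontext, and pushCtx, popCtx and currentCtx are hypothetical helpers, not CUDA functions. If the model matches the real behaviour, the interleaving above is harmless in itself, and the error would more likely come from a thread calling CUDA while a different context (or none) is current on its own stack.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy model: each thread has its own context stack (thread_local).
// An int stands in for a CUcontext handle.
thread_local std::vector<int> t_ctxStack;

void pushCtx(int ctx) { t_ctxStack.push_back(ctx); }
int currentCtx() { return t_ctxStack.empty() ? -1 : t_ctxStack.back(); }
void popCtx() { if (!t_ctxStack.empty()) t_ctxStack.pop_back(); }

// push context / "call CUDA" / pop context, as in the post.
// Returns the context that was current during the call.
int pushCallPop(int ctx) {
    pushCtx(ctx);
    int seen = currentCtx();  // placeholder for the CUDA call
    popCtx();
    return seen;
}

// Runs two threads concurrently. Whatever the interleaving, each
// thread only ever observes the context it pushed itself.
bool threadsSeeOwnContext() {
    int seen1 = 0, seen2 = 0;
    std::thread t1([&seen1] { seen1 = pushCallPop(1); });
    std::thread t2([&seen2] { seen2 = pushCallPop(2); });
    t1.join();
    t2.join();
    return seen1 == 1 && seen2 == 2;
}
```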

Also, I’ve read that creating multiple contexts incurs overhead and has not been the recommended approach since CUDA 4.

So, which approach should I go for? If #1, how do I correctly manage several threads that each have their own context and avoid the scenario described above? If #2, how do I get a satisfactory level of parallelism?

Thanks for your guidance.

Hi,
I have exactly the same problem.

In a single thread it works fine, but when I increase the number of decoding threads, errors appear…

How did you solve the problem?
Do you share a single context among all of the threads that call CUDA, or do you use one context per decoding thread?

The NVIDIA documentation is not very clear about how to use the popContext function. On the one hand, we can read that NULL must be passed to the function when popping a context; on the other hand, the documentation says a context must be passed to it.

Could you please give some advice on how to use contexts?

Thanks.

Ben

Guys, you need one thread per context.

If you want to keep the same context (generally using push/pop), then you will want your single thread to implement a loop with a controlling event.
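A minimal sketch of such a loop, under stated assumptions: one worker thread would own the context, other threads hand it jobs through a queue, and a condition variable plays the role of the controlling event. ContextThread, submit and runJobs are hypothetical names, and the CUDA calls are left as comments so the sketch compiles on its own; it is not the sample code offered above.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a single thread owns the context and runs every job.
class ContextThread {
public:
    ContextThread() : worker_([this] { run(); }) {}
    ~ContextThread() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();  // remaining jobs are drained before the thread exits
    }
    // Callable from any thread: queue a job and signal the owner thread.
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        // A real version would create the context and decoder here.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested, queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs on the owner thread, outside the queue lock
        }
        // A real version would destroy the decoder and context on exit.
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

// Usage: several producer threads submit jobs; returns how many ran.
int runJobs(int producers, int jobsEach) {
    int executed = 0;  // modified only by the owner thread
    {
        ContextThread ctx;
        std::vector<std::thread> ps;
        for (int p = 0; p < producers; ++p)
            ps.emplace_back([&ctx, &executed, jobsEach] {
                for (int j = 0; j < jobsEach; ++j)
                    ctx.submit([&executed] { ++executed; });
            });
        for (auto& t : ps) t.join();
    }  // the destructor joins the owner thread here
    return executed;
}
```

Because every job runs on the owner thread, no push/pop is needed in the producers; the trade-off is that all context work is serialized through one queue.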

I can give you sample code if needed.

BTW, the new link for the cited discussion is:

http://rationalqm.us/dgdecnv/cuda/cuda.html