Using CUDA/CUDA contexts simultaneously from multiple CPU threads

Hi,

We have a system where the same algorithm (which has time-consuming CUDA parts in it - the optical flow between two images is calculated) is run by two independent CPU threads A and B simultaneously. We test it on a system with a multi-core CPU and one GeForce GTX 285, using the CUDA Runtime API.

Note that each CPU thread has its own ‘private/hidden’ Cuda Context.

There are - to my knowledge - two options for how to do this:

[list=1]

[*] Create a 'GPUWorker' CPU thread G which handles all the 'CUDA computation requests' of the CPU threads A and B and executes them (described in http://forums.nvidia.com/lofiversion/index.php?t66598.html)

[*] Each of the CPU threads A and B executes its CUDA calls on its own 'private' CUDA context (a minimal sketch of this is shown right after this list)

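Here is a minimal sketch of option (2), written against the runtime API; the opticalFlowKernel, the buffer sizes and the thread bodies are placeholders for illustration, not the actual algorithm. Each host thread simply issues its own runtime API calls (on the CUDA versions from around the time of this thread each host thread thereby gets its own context; from CUDA 4.0 on the runtime shares one context per device across threads).

[code]
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder for the real optical-flow computation.
__global__ void opticalFlowKernel(const float* a, const float* b, float* flow, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flow[i] = b[i] - a[i];
}

// Body of CPU threads A and B: all CUDA calls are issued from this thread.
void workerThread(const std::vector<float>& imgA, const std::vector<float>& imgB)
{
    int n = (int)imgA.size();
    float *dA, *dB, *dFlow;
    cudaMalloc((void**)&dA,    n * sizeof(float));
    cudaMalloc((void**)&dB,    n * sizeof(float));
    cudaMalloc((void**)&dFlow, n * sizeof(float));
    cudaMemcpy(dA, imgA.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, imgB.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    opticalFlowKernel<<<(n + 255) / 256, 256>>>(dA, dB, dFlow, n);
    cudaDeviceSynchronize();
    // (copying the flow field back to the host is omitted in this sketch)
    cudaFree(dA); cudaFree(dB); cudaFree(dFlow);
}

int main()
{
    std::vector<float> a(1 << 20, 0.f), b(1 << 20, 1.f);
    std::thread tA(workerThread, std::cref(a), std::cref(b));  // CPU thread A
    std::thread tB(workerThread, std::cref(a), std::cref(b));  // CPU thread B
    tA.join();
    tB.join();
    return 0;
}
[/code]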
Which is the better option for the given task?

Regarding possibility (1), a disadvantage might be that by using this 'GPUWorker' thread G which handles all requests, we introduce an artificial 'serialization' between the two independent CPU threads A and B. E.g. if thread A does a (synchronous) 'cudaMemcpy' from host to device and has to wait for the transfer to finish, thread B cannot do anything on the GPU during this time.

An advantage of possibility (1) is that all GPU resources are seen by both A and B.

Another related question: when I have created multiple threads in one process, where each thread has its own CUDA context, can each CUDA context allocate the whole GPU device memory, or must all CUDA contexts fit into the GPU memory together (i.e. are the CUDA contexts 'swapped out' on a CPU thread switch or not)?

I also read about CUDA 'streams' (but in fact didn't understand them) - are they an alternative to the two options described above?

I am having the same issues as above, but I am taking the approach of using separate CUDA contexts. The only trouble is that I would end up with close to 1000 contexts this way, as opposed to one if I had decided to share the same context for all the threads. I can't afford to serialize operations in my case. I have around 4-5 threads that would operate on these 1000 contexts. If I used only one context, each thread would have to wait for its turn to make this single context current, which is something I can't do. So I am planning on redesigning my code to use a dispatcher, which will assign a thread to work with a particular context; each assigned thread would just make that context current, use it, and then pop it back (a rough sketch of that step is below). I would like to know how others approach this issue - it might help me think of a new way to do this. Didn't mean to hijack your thread (just pointing out my approach to the issue :) )
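A rough sketch of that "make current, use, pop" step with the driver API (the useContext function and where the context handle comes from are just illustrative - in my case the dispatcher would hand it over):

[code]
#include <cuda.h>

// Executed by a worker thread on whichever context the dispatcher assigned to it.
void useContext(CUcontext ctx)
{
    cuCtxPushCurrent(ctx);       // bind the context to the calling thread

    // ... driver-API allocations, copies and kernel launches issued by this
    //     thread now happen in 'ctx' ...

    CUcontext popped;
    cuCtxPopCurrent(&popped);    // detach it so another thread can use it later
}
[/code]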

Context swapping isn't a cheap operation, so I think you want to avoid it at all costs. At least on Linux, multiple contexts compete for GPU resources on a first come, first served basis. This includes memory (there is no concept of swapping or paging). WDDM versions of Windows might work differently because there is an OS-level GPU memory manager in play, but I don't have any experience with it.
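A quick way to see the memory side of this is to check cudaMemGetInfo from two different threads - a rough sketch (sizes and thread layout made up for the example); the second thread simply sees whatever the first one has not already taken:

[code]
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

void reportFree(const char* who)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);   // free/total device memory as seen from this thread
    std::printf("%s: %zu MB free of %zu MB\n", who, freeB >> 20, totalB >> 20);
}

int main()
{
    void* big = 0;
    cudaMalloc(&big, 512u << 20);      // first thread/context grabs 512 MB
    reportFree("thread 1 after alloc");

    // A second thread (its own context on the CUDA versions discussed here) sees
    // roughly 512 MB less free memory - allocations are never swapped out.
    std::thread other([] { reportFree("thread 2"); });
    other.join();

    cudaFree(big);
    return 0;
}
[/code]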

If you have a single GPU, I think you would do better running a persistent thread to hold the GPU context for the life of the application, and then feed that thread work from producer threads. That offers you the ability to impose your own scheduling logic on the GPU and explicitly control how work is processed. That is probably the GPUWorker model, but I am not very familiar with that code's inner workings.
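Something along these lines is what I mean by a persistent thread fed by producers - a minimal sketch in the spirit of the GPUWorker idea, not its actual code (the GpuWorker class and the job type are made up here):

[code]
#include <cuda_runtime.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class GpuWorker {
public:
    GpuWorker() : stop_(false), worker_([this] { run(); }) {}

    ~GpuWorker() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Producer threads A and B call this to enqueue GPU work.
    void post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }

private:
    void run() {
        cudaSetDevice(0);                      // this thread owns the device for the app's lifetime
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();                             // every CUDA call runs on this one thread
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_;
    std::thread worker_;
};
[/code]

A producer would then just call worker.post([=] { /* cudaMemcpy, kernel launch, ... */ });, and you can put whatever scheduling logic you like in front of the queue.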

Streams are a mechanism for issuing asynchronous commands to a single GPU context so that overlap can occur between CUDA function calls (for example, copying during kernel execution). They don't break the basic 1:1 thread-to-device-context paradigm that CUDA is based around. Kernel execution can't overlap on current hardware (the new Fermi hardware is supposed to eliminate this restriction).
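For illustration, a small sketch of what that looks like in practice - a copy issued in one stream overlapping with a kernel running in another, all inside a single context (the kernel and sizes are placeholders):

[code]
#include <cuda_runtime.h>

__global__ void busyKernel(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 22;
    float *hPinned, *dA, *dB;
    cudaMallocHost((void**)&hPinned, n * sizeof(float));  // async copies need page-locked host memory
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    busyKernel<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);                           // kernel in stream 0
    cudaMemcpyAsync(dB, hPinned, n * sizeof(float), cudaMemcpyHostToDevice, s1);  // copy in stream 1, may overlap

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(hPinned); cudaFree(dA); cudaFree(dB);
    return 0;
}
[/code]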

What is the cost of a 'context switch' between different CUDA contexts on one GPU (e.g. on Windows XP)? Can it be estimated roughly?

On Windows, I have no idea of the cost. The way to measure it is pretty straightforward: use a simple app with two threads - both start by establishing contexts. The first then allocates some memory, runs a little kernel, then hits a barrier and waits. The second initially waits for the other thread to reach the barrier and then performs the same sequence of operations. The difference between the execution times of the two sets of calls should be the context switch time.
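Roughly what that measurement could look like (a sketch only - the kernel is trivial and the timing is host wall-clock; on runtime versions where all threads share one context per device you would have to create the two contexts explicitly with the driver API for the switch to show up):

[code]
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

__global__ void tinyKernel(float* d) { d[threadIdx.x] += 1.0f; }

// Allocate, launch a little kernel, synchronize - and time the whole sequence.
double timedWork()
{
    auto t0 = std::chrono::high_resolution_clock::now();
    float* d;
    cudaMalloc((void**)&d, 256 * sizeof(float));
    tinyKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::promise<void> firstDone;
    std::shared_future<void> barrier = firstDone.get_future().share();
    double tFirst = 0.0, tSecond = 0.0;

    std::thread a([&] {
        cudaFree(0);             // force context creation before timing
        tFirst = timedWork();
        firstDone.set_value();   // let the second thread go
    });
    std::thread b([&] {
        cudaFree(0);
        barrier.wait();          // wait until the first thread has finished its calls
        tSecond = timedWork();
    });
    a.join();
    b.join();

    std::printf("first: %.3f ms, second: %.3f ms -> difference ~ context switch cost\n",
                tFirst, tSecond);
    return 0;
}
[/code]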