We have a system where two independent CPU threads A and B run the same algorithm simultaneously; the algorithm contains time-consuming CUDA parts (the optical flow between two images is calculated). We test it on a system with a multi-core CPU and one GeForce GTX 285, using the CUDA Runtime API.
Note that each CPU thread has its own ‘private/hidden’ CUDA context.
There are - to my knowledge - two options for doing this:

(1) Create a ‘GPUWorker’ CPU thread G which handles all the ‘CUDA computation requests’ of the CPU threads A and B and executes them (described in http://forums.nvidia.com/lofiversion/index.php?t66598.html); a sketch of the idea follows below.

(2) Each of the CPU threads A and B executes its CUDA code on its own ‘private’ CUDA context.
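For illustration, here is a minimal sketch of how I picture option (1) - this is not the code from the linked thread, just my own reduction of the idea to a worker thread G draining a task queue (the real ‘GPUWorker’ additionally returns futures so callers can wait for results); all class and function names here are my own:

```cuda
#include <cuda_runtime.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One thread (G) owns the CUDA context; A and B only enqueue work.
class GPUWorker {
public:
    GPUWorker() : done_(false), worker_([this] { run(); }) {}
    ~GPUWorker() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // drains the remaining tasks, then exits
    }
    // Called from A or B: the task runs later on thread G,
    // i.e. inside G's CUDA context.
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached when done_
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. cudaMemcpy, kernel launch, ...
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_;
    std::thread worker_;  // declared last: started after the members it uses
};

int main() {
    float* dPtr = nullptr;
    GPUWorker worker;
    // Requests from any CPU thread execute in G's single context,
    // so dPtr is usable by A's and B's requests alike.
    worker.submit([&] { cudaMalloc((void**)&dPtr, 1024 * sizeof(float)); });
    worker.submit([&] { cudaFree(dPtr); });
    return 0;  // ~GPUWorker drains the queue and joins G
}
```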
Which is the better option for the given task?
Regarding possibility (1), a disadvantage might be that by using this ‘GPUWorker’ thread G which handles all requests, we introduce an artificial ‘serialization’ between the two independent CPU threads A and B. E.g. if thread A does a (synchronous) ‘cudaMemcpy’ from host to device and has to wait for the transfer to finish, thread B cannot do anything on the GPU during this time.
An advantage of possibility (1) is that all GPU resources are seen by both A and B, since everything runs in the single CUDA context of thread G (e.g. a device pointer allocated for A is also valid in a request from B).
Another related question: when I create multiple threads in one process, where each thread has its own CUDA context, is it possible for each CUDA context to allocate the whole GPU device memory, or must all CUDA contexts fit into the GPU memory together (i.e. are the CUDA contexts ‘swapped out’ on a CPU thread switch or not)?
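To make the question concrete, here is a small test I could run, assuming the Runtime API behavior described above where each thread's first CUDA call implicitly attaches its own context; the ~900 MB figure is just my placeholder for ‘most of the GTX 285's memory’:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Each thread's first runtime call implicitly creates/attaches its own
// context; we then see whether both large allocations can succeed at once.
void tryAlloc(int id, size_t bytes) {
    void* p = nullptr;
    cudaError_t err = cudaMalloc(&p, bytes);
    std::printf("thread %d: cudaMalloc(%zu MB) -> %s\n",
                id, bytes >> 20, cudaGetErrorString(err));
    // Hold the allocation so both threads keep their memory simultaneously.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    if (err == cudaSuccess) cudaFree(p);
}

int main() {
    // ~900 MB: a made-up value close to the card's total device memory.
    const size_t bytes = size_t(900) << 20;
    std::thread a(tryAlloc, 1, bytes);
    std::thread b(tryAlloc, 2, bytes);
    a.join();
    b.join();
    return 0;
}
```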
I also read about using CUDA ‘streams’ (but in fact didn’t understand them) - are they an alternative to the two described options? My rough picture of them is sketched below.
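For reference, this is what I gather streams look like from the programming guide - a minimal sketch only, where myKernel, the buffer sizes, and the two-stream split are made up for illustration; note that everything here happens in a single CPU thread and a single context:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, just to have something to launch per stream.
__global__ void myKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *hA, *hB, *dA, *dB;
    // Async copies require page-locked host memory.
    cudaMallocHost((void**)&hA, n * sizeof(float));
    cudaMallocHost((void**)&hB, n * sizeof(float));
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Within a stream, operations run in order; across streams they may
    // overlap (e.g. the copy in s1 with the kernel in s0).
    cudaMemcpyAsync(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    myKernel<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);
    cudaMemcpyAsync(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    myKernel<<<(n + 255) / 256, 256, 0, s1>>>(dB, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(dA); cudaFree(dB);
    cudaFreeHost(hA); cudaFreeHost(hB);
    return 0;
}
```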