Limitations on using a GPU from a multi-threaded program


I’ve developed a multi-threaded program that manages the execution of other programs on one or more GPUs. One thread allocates and moves data, and another thread executes kernels. This works on a Tesla C1060 but not on a Quadro FX 5800. On the Quadro, the device pointers are invalid in the thread executing kernels (the thread that doesn’t create the pointers but receives them). I guess there are limitations on using a GPU from several different threads, but what are they? Why does it work on the Tesla and not on the Quadro? How can I tell whether a GPU can handle multi-threaded management or not?

Thanks for your help.

Best regards.

Are you using CUDA 4 or an older version?

I guess the limitation you are seeing is that device pointers are only valid inside the context that created them.

CUDA 4 makes it very easy to work with several GPUs and threads: any CPU thread can control any GPU, just call cudaSetDevice() before issuing work to that GPU.
So with CUDA 4, your CPU thread managing the memory can switch to the right GPU context at any time with cudaSetDevice(), and your computing thread can do the same.
Before CUDA 4, you’re not allowed to call cudaSetDevice() more than once, so you have to manage the GPU contexts explicitly. It’s a mess, and if you don’t do it right, it only works when you’re lucky.
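To make the CUDA 4 case concrete, here is a rough sketch of two CPU threads sharing one device pointer. It assumes CUDA 4 or later, device 0, and a compiler with std::thread support; the scale kernel and the CHECK macro are made up for the example and are not from your program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int device = 0;   // assumption: work on the first GPU
    const int n = 1024;
    std::vector<float> host(n, 1.0f);
    float *d_data = nullptr;

    // "Memory" thread: selects the device, allocates and uploads.
    std::thread mem([&] {
        CHECK(cudaSetDevice(device));
        CHECK(cudaMalloc(&d_data, n * sizeof(float)));
        CHECK(cudaMemcpy(d_data, host.data(), n * sizeof(float),
                         cudaMemcpyHostToDevice));
    });
    mem.join();  // the pointer is handed over once this thread is done

    // "Compute" thread: selects the same device, then uses the pointer
    // it did not create.
    std::thread compute([&] {
        CHECK(cudaSetDevice(device));
        scale<<<(n + 255) / 256, 256>>>(d_data, n);
        CHECK(cudaGetLastError());
        CHECK(cudaDeviceSynchronize());
    });
    compute.join();

    // Main thread: read the result back and clean up.
    CHECK(cudaSetDevice(device));
    CHECK(cudaMemcpy(host.data(), d_data, n * sizeof(float),
                     cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_data));
    printf("host[0] = %f\n", host[0]);
    return 0;
}
```

The join() calls are only there to keep the hand-over simple. A real program would pass pointers through whatever queue or synchronization it already uses between its threads.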

Now, looking at the big picture of what you’re doing, having separate threads to manage the same GPU (one thread for memory allocation and transfers, one thread for kernel execution) seems like overkill.
With streams and asynchronous copies, you should be able to control everything from a single CPU thread, even with several GPUs, if you use CUDA 4.
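As a rough sketch of that single-thread approach, assuming CUDA 4 or later: one CPU thread loops over the GPUs, queues an asynchronous upload, a kernel and an asynchronous download into one stream per GPU, and only then waits for the results. The scale kernel, the CHECK macro and the buffer size are placeholders I made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));

    const int n = 1 << 20;               // assumption: arbitrary size
    const size_t bytes = n * sizeof(float);
    std::vector<float *> h(deviceCount, nullptr), d(deviceCount, nullptr);
    std::vector<cudaStream_t> streams(deviceCount);

    // Queue work on every GPU without waiting for any of it.
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaStreamCreate(&streams[dev]));
        // Pinned host memory, so the copies can really run asynchronously.
        CHECK(cudaMallocHost(&h[dev], bytes));
        CHECK(cudaMalloc(&d[dev], bytes));
        for (int i = 0; i < n; ++i) h[dev][i] = 1.0f;

        CHECK(cudaMemcpyAsync(d[dev], h[dev], bytes,
                              cudaMemcpyHostToDevice, streams[dev]));
        scale<<<(n + 255) / 256, 256, 0, streams[dev]>>>(d[dev], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h[dev], d[dev], bytes,
                              cudaMemcpyDeviceToHost, streams[dev]));
    }

    // Wait for each GPU in turn, then release its resources.
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaStreamSynchronize(streams[dev]));
        printf("GPU %d: h[0] = %f\n", dev, h[dev][0]);
        CHECK(cudaFree(d[dev]));
        CHECK(cudaFreeHost(h[dev]));
        CHECK(cudaStreamDestroy(streams[dev]));
    }
    return 0;
}
```

Because every call in the first loop returns before the GPU work finishes, the GPUs can run concurrently even though only one CPU thread is driving them.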