One host thread is tied to one GPU context, unless you are using the context switching in the 2.0 beta driver API, which is meant for sharing contexts among libraries and applications.
Each GPU context has its own protected memory space, and device pointers cannot be shared between them. The GPU a context is assigned to is selected with cudaSetDevice(). Once a host thread is associated with a context, it can’t “see” anything on the device outside of its own context, so there is no way for a single host thread to control more than one GPU.
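To make that concrete, the usual workaround is one host thread per GPU, with each thread calling cudaSetDevice() before any other CUDA call. Here is a rough sketch (not my actual code; the gpuWorker name is just made up for illustration), using boost::thread since boost is already in the picture:

#include <cuda_runtime.h>
#include <boost/thread.hpp>

// Each worker thread owns exactly one context; d_data is only valid inside it.
void gpuWorker(int device)
{
    cudaSetDevice(device);  // must come before any other CUDA call in this thread
    float* d_data = 0;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));
    // ... cudaMemcpy, kernel launches, etc. using d_data go here ...
    cudaFree(d_data);
}

int main()
{
    boost::thread t0(gpuWorker, 0);  // one host thread per GPU
    boost::thread t1(gpuWorker, 1);
    t0.join();
    t1.join();
    return 0;
}

Neither thread can touch the other’s allocations, which is exactly the limitation described above.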
That isn’t to say that having a single host thread control multiple GPUs wouldn’t be convenient. I plan to get that effect in my own code using worker threads and function delegates, i.e., once I write the code I will be able to do something like this from one thread:
gpu1->call(bind(cudaSetDevice, 0));
gpu2->call(bind(cudaSetDevice, 1));
gpu1->call(bind(cudaMalloc, &d_gpu1, other args));
gpu2->call(bind(cudaMalloc, &d_gpu2, other args));
gpu1->call(bind(cudaMemcpy, d_gpu1, other args));
gpu2->call(bind(cudaMemcpy, d_gpu2, other args));
gpu1->call(bind(runKernel, d_gpu1, other args));
gpu2->call(bind(runKernel, d_gpu2, other args));
gpu1 and gpu2 are the worker threads. “call” just pushes the function call, bound up by boost::bind, onto a queue; the worker thread pulls the calls off the queue and executes them. It will be a bit heavy on the requirements (C++ host code linked against the boost library), but as you can see the syntax is pretty slick, and any function can be passed into the queue. Another upshot is that every call() will automatically be the equivalent of CUDA_SAFE_CALL, throwing an exception if an error is reported (in debug mode only).
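For anyone curious, here is a rough sketch of what the worker class might look like (untested, names invented, and the CUDA_SAFE_CALL-style error checking described above is left out for brevity):

#include <queue>
#include <boost/thread.hpp>
#include <boost/function.hpp>
#include <boost/bind.hpp>

class GPUWorker
{
public:
    GPUWorker() : m_done(false), m_thread(boost::bind(&GPUWorker::loop, this)) {}

    ~GPUWorker()
    {
        {
            boost::mutex::scoped_lock lock(m_mutex);
            m_done = true;
        }
        m_cond.notify_one();
        m_thread.join();
    }

    // Push any nullary delegate (e.g. boost::bind(cudaSetDevice, 0)) onto the queue.
    // boost::function<void ()> discards the cudaError_t return value; the error
    // checking would wrap f before it is queued.
    void call(const boost::function<void ()>& f)
    {
        boost::mutex::scoped_lock lock(m_mutex);
        m_queue.push(f);
        m_cond.notify_one();
    }

private:
    // Runs on the worker thread, so every queued call executes in that thread's context.
    void loop()
    {
        for (;;)
        {
            boost::function<void ()> f;
            {
                boost::mutex::scoped_lock lock(m_mutex);
                while (m_queue.empty() && !m_done)
                    m_cond.wait(lock);
                if (m_queue.empty())
                    return;  // shutting down and the queue is drained
                f = m_queue.front();
                m_queue.pop();
            }
            f();
        }
    }

    bool m_done;
    std::queue<boost::function<void ()> > m_queue;
    boost::mutex m_mutex;
    boost::condition_variable m_cond;
    boost::thread m_thread;
};

gpu1 and gpu2 from the snippet above would then just be GPUWorker instances (or pointers to them).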
If anyone is interested, the code will be open sourced once I write it.
Yes, there will be some overhead in calling functions with boost::bind and passing them to worker threads. However, 1) my application targets a maximum of thousands of calls per second, which shouldn’t be a problem (… I hope … will test), and 2) if the GPU is kept busy ~100% of the time, much of the cost of queuing up the function delegates will just be overlapped with GPU execution and effectively cost nothing in the end.