Multiple GPU computing

Hi, I have two GPUs in my system.

Can someone please tell me how to run two matrix multiplications,
one on each GPU, at the same time (in parallel)?
Is it possible to run two separate tasks on
the GPUs at the same time?

I know about cudaSetDevice(), but I have no idea how to use both GPUs
at the same time.

Can someone please enlighten me on this issue?

Thank you

A CPU thread can only use one CUDA device, so to use multiple CUDA devices, you need to start multiple CPU threads, one for each CUDA device. Take a look at the simpleMultiGPU and MonteCarloMultiGPU projects in the SDK for examples.

I believe this is not completely true. A GPU context can only live in one CPU thread, but you can control more than one GPU from a single CPU thread (I have not found any document stating otherwise yet). This is not optimal for simulations, where you want the CPU to spin until the GPU finishes its work so that you can immediately start the next job, but for real-time applications that use streams you can, as far as I know, distribute your work across more than one GPU from a single CPU thread. You just poll whether one of the GPUs is ready and give that GPU the next piece of work to perform.

Really? Is this demonstrated in one of the SDK examples?

One thread is tied to one GPU context, as long as you aren’t using the 2.0 beta driver API context switching which is for sharing contexts among libraries and applications.

Each GPU context has its own protected memory space, and device pointers cannot be shared between contexts. The GPU that a context is assigned to is selected with cudaSetDevice(). Once a host thread is associated with a context, it can’t “see” anything on the device outside of its own little context. So, there is no way for a single host thread to control more than one GPU.

That isn’t to say that a single host thread controlling multiple GPUs wouldn’t be convenient. I will be doing this in my own code using worker threads and function delegates. That is, once I write the code I will be able to do something like this in one thread:

gpu1->call(bind(cudaSetDevice, 0));
gpu2->call(bind(cudaSetDevice, 1));
gpu1->call(bind(cudaMalloc, &d_gpu1, other args));
gpu2->call(bind(cudaMalloc, &d_gpu2, other args));
gpu1->call(bind(cudaMemcpy, d_gpu1, other args));
gpu2->call(bind(cudaMemcpy, d_gpu2, other args));
gpu1->call(bind(runKernel, d_gpu1, other args));
gpu2->call(bind(runKernel, d_gpu2, other args));

gpu1 and gpu2 are the worker threads. “call” just pushes the function call tied up by boost::bind onto a queue. The worker thread pulls the calls off the queue and calls them. It will be a bit heavy on the requirements (C++ host code and linked to the boost library), but as you can see the syntax is pretty slick, allowing for any function to be passed into the queue. Another upshot is that all call()'s will automatically be the equivalent of CUDA_SAFE_CALL, throwing an exception if an error is reported (in debug mode only).
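For readers who want to see the queue mechanism concretely, here is a minimal sketch of such a worker thread. It is not the code described in this post: it assumes a C++11 compiler and uses std::thread, std::function and std::bind in place of boost, and the names Worker, sync, appendDigit and demoOrder are hypothetical. The queued function is a plain stand-in so the sketch runs without a GPU; in real use the bound calls would be cudaSetDevice, cudaMalloc and so on.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical worker: owns one host thread and runs queued calls in FIFO order.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}

    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        thread_.join();  // the thread drains any remaining calls, then exits
    }

    // Push a bound call onto the queue and return immediately.
    void call(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(fn));
        }
        ready_.notify_one();
    }

    // Block until every call queued so far has finished.
    void sync() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return queue_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // shutdown requested and nothing left
            std::function<void()> fn = std::move(queue_.front());
            queue_.pop();
            busy_ = true;
            lock.unlock();
            fn();  // run the call without holding the lock
            lock.lock();
            busy_ = false;
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_, idle_;
    std::queue<std::function<void()>> queue_;
    bool done_ = false;
    bool busy_ = false;
    std::thread thread_;  // declared last so the other members exist before it starts
};

// Stand-in for a real call such as cudaMalloc or a kernel launch (assumption).
void appendDigit(int& value, int digit) { value = value * 10 + digit; }

// Queue three calls on one worker and wait for them; FIFO order yields 123.
int demoOrder() {
    int value = 0;
    Worker gpu1;
    gpu1.call(std::bind(appendDigit, std::ref(value), 1));
    gpu1.call(std::bind(appendDigit, std::ref(value), 2));
    gpu1.call(std::bind(appendDigit, std::ref(value), 3));
    gpu1.sync();
    return value;
}
```

With one Worker per GPU, the first call queued on each would be the cudaSetDevice bind, so that every later call on that worker runs in the same host thread and therefore the same context. This sketch does not catch exceptions thrown by a queued call; the error checking described above would have to be added on top.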

If anyone is interested, the code will be open sourced once I write it.

Yes, there will be some overhead in calling functions with boost::bind and passing them to worker threads. However: 1) my application targets a maximum of thousands of calls per second, which shouldn’t be a problem (… I hope … will test); 2) if the GPU is kept busy ~100% of the time, much of the cost of queuing up the function delegates will just be overlapped with the GPU execution and effectively cost nothing in the end.

Very slick, MisterAnderson. I would be interested in the code once it is finished.

Somehow I had the false impression that you could ‘bind’ streams to a certain GPU. So I was thinking of having 2 streams controlled from 1 thread, where each stream ran on 1 GPU. But reading the manual a bit better, I understand that this is not possible. I think I got this idea when looking at some stream example.

Anyhow, being able to bind a stream to a GPU, and through that mechanism control N GPUs from one thread, would be a very nice feature if possible :D

I’ve done something similar, except that I’m apparently not a good enough C++ programmer to find a good solution for passing an arbitrary list of arguments into threads (I was trying to use va_lists from <stdarg.h>), and thus have a class interface like


float* d_a = gpu1.gpuMalloc(n);

float* d_b = gpu1.gpuMalloc(n);
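One possible way around va_list, sketched here under the assumption of a C++11 compiler, is to let a variadic template bind the arguments and hand back a future for the return value. The names makeTask, bytesFor and demoBytes are hypothetical, and bytesFor is only a stand-in for a call like gpuMalloc so that the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper: bundle fn and its arguments into a void() task that any
// thread can run, plus a future through which the caller receives the result.
template <typename Fn, typename... Args>
auto makeTask(Fn fn, Args... args)
    -> std::pair<std::function<void()>, std::future<decltype(fn(args...))>>
{
    using R = decltype(fn(args...));
    // shared_ptr keeps the move-only packaged_task alive inside a copyable
    // std::function.
    auto task = std::make_shared<std::packaged_task<R()>>(std::bind(fn, args...));
    std::future<R> result = task->get_future();
    return std::make_pair(std::function<void()>([task] { (*task)(); }),
                          std::move(result));
}

// Stand-in for a device allocation (assumption): returns the byte count only.
std::size_t bytesFor(std::size_t n, std::size_t elemSize) { return n * elemSize; }

// Run the packaged call on another thread and wait for its return value.
std::size_t demoBytes() {
    auto packaged = makeTask(bytesFor, 256, sizeof(float));
    std::thread worker(std::move(packaged.first));
    std::size_t bytes = packaged.second.get();  // blocks until the task has run
    worker.join();
    return bytes;
}
```

In a real multi-GPU setup the task would be pushed onto the queue of a long-lived thread that owns the GPU context, rather than run on a fresh std::thread as in demoBytes, since a context is tied to the thread that created it.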


Guess I’m not the only crazy C++ programmer here… Or is it “great minds think alike” ;)

My worker thread class is now complete, details and code links in this forum post