Multi-GPU: MPI or threads? Best choice for my multi-GPU solution?


I’m wondering what the best way to control my multi-GPU setup is. I currently have 3 GPUs which I need to control and compute on in parallel. What are the pros and cons of using MPI vs. pthreads (Linux) or Boost threads (Windows)?

One advantage of MPI that I can see is that, since the API calls look the same on both Windows and Linux, a port between the two should be simple… I try to keep OS dependencies out of the picture whenever feasible…

You can try going with 1 thread but separate contexts (1 for each gpu) if you don’t want to deal with threads.

How would that work? You can only have one active context per thread, right?

You can alter the current context with cuCtxPushCurrent and cuCtxPopCurrent.

cuCtxPushCurrent and cuCtxPopCurrent can be used to maintain more than one context, but does this allow simultaneous execution on the GPUs in a multi-GPU environment?

If you feel so, please give a short code snippet. Thanks in advance!!

With the advent of MPI 2.0, there are considerably fewer reasons at the API level to favor pthreads over MPI than there used to be. The standard producer-consumer model is probably still easier to express with pthreads, but scatter-gather style parallelism is probably easier in MPI. I find Cartesian and Graph communicators to be very useful for the sort of multi-GPU work I do, but I wouldn’t necessarily use MPI for everything.
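As a rough illustration of the scatter-gather style, here is a minimal sketch of how n elements might be split into one chunk per GPU. The names `Partition` and `partition` are made up for this post and are not part of MPI or CUDA. The resulting `counts` and `displs` arrays have the shape that `MPI_Scatterv` takes for its `sendcounts` and `displs` arguments, but no MPI call is made here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of MPI or CUDA): chunk sizes and start
// offsets for splitting n elements across `parts` workers, one per GPU.
struct Partition {
    std::vector<int> counts;  // number of elements assigned to each worker
    std::vector<int> displs;  // start offset of each worker's chunk
};

Partition partition(int n, int parts) {
    assert(n >= 0 && parts > 0);
    Partition p;
    p.counts.resize(parts);
    p.displs.resize(parts);
    const int base = n / parts;   // every worker gets at least this many
    const int extra = n % parts;  // the first `extra` workers get one more
    int offset = 0;
    for (int i = 0; i < parts; ++i) {
        p.counts[i] = base + (i < extra ? 1 : 0);
        p.displs[i] = offset;
        offset += p.counts[i];
    }
    return p;
}
```

With 10 elements and 3 GPUs this gives chunks of 4, 3 and 3 elements starting at offsets 0, 4 and 7.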

I’d expect the code to look something like this:

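// launch multiple simultaneous kernels on several GPUs from one host thread
// (assumes ctx[i] is the context already created for GPU i and that it is
// not current on any thread when this code runs)
for (int i = 0; i < nGPUs; ++i)
{
  cuCtxPushCurrent(ctx[i]);          // make GPU i's context current
  cuLaunchGridAsync(kernel[i],...);  // for example; returns without waiting
  cuCtxPopCurrent(&ctx[i]);          // detach it again
}

// wait for all GPUs to finish
for (int i = 0; i < nGPUs; ++i)
{
  cuCtxPushCurrent(ctx[i]);
  cuCtxSynchronize();                // blocks until GPU i has finished
  cuCtxPopCurrent(&ctx[i]);
}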

Yes, of course you can do a context switch, but that is very expensive and time-consuming. Furthermore, I want to control multiple GPUs in parallel; this doesn’t seem feasible to me, but perhaps you could explain further how this would work?

Expensive in terms of computation? That function doesn’t seem to do anything expensive: some synchronization, some checks and a TLS array access. You can call it hundreds of thousands of times per second. In the code, make sure you call the asynchronous versions of the kernel launch and memory functions (like in the example I gave you in my previous post).

I haven’t tried this approach myself yet, but it would be the first thing I’d try in your situation.

Will give it a try, even though I’m afraid this will invoke a context switch…

I’m more inclined to use MPI and tie one CUDA context to each GPU, as this allows me to run things completely in parallel with several kernel invocation stages.

Thanks for the input!


I am new to multi-GPU systems.

I would be grateful if you could verify my understanding:

1- Multiple GPUs can be used to run the same kernel SIMULTANEOUSLY on different GPUs.

2- Computation speed will thus be doubled on a 2-GPU system compared to a single-GPU system.

3- To get the advantage of two GPUs, we need to create two host threads to control the two GPUs. These two host threads will launch the two kernels meant for the two GPUs.

4- We have to disable SLI mode if we want to use the two GPUs for computation.

5- SLI only benefits gaming applications.


Yes. Or different kernels on different GPUs for that matter.

Roughly yes, IF the computation is independent. In some cases you will run into a PCIe bandwidth bottleneck (if all GPUs start to copy data at the same time, the northbridge has limited bandwidth for the PCIe slots). Naturally, if there’s a data dependence between those kernels, no parallelism between GPUs is possible.
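The bandwidth caveat above can be put into a rough model. This is only a sketch under two assumptions of mine: kernel time divides evenly across the GPUs, and all host-to-device copies share one link and are not overlapped with compute, so total copy time does not shrink. The function name `estimated_speedup` is made up for this post.

```cpp
#include <cassert>

// Rough model, not a measurement. Assumes kernel time divides evenly
// across n_gpus while copies share one PCIe link and are not overlapped
// with compute, so the total transfer time stays the same.
double estimated_speedup(double compute_s, double transfer_s, int n_gpus) {
    assert(n_gpus > 0 && compute_s >= 0.0 && transfer_s >= 0.0);
    const double single = compute_s + transfer_s;  // one GPU does it all
    assert(single > 0.0);
    const double multi = compute_s / n_gpus + transfer_s;
    return single / multi;
}
```

For example, with made-up numbers of 6 s of kernel time and 2 s of copies, this model gives about 1.6x on two GPUs rather than 2x.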

That’s one way to do it, and the most common one. You can create separate processes instead of threads within one process, or you can even use a single thread and shuffle contexts around (at the cost of some code overhead).

I’m not sure if that’s still needed. I think the drivers can now do multi-GPU with SLI enabled, but I’m not sure.
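The one-host-thread-per-GPU layout could look roughly like this. It is only a sketch: `std::thread` is used as a portable stand-in for pthreads or Boost threads, and `run_on_device` and `run_all` are names I made up. The GPU work itself is replaced by a CPU placeholder so that the example builds without the CUDA toolkit; the comment marks where the real device selection and kernel launches would go.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Placeholder for the per-GPU work. In a real program this is where the
// thread would select its device (e.g. cudaSetDevice(device)), copy its
// inputs, launch kernels and copy results back. Here it only doubles its
// slice on the CPU.
void run_on_device(int device, std::vector<float>& data, int begin, int end) {
    (void)device;  // unused in this CPU stand-in
    for (int i = begin; i < end; ++i) data[i] *= 2.0f;
}

// One host thread per GPU, each owning a contiguous, non-overlapping slice.
void run_all(std::vector<float>& data, int n_gpus) {
    assert(n_gpus > 0);
    const long long n = static_cast<long long>(data.size());
    std::vector<std::thread> workers;
    for (int dev = 0; dev < n_gpus; ++dev) {
        const int begin = static_cast<int>(n * dev / n_gpus);
        const int end = static_cast<int>(n * (dev + 1) / n_gpus);
        workers.emplace_back([&data, dev, begin, end] {
            run_on_device(dev, data, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all GPUs to finish
}
```

Because the slices do not overlap, the threads need no locking here; anything shared between GPUs would need explicit synchronization.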

Pretty much yes, it’s of no use in CUDA or OpenCL computing.