Multi-GPU: MPI or threads? What's the best choice for my multi-GPU solution?

Hi!

I’m wondering what the best choice to control my multi-GPU setup is. I currently have 3 GPUs which I need to control and do computing on in parallel. What are the pros and cons of using MPI vs. pthreads (Linux) or Boost threads (Windows)?

One point in favor of MPI that I can see is that since the API calls look the same on both Windows and Linux, a port between the two should be simple… I try to keep OS dependencies out of the picture when it is feasible…

You can try going with one thread but separate contexts (one for each GPU) if you don’t want to deal with threading.

How would that work? You can only have one active context per thread, right?

You can alter the current context with cuCtxPushCurrent, cuCtxPopCurrent.

cuCtxPushCurrent and cuCtxPopCurrent can be used to maintain more than one context, but does this allow simultaneous execution on the GPUs in a multi-GPU environment?

If you think so, please give a short code snippet. Thanks in advance!

With the advent of MPI 2.0, there are considerably fewer reasons at an API level to favor pthreads over MPI than there used to be. The standard producer-consumer model is probably still easier to express with pthreads, but scatter-gather style parallelism is probably easier in MPI. I find Cartesian and Graph communicators to be very useful for the sort of multi-GPU work I do, but I wouldn’t necessarily use MPI for everything.
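For what it’s worth, the way I’d wire MPI to the GPUs is one rank per device. A minimal sketch, assuming the runtime API and one process per GPU (allocation, copies and the kernel launch itself are left out):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nDevices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&nDevices);

    // bind this rank to one GPU; with 3 GPUs you'd start 3 ranks on the node
    cudaSetDevice(rank % nDevices);

    // ... allocate, copy, and launch kernels here exactly as in a single-GPU
    // program; each rank drives its own device, and results are exchanged
    // via MPI_Send/MPI_Recv or collectives through host memory

    MPI_Finalize();
    return 0;
}

Each process then looks like an ordinary single-GPU program, which is what makes the porting story between Windows and Linux simple.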

I’d expect the code to look something like this:

// launch multiple simultaneous kernels on several GPUs
for (int i = 0; i < nGPUs; ++i)
{
    cuCtxPushCurrent(context[i]);       // make GPU i's context current on this thread
    // ... CUDA code for this specific device here
    cuLaunchGridAsync(kernel[i], ...);  // for example; returns immediately
    cuCtxPopCurrent(NULL);              // detach the context again
}

// wait for all GPUs to finish
for (int i = 0; i < nGPUs; ++i)
{
    cuCtxPushCurrent(context[i]);
    cuCtxSynchronize();
    cuCtxPopCurrent(NULL);
}

Yes, of course you can do a context switch, but that is very expensive and time consuming. Furthermore, I want to control multiple GPUs in parallel; this doesn’t seem feasible to me, but perhaps you could explain further how this would work?

Expensive computation-wise? That function doesn’t seem to do anything expensive: some synchronization, some checks and a TLS array access. You can call it hundreds of thousands of times per second. In the code, make sure you call the asynchronous versions of the kernel launch and memory functions (as in the example I gave you in the previous post).
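For instance, the host-to-device copies can be queued the same way. A rough sketch; d_in, h_in and bytes are placeholders, and h_in is assumed to be page-locked so the copy really is asynchronous:

for (int i = 0; i < nGPUs; ++i)
{
    cuCtxPushCurrent(context[i]);
    cuMemcpyHtoDAsync(d_in[i], h_in[i], bytes, 0);   // returns immediately
    cuLaunchGridAsync(kernel[i], gridW, gridH, 0);   // queued behind the copy
    cuCtxPopCurrent(NULL);                           // move on to the next GPU
}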

I haven’t tried this approach myself yet, but it would be the first thing I’d try in your situation.

I’ll give it a try, even though I’m afraid this will invoke a context switch…

I’m more inclined to use MPI and tie one CUDA context to each GPU, as this allows me to run things completely in parallel with several kernel invocation stages.

Thanks for the input!

Hi,

I am new to multi-GPU systems.

I shall be grateful if you guys can verify my understanding:

1- Multi-GPUs can be used to run the same kernel SIMULTANEOUSLY on different GPUs.

2- Computation speed will thus be doubled on a 2-GPU system compared to a single-GPU system.

3- To get the advantage of two GPUs, we need to create two host threads to control the two GPUs. These two host threads will launch the two kernels meant for the two GPUs.

4- We will have to disable SLI mode if we want to utilize the two GPUs for computation.

5- SLI can only benefit gaming applications.

Thanks

Yes. Or different kernels on different GPUs for that matter.

Roughly yes, IF the computation is independent. In some cases you will run into a PCIe bandwidth bottleneck (if all GPUs start copying data at the same time, the north bridge has limited bandwidth for the PCIe slots). Naturally, if there’s a data dependence between those kernels, no parallelism between GPUs is possible.

That’s one way to do it, and the most common one. You can create separate processes instead of threads within a process, or you can even use a single thread and shuffle contexts around (with some code overhead).
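As an illustration of the thread-per-GPU variant, a rough sketch with pthreads and the runtime API might look like this (allocation, copies and the actual kernel launch are left out):

#include <pthread.h>
#include <cuda_runtime.h>

#define MAX_GPUS 8

static void *worker(void *arg)
{
    int device = *(int *)arg;
    cudaSetDevice(device);          // bind this host thread to one GPU
    // ... cudaMalloc, cudaMemcpy, kernel<<<...>>> launches for this device
    cudaDeviceSynchronize();        // wait for this GPU's work to finish
    return NULL;
}

int main(void)
{
    int nGPUs = 0;
    cudaGetDeviceCount(&nGPUs);

    pthread_t threads[MAX_GPUS];
    int ids[MAX_GPUS];
    for (int i = 0; i < nGPUs && i < MAX_GPUS; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < nGPUs && i < MAX_GPUS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}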

I’m not sure if that’s still needed. I think the drivers can now do multi-GPU with SLI enabled, but I’m not certain.

Pretty much, yes; SLI is of no use for CUDA or OpenCL computing.