Using MPI+multi-GPUs with CUDA 4.0

Hi!

I’m currently in the process of creating a multi-GPU application to run on 8 nodes, each fitted with 3 NVIDIA Tesla M2070 GPUs.

My code already uses MPI: each process generates a unique set of data for its own submesh, so data exchanges must be performed
on the interfaces to assemble the total grid of tetrahedra (3D mesh).

Within each submesh, the data should be split and balanced among 3 different GPUs, creating a second level of parallelism.
The idea is to move away from the previous configuration, where each MPI process held a unique GPU context on its own device
(i.e., a single GPU worked on a whole submesh).

In that framework, using the CUDA 4.0 feature of single-thread access to all GPUs, I guess the way to code this is to set up a single
MPI process per node and then use unified virtual addressing to share the memories of my 3 GPUs?
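For reference, here is a minimal sketch of what that single-thread, multi-GPU pattern could look like with the CUDA 4.0 runtime API. The kernel, buffer size, and the assumption of at most 8 devices per node are placeholders, not part of the original code:

```
/* Sketch only: one MPI process per node, one host thread driving all GPUs
   via cudaSetDevice (CUDA 4.0 runtime API). "myKernel" and the buffer
   size are placeholders for the real submesh computation. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nDev = 0;
    cudaGetDeviceCount(&nDev);              /* expect 3 per node here */

    const int n = 1 << 20;
    float *d_buf[8];
    cudaStream_t stream[8];

    /* One context per device, all owned by this single host thread */
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&stream[d]);
        cudaMalloc(&d_buf[d], n * sizeof(float));
    }

    /* Launch work on every device; the calls are asynchronous, so the
       devices run concurrently even though one thread issues them */
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        myKernel<<<(n + 255) / 256, 256, 0, stream[d]>>>(d_buf[d], n);
    }

    /* Wait for all devices before the MPI halo exchange */
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        cudaFree(d_buf[d]);
        cudaStreamDestroy(stream[d]);
    }

    MPI_Finalize();
    return 0;
}
```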

On the other hand, considering that CUDA 4.0 also makes it possible to share GPUs across multiple threads, would it be possible to use
sub-communicators for each node so that concurrent kernels can be launched from different host threads?
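The per-node grouping itself would come from MPI (e.g. MPI_Comm_split); the GPU side of that alternative could look like the hedged sketch below, with one OpenMP host thread per GPU inside each node-level MPI process. The kernel and thread count are assumptions; since CUDA 4.0 a device context created by the runtime can be used from any host thread:

```
/* Sketch only: one MPI process per node, one OpenMP host thread per GPU,
   each thread launching its own kernels concurrently (CUDA 4.0 runtime). */
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    const int n = 1 << 20;

    #pragma omp parallel num_threads(nDev)
    {
        int d = omp_get_thread_num();
        cudaSetDevice(d);                   /* bind this thread to GPU d */

        float *d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));
        myKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFree(d_buf);
    }

    /* Only the master thread calls MPI here (MPI_THREAD_FUNNELED) */
    MPI_Finalize();
    return 0;
}
```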

Finally, since the driver API seemed to me the most appropriate candidate so far (before CUDA 4.0) for handling multiple devices
from a single host thread, I would like your advice on APIs: with CUDA 4.0, can I just use the runtime API, or should I stick with the driver API?

Thank you for your time.

My suggestion is to start with an initial implementation that assigns a single GPU to each MPI task and uses the runtime API.
If the GPUs in the nodes are capable of P2P (i.e. they belong to the same PCIe root complex), you can then write an optimized version.
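To check whether the GPUs in a node can actually reach each other with P2P, something along these lines should do (standard CUDA 4.0 runtime queries; error handling omitted):

```
/* Sketch only: query and enable peer-to-peer access between GPU pairs
   that sit on the same PCIe root complex (CUDA 4.0 runtime API). */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    for (int i = 0; i < nDev; ++i) {
        for (int j = 0; j < nDev; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n",
                   i, j, canAccess ? "supported" : "not supported");
            if (canAccess) {
                cudaSetDevice(i);
                cudaDeviceEnablePeerAccess(j, 0);  /* flags must be 0 */
            }
        }
    }
    return 0;
}
```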

Thanks for your reply!

My code already supports assigning a single GPU to each MPI process using “int dev = mpi_rank % 2”, and I’m using the runtime API to make it work.
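As a side note, taking the modulus over the device count reported at runtime is a bit more robust than a hard-coded % 2 when the nodes carry 3 GPUs. A hedged sketch (it assumes MPI ranks are placed node by node, which is an assumption about the launcher, not something from the original code):

```
/* Sketch only: assign one GPU per MPI rank using the device count
   reported at runtime instead of a hard-coded modulus. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDev = 0;
    cudaGetDeviceCount(&nDev);          /* 3 on these nodes */

    int dev = rank % nDev;              /* assumes block placement of ranks */
    cudaSetDevice(dev);
    printf("MPI rank %d -> GPU %d\n", rank, dev);

    MPI_Finalize();
    return 0;
}
```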

Since the GPUs in the nodes are capable of P2P and I’d also like to exploit unified virtual addressing within each node, do you think it would be feasible to keep using the runtime API, given that all nodes share and consume resources in an equipotent manner?
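For what it’s worth, once peer access is enabled, UVA lets the runtime work out which device a pointer belongs to, so an inter-GPU copy can be a plain cudaMemcpy with cudaMemcpyDefault. A small sketch under those assumptions (buffer size is a placeholder; requires Fermi GPUs, a 64-bit process, and CUDA 4.0):

```
/* Sketch only: with UVA the runtime infers the source and destination
   devices from the pointers themselves (CUDA 4.0 runtime API). */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = (1 << 20) * sizeof(float);
    float *d_src, *d_dst;

    cudaSetDevice(0);
    cudaMalloc(&d_src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);   /* GPU 0 may access GPU 1 */

    cudaSetDevice(1);
    cudaMalloc(&d_dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);   /* GPU 1 may access GPU 0 */

    /* Direct device-to-device copy; cudaMemcpyDefault relies on UVA */
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDefault);

    cudaFree(d_dst);
    cudaSetDevice(0);
    cudaFree(d_src);
    return 0;
}
```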

Is your MPI source code available? We need to do something similar over an InfiniBand fabric between multi-GPU nodes.

Sorry, no, the code I’m working on can’t be made public…

Any other suggestions or advice would be greatly appreciated! Thanks