Using MPI+multi-GPUs with CUDA 4.0


I’m currently in the process of creating a multi-GPU application to run on 8 nodes, each of them fitted with 3 NVIDIA Tesla M2070 GPUs.

My code already uses MPI: each process generates a unique set of data for its submesh, so data must be exchanged across the interfaces to assemble the full grid of tetrahedra (3D mesh).

Within each submesh, data should be split and balanced among 3 different GPUs, creating a second level of parallelism.
The idea is to get rid of the previous configuration, where each MPI process held a unique GPU context on its own device
(meaning that previously one GPU worked on a whole submesh).

Given the CUDA 4.0 feature of single-thread access to all GPUs, I guess the way to code this is to set up a single
MPI process per node, and then use unified virtual addressing to share the memories of my 3 GPUs?
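As a reference point for that direction, here is a minimal sketch of a single host thread driving several GPUs with the CUDA 4.0 runtime API: `cudaSetDevice()` can now be called repeatedly from one thread to switch the active device. The kernel `process_sub`, the buffer size, and the 3-GPU count are placeholders, not part of the original code.

```cuda
// Sketch: one host thread, multiple GPUs (CUDA 4.0 runtime API).
// Kernel launches are asynchronous, so the first loop overlaps work
// across all devices before any synchronization happens.
#include <cuda_runtime.h>

#define NGPUS 3   // assumption: 3 GPUs per node

__global__ void process_sub(float *d, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_buf[NGPUS];

    for (int g = 0; g < NGPUS; ++g) {
        cudaSetDevice(g);                        // switch active device
        cudaMalloc(&d_buf[g], n * sizeof(float));
        process_sub<<<(n + 255) / 256, 256>>>(d_buf[g], n);  // async launch
    }
    for (int g = 0; g < NGPUS; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();                 // wait for each GPU in turn
    }
    return 0;
}
```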

On the other hand, considering that CUDA 4.0 makes it possible to share GPUs across multiple threads, is it possible to use sub-communicators
for each node, in order to launch concurrent kernels from different host threads?

Finally, since the driver API seemed to me the most appropriate candidate (before CUDA 4.0) for handling multiple devices
from a single host thread, I would like your advice on APIs. With CUDA 4.0, can I just use the runtime API, or should I use the driver one?

Thank you for your time.

My suggestion is to start with an initial implementation that assigns a single GPU to each MPI task and uses the runtime API.
If the GPUs in the nodes are capable of P2P (i.e. they belong to the same PCIe root complex), you can then write an optimized version.
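To check the P2P condition on a given node, something along these lines should work with the CUDA 4.0 runtime (a small probe, not part of the original code):

```cuda
// Sketch: probe P2P capability between every pair of GPUs on the node.
// P2P requires the devices to sit on the same PCIe root complex and
// (on Fermi-class GPUs like the M2070) a 64-bit app with UVA enabled.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int a = 0; a < ndev; ++a) {
        for (int b = 0; b < ndev; ++b) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);   // can device a read device b?
            printf("GPU %d -> GPU %d : P2P %s\n", a, b, ok ? "yes" : "no");
        }
    }
    return 0;
}
```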

Thanks for your reply!

My code already supports assigning a single GPU to each MPI process using “int dev = mpi_rank % 2”, and I’m using the runtime API to make it work.
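For reference, a self-contained version of that rank-to-device mapping might look like the following (the modulus is taken from the queried device count rather than hard-coded; assumes ranks on a node are numbered consecutively):

```cuda
// Sketch: one-GPU-per-MPI-rank assignment via rank modulo device count.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, ndev = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // map this rank to a local GPU

    printf("rank %d uses device %d of %d\n", rank, rank % ndev, ndev);

    MPI_Finalize();
    return 0;
}
```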

Since the GPUs in the nodes are capable of P2P, and I’d also like to exploit unified virtual addressing within each node, do you think it is feasible for me to keep using the runtime API, given that all nodes share and consume resources in an equipotent manner?
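The runtime API should be sufficient for this: once peer access is enabled, UVA lets a plain `cudaMemcpy` with `cudaMemcpyDefault` infer the source and destination devices from the pointers, so interface data can move GPU-to-GPU without staging through host memory. A minimal sketch, assuming P2P is available between two of the node’s GPUs:

```cuda
// Sketch: enable bidirectional peer access between devices 0 and 1,
// then copy directly between their buffers using UVA (CUDA 4.0).
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;   // placeholder interface-buffer size
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);   // let device 0 access device 1

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);   // let device 1 access device 0

    // With UVA, cudaMemcpyDefault resolves which device owns each pointer.
    cudaMemcpy(d1, d0, bytes, cudaMemcpyDefault);
    cudaDeviceSynchronize();
    return 0;
}
```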

Is your MPI source code available? We need to do something similar over an InfiniBand fabric between multi-GPU equipped nodes.

Sorry no, the code I’m working on can’t be public…

Any other suggestions or advice would be greatly appreciated! Thanks