I’m currently developing a multi-GPU application to run on 8 nodes, each fitted with 3 NVIDIA Tesla M2070 GPUs.
My code already uses MPI: each process generates a unique set of data for one submesh, so data must be
exchanged across the interfaces to assemble the full grid of tetrahedra (3D mesh).
Within each submesh, the data should be split and balanced among the 3 GPUs, introducing a second level of parallelism.
The idea is to move away from the previous configuration, where each MPI process held a single GPU context on its own device
(i.e., one GPU worked on a whole submesh).
In that framework, given CUDA 4.0’s ability to access all GPUs from a single host thread, I guess the way to code this is to set up a single
MPI process per node, and then use Unified Virtual Addressing (UVA) to share the memories of my 3 GPUs?
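To make the question concrete, here is a minimal sketch of what I have in mind for that first option: one host thread cycling over the devices with cudaSetDevice, enabling peer access under UVA, and launching one kernel per GPU on its own stream (the relaxSubmesh kernel and the data sizes are placeholders, not my real code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-submesh computation.
__global__ void relaxSubmesh(double *part, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) part[i] *= 0.5;  // dummy work
}

int main(void)
{
    const int nGpus = 3;            // 3 Tesla M2070s per node
    const int nPerGpu = 1 << 20;    // placeholder partition size
    double *d_part[3];
    cudaStream_t stream[3];

    // A single host thread sets up all three devices (CUDA 4.0+).
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d_part[g], nPerGpu * sizeof(double));
        cudaStreamCreate(&stream[g]);
        // With UVA (64-bit, Fermi), enable peer access so each device
        // can read interface data resident on the others directly.
        for (int peer = 0; peer < nGpus; ++peer)
            if (peer != g) cudaDeviceEnablePeerAccess(peer, 0);
    }

    // Kernel launches are asynchronous, so the single thread can keep
    // all three GPUs busy concurrently before synchronizing.
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        relaxSubmesh<<<(nPerGpu + 255) / 256, 256, 0, stream[g]>>>(d_part[g], nPerGpu);
    }
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(stream[g]);
    }
    return 0;
}
```

Is this single-thread pattern the intended way to use the new API, and would peer access actually work between all three M2070s on one node?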
On the other hand, considering that CUDA 4.0 makes it possible to share GPUs across multiple threads, would it be possible to use a sub-communicator
per node, so that concurrent kernels can be launched from different host threads?
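For that second option, I picture something like the following sketch: MPI_Comm_split groups the ranks of one node into a node-local communicator, and OpenMP threads each drive one GPU (the hostname hash used as the split color is just my assumption for identifying a node):

```cuda
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Group ranks by node: derive a split color from the processor
    // name (a simple hash -- an assumption on my side; any scheme
    // that gives the same color to ranks on the same node would do).
    char name[MPI_MAX_PROCESSOR_NAME];
    int len, color = 0;
    MPI_Get_processor_name(name, &len);
    for (int i = 0; i < len; ++i)
        color = (31 * color + name[i]) & 0x7fffffff;

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &node_comm);

    // Within each node, spawn one host thread per GPU; each thread
    // binds to its device and launches kernels on its share of the
    // submesh.
    #pragma omp parallel num_threads(3)
    {
        cudaSetDevice(omp_get_thread_num());  // thread t drives GPU t
        // ... launch this thread's kernels here ...
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Would this thread-per-GPU layout with node sub-communicators be preferable to the single-thread/UVA approach above, or does it just add synchronization overhead?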
Finally, since the driver API seemed to me the most appropriate candidate so far (before CUDA 4.0) for handling multiple devices
from a single host thread, I would like your advice on the choice of API. With CUDA 4.0, can I just use the runtime API, or should I still use the driver API?
Thank you for your time.