CUDA and MPI Cluster Computing Implementation. Need advice for setting up MPI and CUDA over a cluster

I need to be able to run a program using all the GPUs in a two-computer cluster. There are five GPUs in total: 3 in one machine, 2 in the other. My idea for implementing such a program is the following:

    Start MPI

    Initialize device driver threads for each device (on both computers)

    Send data to secondary computer

    Send data to all GPUs

    Do required operations

    Terminate GPU threads

    Send data back to primary computer

    Stop MPI
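
The outline above might be sketched roughly as follows, assuming the one-process-per-GPU approach rather than separate driver threads (all names other than the MPI/CUDA API calls are illustrative, and the kernel launch is elided):

```c
/* Hypothetical SPMD sketch: one MPI process per GPU.
 * Compile with mpicc and link against the CUDA runtime. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, ndev;
    MPI_Init(&argc, &argv);                 /* 1. start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 2. pick a local GPU; the modulo mapping is a simplification
     * and would need to match how ranks are laid out by the hostfile */
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    const int N = 1 << 20;                  /* per-rank chunk size */
    float *h = malloc(N * sizeof *h), *d;
    float *root = NULL;
    if (rank == 0)
        root = malloc((size_t)N * size * sizeof *root);

    /* 3./4. rank 0 distributes a chunk of data to every rank */
    MPI_Scatter(root, N, MPI_FLOAT, h, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaMalloc((void **)&d, N * sizeof *d);
    cudaMemcpy(d, h, N * sizeof *d, cudaMemcpyHostToDevice);

    /* 5. do the required operations -- kernel launch elided */
    /* my_kernel<<<blocks, threads>>>(d, N); */

    /* 6./7. copy results back and gather them on the primary node */
    cudaMemcpy(h, d, N * sizeof *d, cudaMemcpyDeviceToHost);
    MPI_Gather(h, N, MPI_FLOAT, root, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d);
    free(h);
    free(root);
    MPI_Finalize();                         /* 8. stop MPI */
    return 0;
}
```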

My question is, will this work? Also, I am having trouble finding documentation on how to implement MPI over a cluster. Does the program need to be present on each computer in the cluster, or will MPI transfer the program into the computer’s local memory? I will be using an SPMD paradigm.

This is being done on a cluster with both computers running CentOS x64 (most recent version). I am using MPICH as my MPI implementation, and CUDA 2.3 as my CUDA implementation. Any help or advice would be greatly appreciated.

~Alex

This thread has some good links on parallel programming: http://forums.nvidia.com/index.php?showtopic=107375

You may get more interest if you describe in some detail the nature of your application.

That is a bit of a curious statement. MPI is primarily intended for use on distributed memory machines and clusters. The interweb is literally overflowing with tutorials and documentation describing how to use it on clusters. LLNL maintains a very useful set of introductions to many of the APIs used in HPC, for example. Their material on MPI can be read here.
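
To the cluster-launch part of your question: mpiexec does not ship the binary for you, so the executable has to be present at the same path on every node, either copied there or on a shared filesystem such as NFS. Launching then looks something like this (the hostfile name and host names below are made up, and the exact flags vary with the MPICH version; check `mpiexec --help`):

```shell
# Hypothetical hostfile "hosts": one process slot per GPU,
# 3 on the first machine, 2 on the second:
#   node1:3
#   node2:2

# Launch 5 processes across the two nodes:
mpiexec -machinefile hosts -n 5 ./my_mpi_cuda_prog
```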

It would seem you have two basic design choices: either use host threads to run the multiple GPUs on each cluster node and use MPI only for inter-node communications (so the number of members in the MPI communicator equals the number of nodes), or just use MPI and run one process per GPU (so the number of members in the MPI communicator equals the number of GPUs). The latter will be simpler, because it requires only one API rather than two, but it potentially won’t perform as well, because MPI processes are considerably “heavier” than host threads, and things which happen naturally between threads within a shared memory space require explicit data exchange in MPI, which increases communication overhead.