Question about CUDA+MPI

Hi, I have a question about inter-node computation using CUDA+MPI.
I succeeded in writing a CUDA Fortran code that uses MPI,
but I can only run it on a single node.
How can I run it on multiple nodes?

I set up MPI and the GPU device with the following code.

use cudafor
use mpi
integer :: myrank, nprocs, tag, ierr, localRank
character(len=10) :: localRankStr

! Select the GPU from the node-local rank before MPI_init
! (OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun)
call get_environment_variable('OMPI_COMM_WORLD_LOCAL_RANK', localRankStr)
read(localRankStr, '(i10)') localRank
ierr = cudaSetDevice(localRank)

! MPI initialization
call MPI_init(ierr)
call MPI_comm_rank(MPI_COMM_WORLD, myrank, ierr)
call MPI_comm_size(MPI_COMM_WORLD, nprocs, ierr)

Secondly, can I choose which GPUs are assigned when I run multiple processes?
I built and ran the code with the following commands.

mpif90 -O3 -ta=tesla,cuda8.0 -o aaa.out aaa.cuf
mpirun -np 4 ./aaa.out

Then devices 0, 1, 2, and 3 were used automatically.
Can I assign GPUs 1, 3, 5, and 7 instead?

Hello,

I think what you are asking for is answers to the following
(remember, the goal is to speed up the code):

  1. With MPI, can I speed up my program by running two processes on the
     same platform? Yes you can: if your computation cost is high but your
     memory use is low, a multi-core machine may be able to run two or more
     MPI processes at full speed. A multi-user operating system is needed
     for this.

  2. With CUDA+MPI, can I speed up my program by running two processes on
     the same platform and have them share the resources of one GPU?

Probably not. Two processes sharing one GPU are generally serialized, so
this could be slower than a single process using the GPU alone. Having only
one process use the GPU may be the best option (but a little more
complicated).

  3. With CUDA+MPI, can I run two processes on a single multi-core platform
     with two GPUs, with each process selecting and using one of the GPUs?

This may work, but the separate memory spaces may be a pain to manage. It
will probably also be slower than running a single process on the platform
with a multi-threaded OpenMP parallel section, where each thread of the
single process selects a GPU and runs the kernel on it. Since it is
multi-threaded, both GPUs work in the same memory space (see the sketch
below).
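
Here is a minimal sketch of that pattern, assuming CUDA Fortran compiled
with OpenMP support (-mp); the kernel launch is left as a placeholder
comment, and "mykernel" is a hypothetical name:

program multi_gpu
  use cudafor
  use omp_lib
  implicit none
  integer :: ngpus, tid, istat

  istat = cudaGetDeviceCount(ngpus)

  ! One OpenMP thread per GPU; all threads share one address space
  !$omp parallel num_threads(ngpus) private(tid, istat)
    tid = omp_get_thread_num()
    istat = cudaSetDevice(tid)   ! each thread drives its own GPU
    ! ... allocate device arrays and launch the kernel here, e.g.
    ! call mykernel<<<grid, block>>>(...)
    istat = cudaDeviceSynchronize()
  !$omp end parallel
end program multi_gpu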


In general, it is better to run a single MPI process on each platform with
GPUs, and have that process drive the multiple GPUs with OpenMP, so the
GPUs can run simultaneously.

dave

To add to Dave’s suggestions, you can compile with “-ta=host,tesla:cuda8.0” to create a unified binary. In this case, if the node has a GPU, it will be used; otherwise, the code will run sequentially on the host. Compiling with “-ta=multicore,tesla:cuda8.0” will create a unified binary that again runs on the GPU if available, and otherwise runs across all the cores of the host.
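
For instance, reusing aaa.cuf from the original post (illustrative command lines):

mpif90 -O3 -ta=host,tesla:cuda8.0 -o aaa.out aaa.cuf
mpif90 -O3 -ta=multicore,tesla:cuda8.0 -o aaa.out aaa.cuf

The first binary falls back to sequential host execution when no GPU is found; the second falls back to running across all host cores.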

Then devices 0, 1, 2, and 3 were used automatically.
Can I assign GPUs 1, 3, 5, and 7 instead?

Sure, assuming you have 8 GPUs on the system. Just change “localRank” in your call to “cudaSetDevice” to the GPU number you want to assign to that MPI rank.
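
For example, to map four ranks onto devices 1, 3, 5, and 7, one possible mapping is:

ierr = cudaSetDevice(2*localRank + 1)   ! local rank 0,1,2,3 -> device 1,3,5,7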

-Mat

Thank you Dave and thank you Mat!
I solved both problems!

Thanks again!