An Introduction to CUDA-Aware MPI

Originally published at:

MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons…


I have a question about MPI communication: I didn't understand whether it can be used to transfer data between GPUs in the same machine.

Thank you in advance.

Hi Miguel,

Yes. MPI can be used for communication between GPUs, both within a node and across nodes: it supports intra-node (within a node) and inter-node (across cluster nodes) communication. MVAPICH2 is a CUDA-aware MPI library that you can use to perform communication between GPUs in the same machine as well as across machines.
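With a CUDA-aware MPI library, device pointers can be passed directly to the usual MPI calls. Here is a minimal sketch (assuming a CUDA-aware build such as MVAPICH2 or Open MPI with CUDA support, and one GPU per rank; buffer size and the rank-to-GPU mapping are illustrative assumptions):

```cuda
/* Minimal CUDA-aware MPI exchange between two ranks.
 * The device pointer d_buf is handed directly to MPI_Send/MPI_Recv;
 * the CUDA-aware MPI library handles the GPU-to-GPU transfer. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;          /* illustrative buffer size */
    float *d_buf;
    cudaSetDevice(rank);            /* assumes one GPU per rank on this node */
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* ... fill d_buf with a kernel or cudaMemcpy ... */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);  /* device ptr */
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware MPI you would instead have to stage the data through a host buffer with cudaMemcpy around each MPI call; the CUDA-aware path avoids that extra copy and lets the library use GPUDirect where available.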

Can it be used to split GPU computing power into smaller pieces, for example to rent out to someone?

How about mpi4py in Python? Is it PyCUDA-aware?

Hi Jiri,
A new question. With CUDA aware MPI, in MPI_Send, if the send buffer is a device pointer and the data is produced by a previous compute kernel (might be on a non-default stream, might be results of a cuBLAS call, or in other words, as a library developer, I don't know where the data is from), do I need to call cudaDeviceSynchronize() before the send?
After MPI_Recv(), if the recv buffer is a device pointer, can I access it immediately in a kernel on a new stream?
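The scenario in this question can be sketched as follows (hypothetical kernel names; this illustrates the pattern being asked about, under the assumption that MPI's standard API is stream-unaware, so the conservative approach is to synchronize the producing stream before handing the buffer to MPI):

```cuda
/* Sketch of the producer -> MPI_Send / MPI_Recv -> consumer pattern.
 * produce() and consume() are hypothetical placeholder kernels. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n);   /* hypothetical */
__global__ void consume(float *buf, int n);   /* hypothetical */

void exchange(float *d_buf, int n, int peer, cudaStream_t producer)
{
    /* The data was produced on a non-default stream (or by a cuBLAS
     * call whose stream we don't control), so make sure that work has
     * finished before MPI reads the device buffer.  Synchronizing only
     * the producing stream is cheaper than cudaDeviceSynchronize(). */
    cudaStreamSynchronize(producer);
    MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);

    /* Blocking MPI_Recv returns only once the data has landed in
     * d_buf, so a kernel launched afterwards, even on a brand-new
     * stream, observes the received data. */
    MPI_Recv(d_buf, n, MPI_FLOAT, peer, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    cudaStream_t s;
    cudaStreamCreate(&s);
    consume<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}
```

This is a sketch of the conservative pattern, not a statement about any particular MPI implementation; stream-aware extensions (where the library itself orders the transfer against a stream) are implementation-specific.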