CUDA + MPI: HtoD and DtoH in profiling & how to keep communication on the device

I am running a C++ program that uses MPI and CUDA. In my profile, I want to confirm that this tooltip indicates that device memory is being copied to the host and then back to the device.

I would prefer that the communication not go through the host and back to the device. Is there a way to have all of these communications happen exclusively on the device? Am I missing a compiler flag when building `openmpi`? Do I need a flag on the CUDA side, or a flag at run time for my application?

If you use an ordinary (non-CUDA-aware) MPI with CUDA, then data exchanges between GPU buffers have to be staged through host memory, so there will be host<->device traffic.

If you use a CUDA-aware MPI, you should be able to pass device buffer pointers directly to MPI calls (e.g. `MPI_Sendrecv`), and in that case MPI should be able to transfer the data more efficiently.

This blog should help you get started.