I am running a C++ program that uses MPI and CUDA. From my profiling, I just want to confirm that this tooltip indicates that device memory is being copied to the host and then back to the device.
I would prefer that the communication not go through the host and then back to the device. Is there a way to have all of these communications happen exclusively on the device? Might I be missing a compiler flag when building `openmpi`? Perhaps I need a flag on the CUDA side? Or even a flag at run time for my application?
