I am using CUDA-aware MPI with PGI 17.7 and the Open MPI build that ships with the compiler.
The code works, but when I profile it, the MPI routines take longer than they did in the CPU-only version.
Using pgprof, I can see several asynchronous memory transfers, both device-to-host and host-to-device, around the MPI calls.
I am running the code on only one GPU (a GeForce 970), so I do not understand why these transfers are happening.
(I have since learned that GeForce cards do not support GPUDirect RDMA, but even so, if the MPI source and destination buffers are on the same card, shouldn't the library use a device-to-device copy instead of staging through the host?)