CUDA-aware MPI does not use peer-to-peer for MPI collectives

Good morning everybody!

I’ve been trying to study the performance of Open MPI with CUDA support on a DGX A100 machine and on the Italian Leonardo HPC cluster.

I profiled some calls with nsys and found that for MPI_Send/Recv/Isend/Irecv the peer-to-peer copy (cudaMemcpy P2P) is indeed used, and performance is as expected.
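For reference, a per-rank trace of this kind can be collected with an invocation along these lines (rank count, output name, and executable name are placeholders):

```
mpirun -np 2 nsys profile --trace=cuda,mpi -o report_rank%q{OMPI_COMM_WORLD_RANK} ./reduce_gpu
```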

The problem is that for scientific applications I will need MPI collectives, in particular MPI_Reduce, and there I found that performance with device buffers is worse than running the same reduction with MPI on host (CPU) buffers. Profiling the code with nsys shows that for each reduce operation a memcpy H2D and a memcpy D2H are issued: the collective does accept the device buffers, but the data is staged through the host instead of being moved peer to peer, so performance is poor.
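To make the pattern concrete, this is essentially the call in question, stripped down to a minimal sketch (buffer size, rank-to-GPU mapping, and variable names are illustrative placeholders, not the attached code):

```c
/* Minimal sketch: MPI_Reduce called directly on device buffers. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 24;                      /* illustrative element count */
    double *d_send, *d_recv;
    cudaSetDevice(rank);                        /* hypothetical one-GPU-per-rank mapping */
    cudaMalloc((void **)&d_send, N * sizeof(double));
    cudaMalloc((void **)&d_recv, N * sizeof(double));
    cudaMemset(d_send, 0, N * sizeof(double));  /* stand-in for real device data */

    /* Device pointers are passed straight to the collective (CUDA-aware MPI). */
    MPI_Reduce(d_send, d_recv, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```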

Is there a known problem with MPI collectives, or did I do something wrong?

I attach the code to this message.

Thank you in advance!

reduce_gpu.txt (4.2 KB)

This may be a function of the specific MPI library you are using and how it was compiled. CUDA-aware MPI doesn’t impose any requirements on how or when the data is transferred. The only requirement is that device pointers have to be accepted/usable. Everything else is an implementation detail.

So if MPI_Send/MPI_Recv use P2P but MPI_Reduce does not, it is likely due to how the MPI library you are using is designed, or possibly to some compilation settings used when it was built.
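If this is Open MPI, one quick check is whether the library was built with CUDA support at all. Here is a minimal sketch using Open MPI’s mpi-ext.h extension (this check is Open MPI-specific, and it only reports whether the build is CUDA-aware, not which path each collective will take):

```c
/* Sketch: query Open MPI's CUDA-aware support via the mpi-ext.h extension. */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* defines MPIX_CUDA_AWARE_SUPPORT when the extension is present */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support: yes\n");
    printf("Run-time CUDA-aware support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("This build does not report CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}
```

The same information is reported by `ompi_info --parsable --all | grep mpi_built_with_cuda_support:value`. Note that a CUDA-aware build can still stage collectives through host memory, which would match the H2D/D2H copies you observed.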

You may need to raise this issue with your cluster administrators.