Good morning everybody!
I’ve been trying to study the performance of Open MPI with CUDA support on a DGX-A100 machine and on the Italian Leonardo HPC cluster.
I profiled some calls with nsys and found that for MPI_Send/Recv/Isend/Irecv the transfers actually go through P2P device-to-device copies (cudaMemcpyP2P in the trace), and performance is as expected.
The problem is that for scientific applications I will need MPI collectives, in particular MPI_Reduce, and I’ve found that on the GPUs they are slower than MPI on the CPUs. Profiling the code with nsys shows that for each reduce operation a memcpy D2H and H2D pair is performed: the reduce does work correctly with device buffers, but since the data is staged through host memory instead of going P2P, performance is poor.
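To make the pattern clearer, here is a minimal sketch of the kind of call I am referring to (simplified, with illustrative sizes and names; the full benchmark is in the attached file):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one GPU per rank (e.g. on a DGX-A100 node) */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    const size_t n = 1 << 24;               /* element count, illustrative */
    double *d_send = NULL, *d_recv = NULL;
    cudaMalloc((void **)&d_send, n * sizeof(double));
    cudaMalloc((void **)&d_recv, n * sizeof(double));
    cudaMemset(d_send, 0, n * sizeof(double));

    /* device pointers passed directly to the collective:
       this is the call for which nsys shows the D2H/H2D staging copies */
    MPI_Reduce(d_send, d_recv, (int)n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```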
Is there a known limitation with CUDA-aware MPI collectives, or did I do something wrong?
I have attached the code to this message.
Thank you in advance!
reduce_gpu.txt (4.2 KB)