I am testing the efficiency of CUDA-aware GPU-to-GPU communication on our small computing platform, using nvhpc/23.5. I have tested the code offered by mukul1992:
Unlike his/her results in “mpi.log”, where the Fortran GPU-to-GPU code is 4x slower than the CPU and the C GPU-to-GPU code is 12x faster than the CPU, on our platform both the Fortran and C codes show that GPU-to-GPU is about 3x slower than CPU. I have also tested a Jacobi iteration code and got similar results: GPU-to-GPU communication using CUDA-aware MPI is much slower. I have checked
ompi_info --all | grep btl_openib_have_cuda_gdr
and it is true, but
ompi_info --all | grep btl_openib_have_driver_gdr
is false. Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication? If so, how can I ensure that “btl_openib_have_driver_gdr = true” with the NVHPC SDK? I found this parameter in “FAQ: Running CUDA-aware Open MPI”, which says that GPUDirect RDMA may require btl_openib_have_driver_gdr to be true.
Thanks in advance.
Interesting question, and not something I’ve looked at before, so I don’t have great insight here. We’d need to ask elsewhere for a more definitive answer, but here’s what I can tell from my experimentation, reading the documentation, and looking at the Open MPI code.
Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication?
Again, I’m not positive, but I don’t think this is the issue. This setting appears to apply only to how the OFED stack is built.
When I profile the code with Nsight Systems, there’s no device-to-device communication; rather, it appears the buffer is brought back to the host to perform the reduction.
This can be confirmed by looking at the Open MPI source.
It does appear that the reduce is performed on the host, and the extra copies account for the performance difference.
It’s not unreasonable that Open MPI implemented it this way. I was not involved, so I don’t really know, but I suspect that since the reduction needs to be computed, the library would need to launch a CUDA kernel to do it, which in turn would allow device-to-device communication. That may or may not be possible, and may not be as efficient.
Note that I can confirm that other APIs, such as MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv, do perform device-to-device communication.
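For reference, here is a minimal sketch of the kind of device-to-device exchange I mean, with OpenACC device buffers passed directly to MPI_Send/MPI_Recv. The program, buffer size, tags, and rank pairing are illustrative assumptions, not code from this thread; it assumes an even number of ranks and a CUDA-aware MPI build.

! Minimal sketch: CUDA-aware MPI_Send/MPI_Recv with OpenACC device buffers.
! Illustrative only; assumes an even number of ranks and a CUDA-aware MPI.
program g2g_sendrecv
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  integer :: rank, nprocs, partner, ierr
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(sendbuf(n), recvbuf(n))
  sendbuf = real(rank, 8)
  partner = merge(rank+1, rank-1, mod(rank,2) == 0)   ! pair neighbouring ranks

  !$acc data copyin(sendbuf) copyout(recvbuf)
  ! host_data exposes the device addresses to MPI, so a CUDA-aware build can
  ! move the data GPU-to-GPU without an explicit device-to-host copy.
  !$acc host_data use_device(sendbuf, recvbuf)
  if (mod(rank,2) == 0) then
     call MPI_Send(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, ierr)
     call MPI_Recv(recvbuf, n, MPI_DOUBLE_PRECISION, partner, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  else
     call MPI_Recv(recvbuf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
     call MPI_Send(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 1, MPI_COMM_WORLD, ierr)
  end if
  !$acc end host_data
  !$acc end data

  if (rank == 0) print *, 'received first element:', recvbuf(1)
  call MPI_Finalize(ierr)
end program g2g_sendrecv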
Thanks very much for your comments. I have tested MPI_Send/MPI_Recv, and it is true that GPU-to-GPU communication is much faster than CPU (about 8x). In addition, I have retested my Jacobi iteration code (see attached).
jac_gpu_g2g.F90 (6.8 KB)
In this code, I have used the user-defined MPI types “row” and “col”: “col” is contiguous in memory, while “row” is not.
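For a 2-D array stored in Fortran column-major order, these two types would typically be built along the following lines; the routine name and extents are assumptions for illustration, not the declarations in the attached file.

! Sketch (not the attached code): contiguous and strided halo types
! for an nx-by-ny array in Fortran column-major order.
subroutine build_halo_types(nx, ny, col_type, row_type)
  use mpi
  implicit none
  integer, intent(in)  :: nx, ny
  integer, intent(out) :: col_type, row_type
  integer :: ierr
  ! A column a(1:nx, j) is nx contiguous elements.
  call MPI_Type_contiguous(nx, MPI_DOUBLE_PRECISION, col_type, ierr)
  ! A row a(i, 1:ny) is ny single elements with a stride of nx between them.
  call MPI_Type_vector(ny, 1, nx, MPI_DOUBLE_PRECISION, row_type, ierr)
  call MPI_Type_commit(col_type, ierr)
  call MPI_Type_commit(row_type, ierr)
end subroutine build_halo_types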
Results for CPU
col time step : 1 1.3610000000000000E-004
row time step : 1 9.7930999999999990E-003
col time step : 2 6.7899999999999997E-005
row time step : 2 2.5761000000000000E-003
col time step : 3 4.6999999999999997E-005
row time step : 3 3.1399999999999999E-004
col time step : 4 7.0100000000000002E-004
row time step : 4 3.1589999999999998E-004
col time step : 5 5.1999999999999997E-005
row time step : 5 3.1799999999999998E-004
Initialization time: 0.1096659000000000
Communication time: 1.4635899999999999E-002
Col communication time: 1.0039999999999999E-003
Row communication time: 1.3317100000000000E-002
Computation time: 0.6328073000000000
Total elapsed time: 0.7724179999999999
Results for GPU-to-GPU
col time step : 1 0.6048249999999999
row time step : 1 1.077579100000000
col time step : 2 6.3600999999999996E-003
row time step : 2 1.106138000000000
col time step : 3 2.8990000000000000E-004
row time step : 3 1.110568000000000
col time step : 4 4.2488999999999999E-003
row time step : 4 1.162841000000000
col time step : 5 2.7609999999999999E-004
row time step : 5 1.097152900000000
Initialization time: 5.0577100000000000E-002
Communication time: 6.170643800000000
Col communication time: 0.6160000000000000
Row communication time: 5.554279000000000
Computation time: 6.5922999999999997E-003
Total elapsed time: 7.208886100000000
The above results show that on the CPU, “row” communication is a little slower than “col” communication, but for GPU-to-GPU, “row” communication is much slower than “col” communication. In addition, the first “col” communication is much slower than the following steps. This is just a test code for illustration; I can modify my code to use only the MPI_Send/MPI_Recv interfaces, so thanks very much for your comments.
I wasn’t sure if this was a question or just an observation, but sending non-contiguous data segments will likely be slower than sending a contiguous block. Typically, if the data is non-contiguous, such as when passing halos, folks will pack it into a contiguous buffer, send the buffer, and then unpack it on the receiving side. A rough sketch of that pattern is below.
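This sketch shows the pack/exchange/unpack pattern done entirely on the device, so the contiguous buffer can still be sent GPU-to-GPU. The routine name, neighbour ranks, ghost-layer layout, and tags are assumptions for illustration, not code from the attached file.

! Sketch (not the attached code): pack a non-contiguous row halo into a
! contiguous buffer on the device, exchange it, and unpack the result.
! Assumes a(0:nx+1, 0:ny+1) with one ghost layer is already present on device.
subroutine exchange_row_halo(a, nx, ny, left, right)
  use mpi
  implicit none
  integer, intent(in) :: nx, ny, left, right
  real(8), intent(inout) :: a(0:nx+1, 0:ny+1)
  real(8) :: sendbuf(ny), recvbuf(ny)
  integer :: j, ierr

  !$acc data create(sendbuf, recvbuf) present(a)

  ! Pack the strided row a(nx, 1:ny) into a contiguous device buffer.
  !$acc parallel loop
  do j = 1, ny
     sendbuf(j) = a(nx, j)
  end do

  ! Exchange the contiguous buffers device-to-device (CUDA-aware MPI).
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, ny, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, ny, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data

  ! Unpack the received halo into the ghost row.
  !$acc parallel loop
  do j = 1, ny
     a(0, j) = recvbuf(j)
  end do

  !$acc end data
end subroutine exchange_row_halo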