Hi,
I am testing the efficiency of CUDA-aware GPU-to-GPU communication on our small computing platform, using nvhpc/23.5. I have tested the code offered by mukul1992:
https://github.com/mukul1992/ctest_gpu_mpi
Different from his/her results in “mpi.log”, where the Fortran GPU-to-GPU code is 4x slower than the CPU version and the C GPU-to-GPU code is 12x faster, on our platform both the Fortran and C codes show GPU-to-GPU being 3x slower than CPU. I have also tested a Jacobi iteration code and got similar results: GPU-to-GPU communication using CUDA-aware MPI is much slower. I have checked
ompi_info --all | grep btl_openib_have_cuda_gdr
and it is true, but
ompi_info --all | grep btl_openib_have_driver_gdr
is false. Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication? If so, how can I ensure “btl_openib_have_driver_gdr = true” with the NVHPC SDK? I found this parameter in “FAQ: Running CUDA-aware Open MPI”, which says GPUDirect RDMA may require btl_openib_have_driver_gdr to be true.
Thanks in advance.
Zhuang
Hi Zhuang,
Interesting question, and not something I’ve looked at before, so I don’t have great insight here. We’d need to ask elsewhere for a more definitive answer, but here’s what I can tell from my experimentation, reading the documentation, and looking at the OpenMPI code.
Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication?
Again, I’m not positive, but I don’t think this is the issue. This setting appears to apply only to how the OFED stack is built.
When I profile the code with Nsight Systems, there’s no device-to-device communication; rather, it appears the buffer is brought back to the host to perform the reduction.
This can be confirmed by looking at the OpenMPI source: it does appear that the reduce is performed on the host, and the extra copy accounts for the performance difference.
It’s not unreasonable that OpenMPI implemented it this way. I wasn’t involved so I don’t really know, but I suspect that since the reduction needs to be computed, the library would have to launch a CUDA kernel to do it, which in turn would allow device-to-device communication. That may or may not be possible, or may not be as efficient.
Note that I can confirm other APIs such as MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv do perform device-to-device communication.
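For reference, here is a minimal sketch of that pattern: an OpenACC-managed device buffer exposed to MPI via host_data so a CUDA-aware build can move it GPU-to-GPU. The array name and size are illustrative only, not taken from the test code.

```fortran
! Sketch only: device-to-device MPI_Send/MPI_Recv between two ranks with a
! CUDA-aware MPI and OpenACC.  "buf" and "n" are illustrative names.
program g2g_sendrecv
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  real(8), allocatable :: buf(:)
  integer :: rank, nprocs, peer, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  peer = merge(1, 0, rank == 0)   ! assumes exactly two ranks

  allocate(buf(n))
  buf = real(rank, 8)

  !$acc data copy(buf)
  ! host_data passes the device address of buf to MPI, so a CUDA-aware
  ! build can transfer GPU-to-GPU without staging through the host.
  !$acc host_data use_device(buf)
  if (rank == 0) then
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, peer, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, peer, 0, MPI_COMM_WORLD, status, ierr)
  end if
  !$acc end host_data
  !$acc end data

  call MPI_Finalize(ierr)
end program g2g_sendrecv
```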
-Mat
Hi Mat,
Thanks very much for your comments. I have tested MPI_Send/MPI_Recv, and it is true that GPU-to-GPU communication is much faster than CPU (about 8x). In addition, I have tested my Jacobi iteration code again (see attached).
jac_gpu_g2g.F90 (6.8 KB)
In this code, I have used the user-defined MPI types “row” and “col”: “col” is contiguous and “row” is not contiguous. A sketch of how such types are typically constructed is shown below.
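Roughly, the types are built along these lines for a column-major Fortran array a(nx,ny); this is only a sketch, and the exact extents and array shape in the attached jac_gpu_g2g.F90 may differ.

```fortran
! Sketch only: typical construction of the "col" and "row" halo types for
! a Fortran (column-major) array a(nx,ny).
subroutine make_halo_types(nx, ny, col_type, row_type)
  use mpi
  implicit none
  integer, intent(in)  :: nx, ny
  integer, intent(out) :: col_type, row_type
  integer :: ierr
  ! a column a(1:nx, j) is contiguous in memory
  call MPI_Type_contiguous(nx, MPI_DOUBLE_PRECISION, col_type, ierr)
  ! a row a(i, 1:ny) is strided: ny blocks of one element, stride nx
  call MPI_Type_vector(ny, 1, nx, MPI_DOUBLE_PRECISION, row_type, ierr)
  call MPI_Type_commit(col_type, ierr)
  call MPI_Type_commit(row_type, ierr)
end subroutine make_halo_types
```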
Results for CPU
col time step : 1 1.3610000000000000E-004
row time step : 1 9.7930999999999990E-003
col time step : 2 6.7899999999999997E-005
row time step : 2 2.5761000000000000E-003
col time step : 3 4.6999999999999997E-005
row time step : 3 3.1399999999999999E-004
col time step : 4 7.0100000000000002E-004
row time step : 4 3.1589999999999998E-004
col time step : 5 5.1999999999999997E-005
row time step : 5 3.1799999999999998E-004
Initialization time: 0.1096659000000000
Communication time: 1.4635899999999999E-002
Col communication time: 1.0039999999999999E-003
Row communication time: 1.3317100000000000E-002
Computation time: 0.6328073000000000
Total elapsed time: 0.7724179999999999
Results for GPU-to-GPU
col time step : 1 0.6048249999999999
row time step : 1 1.077579100000000
col time step : 2 6.3600999999999996E-003
row time step : 2 1.106138000000000
col time step : 3 2.8990000000000000E-004
row time step : 3 1.110568000000000
col time step : 4 4.2488999999999999E-003
row time step : 4 1.162841000000000
col time step : 5 2.7609999999999999E-004
row time step : 5 1.097152900000000
Initialization time: 5.0577100000000000E-002
Communication time: 6.170643800000000
Col communication time: 0.6160000000000000
Row communication time: 5.554279000000000
Computation time: 6.5922999999999997E-003
Total elapsed time: 7.208886100000000
The above results show that with CPU, “row” communication is a little slower than “col” communication, but for GPU-to-GPU, “row” communication is much slower than “col” communication. Besides, the first “col” communication is also much slower than the following steps. This is just a test code for illustration; I can modify my code to use just the MPI_Send/MPI_Recv interfaces. Thanks very much for your comments.
Hi liuzhuang,
I wasn’t sure if this was a question or just an observation, but sending non-contiguous data segments will likely be slower than sending a contiguous block. Typically, if the data is non-contiguous, such as when passing halos, folks will pack the data into a contiguous buffer, send the buffer, then unpack it on the receiving side, along the lines of the sketch below.
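A rough sketch of that pack/send/unpack pattern with OpenACC and a CUDA-aware MPI; all names here are illustrative, not from the attached code, and it assumes the array is already present on the device from an enclosing data region.

```fortran
! Sketch only: pack a strided "row" halo of a(nx,ny) into a contiguous
! device buffer, exchange it, and scatter the received buffer back.
subroutine exchange_row_halo(a, nx, ny, peer)
  use mpi
  implicit none
  integer, intent(in)    :: nx, ny, peer
  real(8), intent(inout) :: a(nx, ny)
  real(8) :: sendbuf(ny), recvbuf(ny)
  integer :: j, ierr, status(MPI_STATUS_SIZE)

  ! "a" is assumed to already be on the device (present from the caller)
  !$acc data create(sendbuf, recvbuf) present(a)

  ! gather the non-contiguous row a(1,1:ny) into a contiguous device buffer
  !$acc parallel loop
  do j = 1, ny
     sendbuf(j) = a(1, j)
  end do

  ! expose the device addresses so the transfer can stay GPU-to-GPU
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, ny, MPI_DOUBLE_PRECISION, peer, 0, &
                    recvbuf, ny, MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, status, ierr)
  !$acc end host_data

  ! scatter the received contiguous buffer back into the halo row
  !$acc parallel loop
  do j = 1, ny
     a(nx, j) = recvbuf(j)
  end do

  !$acc end data
end subroutine exchange_row_halo
```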
-Mat