About the inefficiency of CUDA-aware GPU-to-GPU communication

Hi,
I am testing the efficiency of CUDA-aware GPU-to-GPU communication on our small computing platform, using nvhpc/23.5. I have tested the code provided by mukul1992:

https://github.com/mukul1992/ctest_gpu_mpi

Unlike his/her results in “mpi.log”, where the Fortran GPU-to-GPU code is 4x slower than the CPU code and the C GPU-to-GPU code is 12x faster than the CPU code, both the Fortran and C codes on our platform show that GPU-to-GPU is 3x slower than CPU. I have also tested a Jacobi iteration code and got similar results: GPU-to-GPU communication using CUDA-aware MPI is much slower. I have checked

ompi_info --all | grep btl_openib_have_cuda_gdr

which reports true, but

ompi_info --all | grep btl_openib_have_driver_gdr

reports false. Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication? If so, how can I make “btl_openib_have_driver_gdr = true” with the NVHPC SDK? I found this parameter in “FAQ: Running CUDA-aware Open MPI”, which suggests that GPUDirect RDMA may require btl_openib_have_driver_gdr to be true.
Thanks in advance.

Zhuang

Hi Zhuang,

Interesting question, and not something I’ve looked at before, so I don’t have great insight here. We’d need to look elsewhere for a more definitive answer, but here’s what I can tell from my experimentation, reading the documentation, and looking at the OpenMPI code.

Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication?

Again, I’m not positive, but I don’t think this is the issue. This setting appears to apply only to how the OFED stack is built.

When I profile the code with Nsight Systems, there’s no device-to-device communication; instead, it appears the buffer is brought back to the host to perform the reduction.

This can be confirmed by looking at the OpenMPI source: it does appear that the reduction is performed on the host, and the extra copies account for the performance difference.

It’s not unreasonable that OpenMPI implemented it this way. I wasn’t involved, so I don’t really know, but I suspect that because a reduction needs to be computed, the library would have to launch a CUDA kernel to do it on the device, which in turn would allow device-to-device communication. This may or may not be possible, or may not be as efficient.

Note that I can confirm that other APIs, such as MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv, do perform device-to-device communication.
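
For reference, here is roughly the pattern I am describing, as a minimal sketch rather than the actual test code; the buffer size, the ring-style exchange, and the OpenACC directives are illustrative assumptions.

! Minimal sketch (not the test code itself): device buffers passed directly
! to MPI through OpenACC host_data regions.
program dev_buf_mpi
  use mpi
  implicit none
  integer, parameter :: n = 1048576          ! illustrative buffer size
  double precision, allocatable :: sbuf(:), rbuf(:)
  integer :: rank, nprocs, peer, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  peer = mod(rank + 1, nprocs)               ! ring neighbor, for illustration

  allocate(sbuf(n), rbuf(n))
  sbuf = dble(rank)
  rbuf = 0.0d0

!$acc data copy(sbuf, rbuf)

  ! Reduction on device buffers: this is the call that appears to stage
  ! the data back through the host.
!$acc host_data use_device(sbuf, rbuf)
  call MPI_Allreduce(sbuf, rbuf, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
!$acc end host_data

  ! Point-to-point on device buffers: this does go device to device.
!$acc host_data use_device(sbuf, rbuf)
  call MPI_Sendrecv(sbuf, n, MPI_DOUBLE_PRECISION, peer, 0, &
                    rbuf, n, MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data

!$acc end data

  if (rank == 0) print *, 'rbuf(1) =', rbuf(1)
  call MPI_Finalize(ierr)
end program dev_buf_mpi

In my profiles, the reduction is where the staging through the host shows up, while the point-to-point call transfers directly between the devices.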

-Mat

Hi Mat,
Thanks very much for your comments. I have tested MPI_Send/MPI_Recv, and it is true that GPU-to-GPU communication is much faster than CPU (about 8x). In addition, I have tested my Jacobi iteration code again (see attached).
jac_gpu_g2g.F90 (6.8 KB)
In this code I use two user-defined MPI types, “row” and “col”: “col” is contiguous in memory, while “row” is not.
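
To make clear what I mean by “row” and “col”, here is a simplified sketch of how such derived types are typically built; the extents are illustrative and this is not the exact code in the attachment. In Fortran’s column-major layout, a column of the array is one contiguous block, while a row consists of elements separated by the leading dimension.

! Simplified sketch of the "row"/"col" derived types (illustrative extents,
! not the exact code in jac_gpu_g2g.F90) for an array u(0:ny+1, 0:nx+1).
program row_col_types
  use mpi
  implicit none
  integer, parameter :: nx = 512, ny = 512   ! illustrative interior sizes
  integer :: col_type, row_type, ierr

  call MPI_Init(ierr)

  ! "col": ny contiguous doubles (one interior column of u)
  call MPI_Type_contiguous(ny, MPI_DOUBLE_PRECISION, col_type, ierr)
  call MPI_Type_commit(col_type, ierr)

  ! "row": nx doubles separated by a stride of ny+2 (one interior row of u)
  call MPI_Type_vector(nx, 1, ny+2, MPI_DOUBLE_PRECISION, row_type, ierr)
  call MPI_Type_commit(row_type, ierr)

  ! ... the halo exchange would use these types with the appropriate
  ! starting elements of u; omitted here.

  call MPI_Type_free(col_type, ierr)
  call MPI_Type_free(row_type, ierr)
  call MPI_Finalize(ierr)
end program row_col_types

So each “row” message describes many small strided segments, which is presumably why it behaves so differently from “col”.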
Results for CPU

col time step :             1   1.3610000000000000E-004
 row time step :             1   9.7930999999999990E-003
 col time step :             2   6.7899999999999997E-005
 row time step :             2   2.5761000000000000E-003
 col time step :             3   4.6999999999999997E-005
 row time step :             3   3.1399999999999999E-004
 col time step :             4   7.0100000000000002E-004
 row time step :             4   3.1589999999999998E-004
 col time step :             5   5.1999999999999997E-005
 row time step :             5   3.1799999999999998E-004
 Initialization time:   0.1096659000000000
 Communication time:    1.4635899999999999E-002
 Col communication time:    1.0039999999999999E-003
 Row communication time:    1.3317100000000000E-002
 Computation time:      0.6328073000000000
 Total elapsed time:    0.7724179999999999

Results for GPU-to-GPU

col time step :            1   0.6048249999999999
 row time step :            1    1.077579100000000
 col time step :            2   6.3600999999999996E-003
 row time step :            2    1.106138000000000
 col time step :            3   2.8990000000000000E-004
 row time step :            3    1.110568000000000
 col time step :            4   4.2488999999999999E-003
 row time step :            4    1.162841000000000
 col time step :            5   2.7609999999999999E-004
 row time step :            5    1.097152900000000
 Initialization time:   5.0577100000000000E-002
 Communication time:     6.170643800000000
 Col communication time:    0.6160000000000000
 Row communication time:     5.554279000000000
 Computation time:      6.5922999999999997E-003
 Total elapsed time:     7.208886100000000

The above results show that on the CPU, “row” communication is a little slower than “col” communication, but for GPU-to-GPU, “row” communication is much slower than “col” communication. In addition, the first “col” communication is much slower than in the following steps. This is just a test code for illustration; I can modify my code to use only the MPI_Send/MPI_Recv interfaces. Thanks very much for your comments.

Hi liuzhuang,

I wasn’t sure if this was a question or just an observation, but sending non-contiguous data segments will likely be slower than sending a contiguous block. Typically, if the data is non-contiguous, such as when passing halos, folks will pack the data into a contiguous buffer, send the buffer, then unpack it on the receiving side.
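
For the non-contiguous “row” halo, that idea looks roughly like this; it is a sketch only, and the array shape, neighbor ranks, and OpenACC directives are illustrative assumptions rather than your actual code.

! Sketch: pack a strided row halo into a contiguous device buffer, exchange
! the contiguous buffers between GPUs, then unpack on the receiving side.
program pack_halo
  use mpi
  implicit none
  integer, parameter :: nx = 512, ny = 512   ! illustrative interior sizes
  double precision :: u(0:ny+1, 0:nx+1), sbuf(nx), rbuf(nx)
  integer :: rank, nprocs, up, down, j, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  up   = mod(rank + 1, nprocs)               ! illustrative ring neighbors
  down = mod(rank - 1 + nprocs, nprocs)

  u = dble(rank)

!$acc data copy(u) create(sbuf, rbuf)

  ! Pack the strided row u(1, 1:nx) into a contiguous device buffer.
!$acc parallel loop present(u, sbuf)
  do j = 1, nx
     sbuf(j) = u(1, j)
  end do

  ! Exchange only the contiguous buffers, directly between the GPUs.
!$acc host_data use_device(sbuf, rbuf)
  call MPI_Sendrecv(sbuf, nx, MPI_DOUBLE_PRECISION, down, 0, &
                    rbuf, nx, MPI_DOUBLE_PRECISION, up,   0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data

  ! Unpack the received buffer into the halo row u(ny+1, 1:nx).
!$acc parallel loop present(u, rbuf)
  do j = 1, nx
     u(ny+1, j) = rbuf(j)
  end do

!$acc end data

  call MPI_Finalize(ierr)
end program pack_halo

Because the pack and unpack loops run on the device and MPI only ever sees the contiguous buffers, the transfer itself can use the same device-to-device path as the plain MPI_Send/MPI_Recv case.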

-Mat