Hi,
I am testing the efficiency of CUDA-aware GPU-to-GPU communication on our small computing platform, using nvhpc/23.5. I have tested the code offered by mukul1992:
https://github.com/mukul1992/ctest_gpu_mpi
Unlike his/her results in “mpi.log”, where the Fortran GPU-to-GPU code is 4x slower than CPU and the C GPU-to-GPU code is 12x faster than CPU, both the Fortran and C codes on our platform show GPU-to-GPU communication being 3x slower than CPU. I have also tested a Jacobi iteration code and got similar results: GPU-to-GPU communication using CUDA-aware MPI is much slower. I have checked
ompi_info --all | grep btl_openib_have_cuda_gdr
and it is true, but
ompi_info --all | grep btl_openib_have_driver_gdr
is false. Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication? If so, how can I ensure “btl_openib_have_driver_gdr = true” with the nvhpc SDK? I found this parameter in “FAQ: Running CUDA-aware Open MPI”, which suggests that GPUDirect RDMA may require btl_openib_have_driver_gdr to be true.
Thanks in advance.
Interesting question and not something I’ve looked at before, so I don’t have great insight here. We’d need to look elsewhere for a more definitive answer, but here’s what I can tell from my experimentation, reading the documentation, and looking at the OpenMPI code.
Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication?
Again, I’m not positive, but I don’t think this is the issue. That setting appears to apply only to how the OFED stack is built.
When I profile the code with Nsight Systems, there’s no device-to-device communication; rather, it appears the buffer is brought back to the host to perform the reduction.
This can be confirmed by looking at the OpenMPI source:
It does appear that the reduction is performed on the host, and the extra copies account for the performance difference.
It’s not unreasonable that OpenMPI implemented it this way. I was not involved, so I don’t really know, but I suspect that since the reduction needs to be computed, the library would need to launch a CUDA kernel to do it on the device, which in turn would allow device-to-device communication. This may not be possible, or may not be as efficient.
Note that I can confirm that other APIs, such as MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv, do perform device-to-device communication.
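For example, here’s a minimal sketch of what that looks like from OpenACC Fortran (the program name, buffer size, and tags are illustrative, not taken from any code in this thread): passing the device address of the buffer to MPI_Send/MPI_Recv through a host_data region is what lets a CUDA-aware MPI move the data directly between GPUs.

program g2g_sendrecv
  ! Sketch: exchange a device-resident buffer between two ranks with a
  ! CUDA-aware MPI. Sizes, tags, and names are illustrative.
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  real(8) :: buf(n)
  integer :: rank, ierr, stat(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  buf = real(rank, 8)
  !$acc data copy(buf)
  ! host_data exposes the device address of buf to the MPI calls,
  ! so a CUDA-aware MPI can move the data GPU to GPU.
  !$acc host_data use_device(buf)
  if (rank == 0) then
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
  end if
  !$acc end host_data
  !$acc end data

  call MPI_Finalize(ierr)
end program g2g_sendrecv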
Hi Mat,
Thanks very much for your comments. I have tested MPI_Send/MPI_Recv, and it is true that the GPU-to-GPU communication is much faster than CPU (about 8x). In addition, I have tested my Jacobi iteration code again (see attached): jac_gpu_g2g.F90 (6.8 KB)
In this code, I have used the user-defined MPI types “row” and “col”: “col” is contiguous and “row” is not contiguous.
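For illustration, such types are typically defined along these lines for a 2-D Fortran array a(nx,ny); the extents here are placeholders and the exact declarations are in the attached file:

subroutine make_halo_types(nx, ny, col_type, row_type)
  ! Sketch: derived types for a Fortran array a(nx,ny).
  ! A column a(:,j) is contiguous (nx elements, stride 1);
  ! a row a(i,:) is strided (ny elements, stride nx between elements).
  use mpi
  implicit none
  integer, intent(in)  :: nx, ny
  integer, intent(out) :: col_type, row_type
  integer :: ierr
  call MPI_Type_contiguous(nx, MPI_DOUBLE_PRECISION, col_type, ierr)
  call MPI_Type_vector(ny, 1, nx, MPI_DOUBLE_PRECISION, row_type, ierr)
  call MPI_Type_commit(col_type, ierr)
  call MPI_Type_commit(row_type, ierr)
end subroutine make_halo_types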
Results for CPU
col time step : 1 1.3610000000000000E-004
row time step : 1 9.7930999999999990E-003
col time step : 2 6.7899999999999997E-005
row time step : 2 2.5761000000000000E-003
col time step : 3 4.6999999999999997E-005
row time step : 3 3.1399999999999999E-004
col time step : 4 7.0100000000000002E-004
row time step : 4 3.1589999999999998E-004
col time step : 5 5.1999999999999997E-005
row time step : 5 3.1799999999999998E-004
Initialization time: 0.1096659000000000
Communication time: 1.4635899999999999E-002
Col communication time: 1.0039999999999999E-003
Row communication time: 1.3317100000000000E-002
Computation time: 0.6328073000000000
Total elapsed time: 0.7724179999999999
Results for GPU-to-GPU
col time step : 1 0.6048249999999999
row time step : 1 1.077579100000000
col time step : 2 6.3600999999999996E-003
row time step : 2 1.106138000000000
col time step : 3 2.8990000000000000E-004
row time step : 3 1.110568000000000
col time step : 4 4.2488999999999999E-003
row time step : 4 1.162841000000000
col time step : 5 2.7609999999999999E-004
row time step : 5 1.097152900000000
Initialization time: 5.0577100000000000E-002
Communication time: 6.170643800000000
Col communication time: 0.6160000000000000
Row communication time: 5.554279000000000
Computation time: 6.5922999999999997E-003
Total elapsed time: 7.208886100000000
The above results show that when using the CPU, “row” communication is a little slower than “col” communication, but for GPU-to-GPU, “row” communication is much slower than “col” communication. Also, the first “col” communication is much slower than the following steps. This is just a test code for illustration; I can modify my code to use just the MPI_Send/MPI_Recv interfaces, so thanks very much for your comments.
I wasn’t sure if this was a question or just an observation, but sending non-contiguous data segments will likely be slower than sending a contiguous block. Typically, if the data is non-contiguous, such as when passing halos, folks will pack the data into a contiguous buffer, send the buffer, then unpack it on the receiving side.
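As a rough sketch (assuming OpenACC-managed device data and a CUDA-aware MPI; the array layout, neighbor ranks, and which row forms the halo are illustrative):

subroutine exchange_row_halo(a, nx, ny, left, right)
  ! Sketch: pack a strided "row" halo into a contiguous device buffer,
  ! exchange it with the neighbor ranks, and unpack on the receiving side.
  ! Assumes a(:,:) is already present on the device (e.g. inside an
  ! enclosing !$acc data region).
  use mpi
  implicit none
  integer, intent(in)    :: nx, ny, left, right
  real(8), intent(inout) :: a(nx, ny)
  real(8) :: sendbuf(ny), recvbuf(ny)
  integer :: j, ierr, stat(MPI_STATUS_SIZE)

  !$acc data create(sendbuf, recvbuf) present(a)

  ! Pack the non-contiguous row a(nx-1,:) into a contiguous device buffer.
  !$acc parallel loop
  do j = 1, ny
     sendbuf(j) = a(nx-1, j)
  end do

  ! Pass device addresses to the (CUDA-aware) MPI call.
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, ny, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, ny, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, stat, ierr)
  !$acc end host_data

  ! Unpack into the ghost row, still on the device.
  !$acc parallel loop
  do j = 1, ny
     a(1, j) = recvbuf(j)
  end do

  !$acc end data
end subroutine exchange_row_halo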
I am trying to figure out why my MPI + GPU code is at least 10x slower than a single GPU alone. I’m guessing it’s all down to the communicated blocks of data. In the code below, fout is of size (ni,nj,nk,15) in the serial case. So by contiguous buffer, do you mean I need to put the communicated data into a separate array, rather than passing it as a segment of the full array, e.g. feq(1,:,:,:)?
Possibly, but I suggest you profile both runs using Nsight Systems to determine where the slowdown comes from. With nsys, you can add the flag “-t mpi,openacc,cuda” to trace not only the OpenACC and CUDA runtimes but also all the MPI calls.
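Something like this (the rank count, report name, and executable name are placeholders for your run):

nsys profile -t mpi,openacc,cuda -o jacobi_report mpirun -np 2 ./your_app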
do you mean I need to put the communicated data into a separate array, rather than passing it as a segment of the full array, e.g. feq(1,:,:,:)?
No, it just needs to be contiguous, which this should be.
I tried to use ncu for some profiling, but it came back with a permissions issue (because I am doing it on the runpod platform). I have logged it as an issue with them because I tried some of the suggested ways to overcome the problem (on the website below), but none of them worked for me.
I can make a wild guess, but I would likely be wrong. Here’s my logic: the segv is occurring in “acc_register_library”. The “acc” could be something generic, but it might be an indication that it’s registering the OpenACC runtime calls, or possibly the OpenACC kernels. The profiler folks are on a different team and I don’t have access to their source, so I don’t really know.
I’ve never encountered this myself, and it seems to me that registration would be a well-exercised bit of code. Hence, something about the way you’re running is out of the norm. What’s different is that you’re running as root under “sudo”. If the environment isn’t inherited, it’s possible the binary is getting dynamically linked to the wrong set of runtime libraries. I’ve seen problems when the loader picks up the GNU OpenACC runtime.
To test this theory, I’d run “sudo ldd ./ufolbm-gpu-mpi” to see what libraries are getting linked in, and compare the output of “sudo env” and “env” to see whether the environments differ.
You can also try removing “openacc” from the trace to see if that works around the error.
Now, it is more likely that the problem is specific to your system and setup, though we can only test that by successfully running on a system with a different setup.
You may want to take a look at the second edition of the CUDA Fortran book (even though it is not about OpenACC, there is a chapter on multi-GPU programming and several examples of how to use nsys).
Mass, are you sure that’s the problem? I typically use nsys before mpirun so it can profile the MPI calls as well as the whole system. One can certainly profile each rank separately, but that doesn’t give the same view.
In theory, you could place it before mpirun, but that approach works well only when you have a limited number of ranks. Also, placing it after mpirun allows some logic to generate a profile only from selected ranks.
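For example, a small wrapper script along these lines (a sketch; OMPI_COMM_WORLD_RANK is Open MPI’s per-rank environment variable, and the script, report, and executable names are placeholders) profiles only rank 0:

#!/bin/bash
# profile_rank0.sh: run nsys on rank 0 only, launch the app directly on other ranks.
if [ "${OMPI_COMM_WORLD_RANK}" = "0" ]; then
    exec nsys profile -t mpi,openacc,cuda -o rank0_report "$@"
else
    exec "$@"
fi

It would then be launched as, e.g., mpirun -np 8 ./profile_rank0.sh ./your_app.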