About the inefficiency of the CUDA-aware GPU-to-GPU communication

Hi,
I am testing the efficiency of CUDA-aware GPU-to-GPU communication on our small computing platform with nvhpc/23.5. I have run the code offered by mukul1992:

https://github.com/mukul1992/ctest_gpu_mpi

Unlike the results in his/her “mpi.log”, where the Fortran GPU-to-GPU version is 4x slower than the CPU and the C GPU-to-GPU version is 12x faster than the CPU, on our platform both the Fortran and C codes show GPU-to-GPU being about 3x slower than the CPU. I have also tested a Jacobi iteration code and got similar results: GPU-to-GPU communication using CUDA-aware MPI is much slower. I have checked

ompi_info --all | grep btl_openib_have_cuda_gdr

and it is true, but

ompi_info --all | grep btl_openib_have_driver_gdr

is false. Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication? If so, how can I ensure “btl_openib_have_driver_gdr = true” in the NVHPC SDK? I found this parameter in “FAQ: Running CUDA-aware Open MPI”, which suggests that GPUDirect RDMA may require btl_openib_have_driver_gdr to be true.
Thanks in advance.

Zhuang

Hi Zhuang,

Interesting question, and not something I’ve looked at before, so I don’t have great insight here. We’d need to look elsewhere for a more definitive answer, but here’s what I can tell from my experimentation, from reading the documentation, and from looking at the OpenMPI code.

Is this the reason for the inefficiency of the CUDA-aware GPU-to-GPU communication?

Again I’m not positive but don’t think this is the issue. This setting appears to only apply to how the OFED stack is built.

When I profile the code with Nsight Systems, there’s no device-to-device communication; rather, it appears the buffer is brought back to the host to perform the reduction.

This can be confirmed by looking at the OpenMPI source:

It does appear that the reduction is performed on the host, and the extra copy accounts for the performance difference.

It’s not unreasonable that OpenMPI implemented it this way. I wasn’t involved, so I don’t really know, but I suspect that since a reduction needs to be computed, the library would have to launch a CUDA kernel to keep the operation device-to-device, and that may not be possible or may not be as efficient.

Note that I can confirm that other APIs, such as MPI_Send/Recv and MPI_Isend/Irecv, do perform device-to-device communication.
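
As a rough sketch (illustrative only; the program name and sizes are placeholders, and it assumes an OpenACC build with a CUDA-aware MPI), passing a device buffer directly to MPI_Send/MPI_Recv looks something like this:

program sendrecv_gpu
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: buf(:)
  integer :: rank, ierr, stat(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(n))
  buf = real(rank, 8)

  !$acc data copy(buf)
  ! host_data hands MPI the device address of buf, so a CUDA-aware build
  ! can transfer GPU-to-GPU without staging through the host.
  !$acc host_data use_device(buf)
  if (rank == 0) then
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
  end if
  !$acc end host_data
  !$acc end data

  call MPI_Finalize(ierr)
end program sendrecv_gpu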

-Mat

Hi Mat,
Thanks very much for your comments. I have tested MPI_Send/MPI_Recv, and it is true that GPU-to-GPU communication is much faster than the CPU (about 8x). In addition, I have tested my Jacobi iteration code again (see attached).
jac_gpu_g2g.F90 (6.8 KB)
In this code I use the user-defined MPI types “row” and “col”; “col” is contiguous and “row” is not.
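
For context, the “col” and “row” types are defined along these lines (a hedged reconstruction for illustration, not the exact code in jac_gpu_g2g.F90; nx and ny are the local grid sizes):

! Assuming a local array u(0:nx+1, 0:ny+1). Fortran is column-major, so a
! column u(1:nx, j) is contiguous while a row u(i, 1:ny) is strided.
integer :: coltype, rowtype, ierr

! "col": nx contiguous elements
call MPI_Type_contiguous(nx, MPI_DOUBLE_PRECISION, coltype, ierr)
call MPI_Type_commit(coltype, ierr)

! "row": ny elements separated by a stride of nx+2 (the leading dimension)
call MPI_Type_vector(ny, 1, nx+2, MPI_DOUBLE_PRECISION, rowtype, ierr)
call MPI_Type_commit(rowtype, ierr)
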
Results for CPU

col time step :             1   1.3610000000000000E-004
 row time step :             1   9.7930999999999990E-003
 col time step :             2   6.7899999999999997E-005
 row time step :             2   2.5761000000000000E-003
 col time step :             3   4.6999999999999997E-005
 row time step :             3   3.1399999999999999E-004
 col time step :             4   7.0100000000000002E-004
 row time step :             4   3.1589999999999998E-004
 col time step :             5   5.1999999999999997E-005
 row time step :             5   3.1799999999999998E-004
 Initialization time:   0.1096659000000000
 Communication time:    1.4635899999999999E-002
 Col communication time:    1.0039999999999999E-003
 Row communication time:    1.3317100000000000E-002
 Computation time:      0.6328073000000000
 Total elapsed time:    0.7724179999999999

Results for GPU-to-GPU

col time step :            1   0.6048249999999999
 row time step :            1    1.077579100000000
 col time step :            2   6.3600999999999996E-003
 row time step :            2    1.106138000000000
 col time step :            3   2.8990000000000000E-004
 row time step :            3    1.110568000000000
 col time step :            4   4.2488999999999999E-003
 row time step :            4    1.162841000000000
 col time step :            5   2.7609999999999999E-004
 row time step :            5    1.097152900000000
 Initialization time:   5.0577100000000000E-002
 Communication time:     6.170643800000000
 Col communication time:    0.6160000000000000
 Row communication time:     5.554279000000000
 Computation time:      6.5922999999999997E-003
 Total elapsed time:     7.208886100000000

The above results show that on the CPU, “row” communication is somewhat slower than “col” communication, but for GPU-to-GPU, “row” communication is much slower than “col” communication. Also, the first “col” communication is much slower than the subsequent steps. This is just a test code for illustration; I can modify my code to use only the MPI_Send/MPI_Recv interfaces. Thanks very much for your comments.

Hi liuzhuang,

I wasn’t sure if this was a question or just an observation, but sending non-contiguous data segments will likely be slower than sending a contiguous block. Typically, if the data is non-contiguous, such as when passing halos, folks will pack the data into a contiguous buffer, send the buffer, then unpack it on the receiving side. A rough sketch of that pattern is below.
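
This is only an illustrative sketch under assumed names and sizes (u, nx, ny, nbr), not your exact code; it packs a strided “row” halo into a contiguous buffer on the device, exchanges it with a CUDA-aware MPI_Sendrecv, and then unpacks it:

subroutine exchange_row(u, nx, ny, nbr)
  use mpi
  implicit none
  integer, intent(in) :: nx, ny, nbr
  real(8), intent(inout) :: u(0:nx+1, 0:ny+1)
  real(8) :: sendbuf(ny), recvbuf(ny)
  integer :: j, ierr, stat(MPI_STATUS_SIZE)

  ! u is assumed to already be on the device (enclosing !$acc data region).
  !$acc data create(sendbuf, recvbuf) present(u)

  ! Pack the strided boundary row u(1, 1:ny) into a contiguous device buffer.
  !$acc parallel loop
  do j = 1, ny
     sendbuf(j) = u(1, j)
  end do

  ! Exchange the contiguous buffers directly between GPUs (CUDA-aware MPI).
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, ny, MPI_DOUBLE_PRECISION, nbr, 0, &
                    recvbuf, ny, MPI_DOUBLE_PRECISION, nbr, 0, &
                    MPI_COMM_WORLD, stat, ierr)
  !$acc end host_data

  ! Unpack the received data into the ghost row u(0, 1:ny).
  !$acc parallel loop
  do j = 1, ny
     u(0, j) = recvbuf(j)
  end do

  !$acc end data
end subroutine exchange_row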

-Mat


I am trying to figure out why my MPI + GPU code is at least 10x slower than a single GPU alone. I’m guessing it’s all down to the communicated blocks of data. In the code below, fout is of size (ni,nj,nk,15) in the serial case. So by a contiguous buffer, do you mean I need to put the communicated data into a separate array, rather than passing it as a segment of the full array, e.g. feq(1,:,:,:)?

!$acc update host(fout)
CALL MPI_bcs(fout)
!$acc update device(fout)

Possibly, but I suggest you profile both runs using Nsight Systems to determine where the slowdown comes from. With nsys, you can add the flag “-t mpi,openacc,cuda” to trace not only the OpenACC and CUDA runtimes but also all the MPI calls.
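
For instance (the executable and report names here are placeholders):

nsys profile -t mpi,openacc,cuda -o report ./a.out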

do you mean I need to put the communicated data into a separate array, rather than passing it as a segment of the full array eg feq(1,:,:,:)

No, it just needs to be contiguous, which this should be.

I tried to use ncu for some profiling, but it came back with a permissions issue (because I am doing it on the RunPod platform). I have logged it as an issue with them because I tried some of the suggested ways to overcome the problem (on the website below), but none of them worked for me.

Nsight Compute (ncu) isn’t going to help you here since it looks at the hardware performance counters on a per-kernel basis.

You’ll want to use Nsight Systems (nsys) to get a system-level view, including the MPI communication, to understand the performance bottlenecks.

I should note that if you do need to use ncu, then you’ll want to talk to your system admin to get the relevant permission level. See:

nsys is not on the RunPod platform (but ncu is).
Is it possible to install nsys on a remote platform?

Whether you can install it is a question for your system admin.

If you are allowed to install software and don’t have the NVHPC SDK installed (which includes nsys), then you can download nsys from Nsight Systems - Get Started | NVIDIA Developer

I managed to run this command remotely

sudo apt install nsight-systems

But then I couldn’t run nsight-systems due to the graphics requirement.

I am not sure whether nsys is included in the above installation, or whether I should use a different command to install nsys.

The other thing I noticed (below) on RunPod, when using mpirun -np 4, is that the GPU utilization was far from optimal.

I eventually got nsys installed on a remote machine, but when I ran it there were error messages like this:

I am wondering if I could pass you my testcase would you be able to run nsys for me?

I am wondering if I could pass you my testcase would you be able to run nsys for me?

Sure, I can give it a try.

OK, great, I will email it to you in a moment.
Any idea (from the error messages) why nsys crashed?

I can make a wild guess, but I would likely be wrong. Here’s my logic: the segv is occurring in “acc_register_library”. The “acc” could be something generic, but it might be an indication that it’s registering the OpenACC runtime calls, or possibly the OpenACC kernels. The profiler folks are on a different team and I don’t have access to their source, so I don’t really know.

I’ve never encountered this myself, and it seems to me that registration would be a well-exercised bit of code, so something about the way you’re running must be out of the norm. What’s different is that you’re running as root under “sudo”. If the environment isn’t inherited, it’s possible the binary is getting dynamically linked to the wrong set of runtime libraries. I’ve seen problems when the loader picks up the GNU OpenACC runtime.

To test this theory, I’d run “sudo ldd ./ufolbm-gpu-mpi” to see what libraries are getting linked in, and compare the output of “sudo env” and “env” to see if the environments are different.
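
Concretely, something along these lines (the output file names are placeholders):

sudo ldd ./ufolbm-gpu-mpi
sudo env | sort > env_root.txt
env | sort > env_user.txt
diff env_root.txt env_user.txt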

You can also try removing “openacc” from the trace to see if that works around the error.

Now, it is more likely that the problem is specific to your system and setup, though we can only test that by successfully running on a system with a different setup.

In general, you want nsys to profile your code, not mpirun:

mpirun -np 4 --bind-to none nsys profile -t mpi,openacc,cuda -o ufolbm_%q{OMPI_COMM_WORLD_RANK} ./ufolbm-gpu-mpi

This will generate a report for each rank.
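
Once the runs complete, each rank’s report can be summarized from the command line, for example (the report file extension depends on the nsys version, e.g. .nsys-rep or .qdrep):

nsys stats ufolbm_0.nsys-rep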

You may want to take a look at the second edition of the CUDA Fortran book (even though it is not about OpenACC, there is a chapter on multi-GPU programming and several examples of how to use nsys).

Mass, are you sure that’s the problem? I typically use nsys before mpirun so it can profile the MPI calls as well as the whole system. One can certainly profile each rank separately, but that doesn’t give the same view.

In theory, you could put it before, but that approach works only when you have a limited number of ranks. Also, placing it after mpirun allows some logic to generate a profile only from selected ranks.

Let me know if you got the files OK.