Beginner Question about CUDA-aware MPI

Hi,

I am now studying CUDA-aware MPI to further boost my models’ running speed. I have some basic questions.

  1. With OpenACC or stdpar, a variable can refer to both a host copy and a device copy. How does MPI know which one to send, the host copy or the device copy? Do the MPI rules differ between -gpu=managed and -gpu=nomanaged? For example, in the code below, does MPI_Send() send the device copy of A?
    integer, parameter :: N = 10, steps = 180
    integer, allocatable :: A(:), B(:)
    allocate(A(N), B(N))
...
!$acc enter data copyin(A,B)
        if (rank == 0) then
            ! On GPU 0
            !$acc parallel loop present(A)
            do i = 1, N
                A(i) = A(i) + 1
            end do

            ! Send Array A to GPU 1
            call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)

            ! Receive Array B from GPU 1
            call MPI_Recv(B, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        else if (rank == 1) then
            ! On GPU 1
            do concurrent (i = 1:N)
                B(i) = B(i) - 1
            end do

            ! Receive Array A from GPU 0
            call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

            ! Send Array B to GPU 0
            call MPI_Send(B, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
        end if
  2. Is there a simple code example to test whether my compiler supports CUDA-aware MPI? Which MPI version is better, OpenMPI 4.1.8 or OpenMPI 5.0.7?

Thanks a lot!

While CUDA-aware MPI accepts unified memory addresses, it doesn't know whether the data is up to date, so it performs a device-to-host copy and does not use GPU-direct communication. I saw at one point that they wanted to improve this, but I'm not sure where they're at, so assume this is still the case.

Instead, you'll want to explicitly manage the memory via OpenACC data directives and add the "-gpu=nomanaged" flag. (Note the flag name recently changed to "-gpu=mem:separate".)

You then also need to add a host_data region so the device address is passed to the MPI calls. The MPI calls will then detect the device address and perform GPU-direct communication.

For example:

!$acc host_data use_device(A,B)
            ! Send Array A to GPU 1
            call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)

            ! Receive Array B from GPU 1
            call MPI_Recv(B, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data
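
The same applies on the receiving rank. For example, rank 1's calls from your snippet would be wrapped the same way:

!$acc host_data use_device(A,B)
            ! Receive Array A from GPU 0 directly into device memory
            call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

            ! Send Array B to GPU 0 from device memory
            call MPI_Send(B, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
!$acc end host_data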

Is there a simple code example to test whether my compiler supports CUDA-aware MPI?

I suggest you use Nsight Systems to profile the code with the command-line "-t mpi" trace flag. This automatically adds profiling support for the OpenMPI calls and will show you whether the calls are GPU-direct.
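
If you also want a small standalone check, below is a minimal sketch (my own example, not an official test) that sends a device-resident array from rank 0 to rank 1 through a host_data region. Only the device copy on rank 0 holds the value 42, so if the library were sending the host copy, rank 1 would report wrong data. Compile and run with something like "mpif90 -acc=gpu -gpu=mem:separate test_cuda_aware.f90" and "mpirun -np 2 nsys profile -t cuda,mpi ./a.out"; with a non-CUDA-aware build, passing the device address will typically crash or deliver wrong data.

program test_cuda_aware
    use mpi
    implicit none
    integer, parameter :: N = 10
    integer :: rank, nranks, ierr, i
    integer, allocatable :: A(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

    if (nranks < 2) then
        if (rank == 0) print *, "Run with at least 2 ranks"
        call MPI_Finalize(ierr)
        stop
    end if

    allocate(A(N))
    A = 0
    !$acc enter data copyin(A)

    if (rank == 0) then
        ! Fill A on the device only, so the host copy still holds zeros
        !$acc parallel loop present(A)
        do i = 1, N
            A(i) = 42
        end do
        ! Pass the device address of A to MPI_Send
        !$acc host_data use_device(A)
        call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
        !$acc end host_data
    else if (rank == 1) then
        ! Receive directly into the device copy of A
        !$acc host_data use_device(A)
        call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        !$acc end host_data
        ! Bring the received data back to the host to verify it
        !$acc update self(A)
        if (all(A == 42)) then
            print *, "Rank 1 received the device data correctly"
        else
            print *, "Rank 1 received wrong data"
        end if
    end if

    !$acc exit data delete(A)
    deallocate(A)
    call MPI_Finalize(ierr)
end program test_cuda_aware

For OpenMPI-based builds (including HPC-X), running "ompi_info --parsable --all | grep mpi_built_with_cuda_support" should also report whether the library was built with CUDA support.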

Which MPI version is better, OpenMPI 4.1.8 or OpenMPI 5.0.7?

We've had some issues with OpenMPI 5.0.7, so we haven't shipped it with the NVHPC SDK yet. However, the two MPI builds we do ship, OpenMPI 4.x and HPC-X, both work well with CUDA-aware MPI.

Thanks a lot!

I'll try HPC-X. It seems HPC-X comes natively from NVIDIA, so it should be the most compatible.
I found a CUDA-aware MPI blog that says five MPI implementations (MVAPICH2 1.8/1.9b, OpenMPI 1.7 (beta), CRAY MPI (MPT 5.6.2), IBM Platform MPI (8.3), SGI MPI (1.08)) support CUDA-aware MPI, which is why I considered OpenMPI before.

Also, I'll try to study and use Nsight Systems later on.

I tried 3 versions of HPC-X.

  1. hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.16-x86_64.tbz
  2. hpcx-v2.18.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64.tbz
  3. hpcx-v2.21.2-gcc-inbox-redhat8-cuda12-x86_64.tbz

They all report an error like the one below:

[shwa9@gpu10 SC]$ mpif90 -stdpar=gpu -acc=gpu -gpu=nomanaged SC6.f90 -o SC6
NVFORTRAN-F-0004-Corrupt or Old Module file /public/home/users/shwa9/local/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.16-x86_64/ompi/lib/mpi.mod (SC6.f90: 5)
NVFORTRAN/x86-64 Linux 24.5-1: compilation aborted
[shwa9@gpu10 SC]$

The system version, CUDA version, and nvfortran version are below.

[shwa9@gpu10 SC]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[shwa9@gpu10 SC]$ nvidia-smi
Sun Apr 20 15:04:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB-LS        Off |   00000000:18:00.0 Off |                    0 |
| N/A   57C    P0            202W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB-LS        Off |   00000000:3B:00.0 Off |                    0 |
| N/A   65C    P0            217W /  250W |   14210MiB /  16384MiB |     97%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB-LS        Off |   00000000:86:00.0 Off |                    0 |
| N/A   55C    P0            200W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB-LS        Off |   00000000:AF:00.0 Off |                    0 |
| N/A   57C    P0            196W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          269514      C   python                                14206MiB |
|    1   N/A  N/A          269517      C   python                                14206MiB |
|    2   N/A  N/A          269525      C   python                                14206MiB |
|    3   N/A  N/A          269532      C   python                                14206MiB |
+-----------------------------------------------------------------------------------------+

[shwa9@gpu10 apps]$ mpif90 -V

nvfortran 24.5-1 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

So which HPC-X version should I download and use?

Thanks!

Since the packages include "gcc" in their name, I presume they were built with gfortran. Fortran module files aren't compatible between compilers, which is why you're getting the module error.

You'll need to use an HPC-X package that was built with nvfortran, such as the one that ships with the NVHPC SDK.

I got it! MPI is included natively in the NVHPC SDK, so if I can use nvfortran I don't need to download MPI separately.

I tested "!$acc host_data use_device(A,B)". It gives correct results, which shows that CUDA-aware MPI works.

Many thanks!

Besides, I found another great tool in the NVHPC SDK: NVSHMEM. If NVSHMEM can fully replace MPI, I might move to it. But there don't seem to be many beginner examples of NVSHMEM or migration guides, so MPI might still be the most convenient. :-D