Beginner Question about CUDA-aware MPI

Hi,

I am now studying CUDA-aware MPI to further boost my models’ running speed. I have some basic questions.

  1. With OpenACC or stdpar, a variable can refer to both a host copy and a device copy. How does MPI know which one to send, the host copy or the device copy? Do the MPI rules differ between -gpu=managed and -gpu=nomanaged? For example, in the code below, does MPI_Send() send the device copy of A?
    integer, parameter :: N = 10, steps = 180
    integer, allocatable :: A(:), B(:)
    allocate(A(N), B(N))
...
!$acc enter data copyin(A,B)
        if (rank == 0) then
            ! On GPU 0
            !$acc parallel loop present(A)
            do i = 1, N
                A(i) = A(i) + 1
            end do

            ! Send Array A to GPU 1
            call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)

            ! Receive Array B from GPU 1
            call MPI_Recv(B, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        else if (rank == 1) then
            ! On GPU 1
            do concurrent (i = 1:N)
                B(i) = B(i) - 1
            end do

            ! Receive Array A from GPU 0
            call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

            ! Send Array B to GPU 0
            call MPI_Send(B, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
        end if
  2. Is there a simple code example to test whether my compiler supports CUDA-aware MPI? Which MPI version is better, OpenMPI 4.1.8 or OpenMPI 5.0.7?

Thanks a lot!

While CUDA-aware MPI accepts unified memory addresses, it doesn't know whether the data is up to date, so it performs a device-to-host copy and does not use GPU-direct communication. I saw at one point that they wanted to improve this, but I'm not sure where they're at, so assume this is still the case.

Instead, you'll want to explicitly manage the memory via OpenACC data directives and add the "-gpu=nomanaged" flag. (Note the flag name recently changed to "-gpu=mem:separate".)

You then also need to add a host_data region so the device address is passed to the MPI calls. The MPI calls will then detect the device address and perform GPU-direct communication.

For example:

!$acc host_data use_device(A,B)
            ! Send Array A to GPU 1
            call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)

            ! Receive Array B from GPU 1
            call MPI_Recv(B, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data
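
The same applies on the receiving rank. For example, rank 1's calls from your snippet would be wrapped the same way:

!$acc host_data use_device(A,B)
            ! Receive Array A from GPU 0 directly into device memory
            call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

            ! Send Array B to GPU 0 from device memory
            call MPI_Send(B, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
!$acc end host_data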

Is there a simple code example to test whether my compiler supports CUDA-aware MPI?

I suggest you use Nsight Systems to profile the code with the command-line "-t mpi" trace flag. This automatically adds profiling support for the OpenMPI calls and will show you whether the calls are GPU-direct.
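
If you also want a small standalone check, below is a minimal sketch (my own example, not an official test) that sends a device-resident array from rank 0 to rank 1 through a host_data region. Only the device copy on rank 0 holds the value 42, so if the library were sending the host copy, rank 1 would report wrong data. Compile and run with something like "mpif90 -acc=gpu -gpu=mem:separate test_cuda_aware.f90" and "mpirun -np 2 nsys profile -t cuda,mpi ./a.out"; with a non-CUDA-aware build, passing the device address will typically crash or deliver wrong data.

program test_cuda_aware
    use mpi
    implicit none
    integer, parameter :: N = 10
    integer :: rank, nranks, ierr, i
    integer, allocatable :: A(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

    if (nranks < 2) then
        if (rank == 0) print *, "Run with at least 2 ranks"
        call MPI_Finalize(ierr)
        stop
    end if

    allocate(A(N))
    A = 0
    !$acc enter data copyin(A)

    if (rank == 0) then
        ! Fill A on the device only, so the host copy still holds zeros
        !$acc parallel loop present(A)
        do i = 1, N
            A(i) = 42
        end do
        ! Pass the device address of A to MPI_Send
        !$acc host_data use_device(A)
        call MPI_Send(A, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
        !$acc end host_data
    else if (rank == 1) then
        ! Receive directly into the device copy of A
        !$acc host_data use_device(A)
        call MPI_Recv(A, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        !$acc end host_data
        ! Bring the received data back to the host to verify it
        !$acc update self(A)
        if (all(A == 42)) then
            print *, "Rank 1 received the device data correctly"
        else
            print *, "Rank 1 received wrong data"
        end if
    end if

    !$acc exit data delete(A)
    deallocate(A)
    call MPI_Finalize(ierr)
end program test_cuda_aware

For OpenMPI-based builds (including HPC-X), running "ompi_info --parsable --all | grep mpi_built_with_cuda_support" should also report whether the library was built with CUDA support.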

Which MPI version is better, OpenMPI 4.1.8 or OpenMPI 5.0.7?

We've had some issues with OpenMPI 5.0.7, so we haven't shipped it with the NVHPC SDK yet. However, the two MPI builds we do ship, OpenMPI 4.x and HPC-X, both work well with CUDA-aware MPI.

Thanks a lot!

I'll try HPC-X. It seems HPC-X comes natively from NVIDIA, so it should be the most compatible.
I found a CUDA-aware MPI blog that says five MPI implementations (MVAPICH2 1.8/1.9b, OpenMPI 1.7 (beta), CRAY MPI (MPT 5.6.2), IBM Platform MPI (8.3), SGI MPI (1.08)) support CUDA-aware MPI, which is why I considered OpenMPI before.

Also, I'll try to study and use Nsight Systems later on.

I tried 3 versions of HPC-X.

  1. hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.16-x86_64.tbz
  2. hpcx-v2.18.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64.tbz
  3. hpcx-v2.21.2-gcc-inbox-redhat8-cuda12-x86_64.tbz

They all report an error like the one below:

[shwa9@gpu10 SC]$ mpif90 -stdpar=gpu -acc=gpu -gpu=nomanaged SC6.f90 -o SC6
NVFORTRAN-F-0004-Corrupt or Old Module file /public/home/users/shwa9/local/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.16-x86_64/ompi/lib/mpi.mod (SC6.f90: 5)
NVFORTRAN/x86-64 Linux 24.5-1: compilation aborted
[shwa9@gpu10 SC]$

The system version, CUDA version, and nvfortran version are below.

[shwa9@gpu10 SC]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[shwa9@gpu10 SC]$ nvidia-smi
Sun Apr 20 15:04:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB-LS        Off |   00000000:18:00.0 Off |                    0 |
| N/A   57C    P0            202W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB-LS        Off |   00000000:3B:00.0 Off |                    0 |
| N/A   65C    P0            217W /  250W |   14210MiB /  16384MiB |     97%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB-LS        Off |   00000000:86:00.0 Off |                    0 |
| N/A   55C    P0            200W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB-LS        Off |   00000000:AF:00.0 Off |                    0 |
| N/A   57C    P0            196W /  250W |   14210MiB /  16384MiB |     98%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          269514      C   python                                14206MiB |
|    1   N/A  N/A          269517      C   python                                14206MiB |
|    2   N/A  N/A          269525      C   python                                14206MiB |
|    3   N/A  N/A          269532      C   python                                14206MiB |
+-----------------------------------------------------------------------------------------+

[shwa9@gpu10 apps]$ mpif90 -V

nvfortran 24.5-1 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

So which HPC-X version should I download and use?

Thanks!

Since the packages include "gcc" in their name, I presume they were built with gfortran. Fortran module files aren't compatible between compilers, which is why you're getting the module error.

You'll need to use an HPC-X package that was built with nvfortran, such as the one that ships with the NVHPC SDK.

I got it! MPI is included natively in the NVHPC SDK, so if I can use nvfortran I don't need to download MPI separately.

I tested "!$acc host_data use_device(A,B)". It gives correct results, which shows that CUDA-aware MPI works.

Many thanks!

Besides, I found another great tool in the NVHPC SDK: NVSHMEM. If NVSHMEM can fully replace MPI, I might move to it. But there don't seem to be many beginner examples of NVSHMEM or migration guides, so MPI might still be the most convenient. :-D