Hello everyone:
I would like to use both CUDA Fortran with MPI.
Could anyone suggest any good references or examples for this, please?
Thank you,
Sincerely,
Erin
Hi Erin,
Glad to hear from you!
Sorry, I don’t have anything offhand, but the two are largely separate so there’s nothing special to it. The only two things to consider are whether you want to use CUDA-aware MPI and how to do per-rank device assignment.
To use CUDA-aware MPI, simply pass the device pointers to the MPI calls. You’ll need an MPI that supports CUDA-aware MPI, but both the OpenMPI and HPCX we ship have this enabled by default.
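Roughly, and untested here, it would look something like this in CUDA Fortran (assuming a CUDA-aware OpenMPI build and at least two ranks):

program cuda_aware_sketch
  use mpi
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real, device, allocatable :: a_d(:)
  integer :: myrank, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  allocate(a_d(n))
  a_d = real(myrank)

  ! With CUDA-aware MPI the device array goes directly into the call;
  ! no host staging buffer is needed.
  if (myrank == 0) then
     call MPI_Send(a_d, n, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (myrank == 1) then
     call MPI_Recv(a_d, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE, ierr)
  endif

  deallocate(a_d)
  call MPI_Finalize(ierr)
end program cuda_aware_sketch

On a multi-GPU node you'd combine this with the device assignment shown below so each rank's allocations land on its own GPU.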
There are a few ways to do rank-to-device assignment:
1. Write a wrapper script which sets the environment variable CUDA_VISIBLE_DEVICES to the local rank id.
2. Add a call to cudaSetDevice in the code itself.
Here’s an example of the second approach. I use the local rank id and then take a mod with the number of devices to round-robin the assignment. For example:
% cat test.CUF
program test
  use mpi
  use cudafor
  implicit none
  integer :: nranks, myrank
  integer :: dev, devNum, local_rank, local_comm, ierr

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nranks, ierr)
  call mpi_comm_rank(mpi_comm_world, myrank, ierr)

  ! Create a node-local communicator and get this rank's id within the node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Round-robin the local ranks over the devices on this node
  ierr = cudaGetDeviceCount(devNum)
  dev = mod(local_rank, devNum)
  ierr = cudaSetDevice(dev)

  if (local_rank .eq. 0) then
     print *, "Number of devices: ", devNum
  endif
  print *, "Rank #", myrank, " using device ", dev

  call MPI_finalize(ierr)
end program test
% mpif90 test.CUF
% mpirun -np 4 a.out
Rank # 3 using device 1
Rank # 2 using device 0
Number of devices: 2
Rank # 0 using device 0
Rank # 1 using device 1
Hope this helps,
Mat
This is great!
Thank you so much!
Sincerely,
Erin
Hi Mat,
Sorry for the silly question: why do we need to use MPI_Comm_split_type, since we can do the same thing without it? See the following code, adapted from your code above:
program test
  use mpi
  use cudafor
  implicit none
  integer :: nranks, myrank
  integer :: dev, devNum, local_rank, local_comm, ierr

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nranks, ierr)
  call mpi_comm_rank(mpi_comm_world, myrank, ierr)

  !call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
  !                         MPI_INFO_NULL, local_comm, ierr)
  !call MPI_Comm_rank(local_comm, local_rank, ierr)

  ierr = cudaGetDeviceCount(devNum)
  !dev = mod(local_rank, devNum)
  dev = mod(myrank, devNum)
  ierr = cudaSetDevice(dev)

  !if (local_rank .eq. 0) then
  if (myrank .eq. 0) then
     print *, "Number of devices: ", devNum
  endif
  print *, "Rank #", myrank, " using device ", dev

  call MPI_finalize(ierr)
end program test
Thank you!
Sincerely,
Honggang Wang.
What you have is fine if you’re only using a single node with one rank per GPU. For multi-node runs, the global rank id won’t map to the correct GPU id, in particular if the number of GPUs varies per node or if you are oversubscribing the GPUs.
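For example, say you run 4 ranks across two nodes that each have 2 GPUs, and mpirun happens to place the ranks round-robin by node (ranks 0 and 2 on node 0, ranks 1 and 3 on node 1). With dev = mod(myrank, devNum), ranks 0 and 2 would both pick device 0 on their node and ranks 1 and 3 would both pick device 1 on theirs, so half the GPUs sit idle while the others are shared. With the local rank, each node’s ranks get devices 0 and 1 as intended.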
Feel free to use whatever method you prefer and works for your system. I just prefer using the local rank since it handles more cases.
Thank you Mat.
I just tested the two versions, one using the local rank (mpi_LR), the other not (mpi).
When I run them on 2x RTX 3060, I found mpi_LR is about 2 times faster than mpi (even with 1 process, mpi_LR is faster than mpi).
Do you know why this happens?
I read a blog somewhere which says that mpi_LR (using MPI_Comm_split_type) will create shared memory between processes so that the simulation will be more efficient, but I don’t know exactly how this is done.
Thanks.
Sincerely,
Honggang Wang.
In another case, when I run them on a single-GPU (RTX4000) machine, mpi_LR is slower.
Interesting?
Thanks.
I got this information from the Multiple GPU programming with MPI — GPU programming: why, when and how? documentation:
! Split the world communicator into subgroups, each of which
! contains processes that run on the same node, and which can create a
! shared memory region (via the type MPI_COMM_TYPE_SHARED).
! The call returns a new communicator "host_comm", which is created by
! each subgroup.
call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                         MPI_INFO_NULL, host_comm, ierr)
call MPI_COMM_RANK(host_comm, host_rank, ierr)
I’m not surprised to see very similar code given this has been best practice for quite a while. I’ve been using it for probably 10+ years.
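To be clear about the "shared memory region" comment: the split itself doesn’t allocate any shared memory. It only gives you a communicator containing the ranks on the same node; if you wanted an explicit shared region you would create it yourself on that communicator, for example with MPI_Win_allocate_shared. A rough, untested sketch just to illustrate the idea (it isn’t needed for the device assignment above):

program shared_win_sketch
  use mpi
  use, intrinsic :: iso_c_binding
  implicit none
  integer, parameter :: n = 1024
  integer :: ierr, host_comm, host_rank, win
  integer(kind=MPI_ADDRESS_KIND) :: winsize
  type(c_ptr) :: baseptr
  real, pointer :: buf(:)

  call MPI_Init(ierr)

  ! Node-local communicator, same as before
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, host_comm, ierr)
  call MPI_Comm_rank(host_comm, host_rank, ierr)

  ! Each rank on the node allocates its piece of one shared window;
  ! other ranks on the same node can map and access the same memory.
  winsize = n * 4          ! n reals, 4 bytes each
  call MPI_Win_allocate_shared(winsize, 4, MPI_INFO_NULL, host_comm, &
                               baseptr, win, ierr)
  call c_f_pointer(baseptr, buf, [n])
  buf = real(host_rank)    ! write into this rank's own segment

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program shared_win_sketch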
As for performance, are you doing a significant amount of MPI communication? Yes, using a local shared memory buffer to transfer between ranks on the same node can be faster, but 2x seems like a lot. Also, I’d expect both systems to see the same boost (not the opposite), so I doubt this is the root cause.
I’d double check that the rank-to-GPU mapping is the same. Also, Nsight Systems can profile MPI communication in addition to GPU activity (via "nsys profile --trace=mpi,openacc mpirun <mpi_opts> a.out"), so you should profile each run to see where the difference is coming from.
I find it easier to review the results by looking at the timelines, which means viewing them in the GUI. For me this means running nsys from the command line and then copying the profile to my laptop.
Thank you Mat,
After recompiling and testing again, the two versions now show very close performance.
Thanks.
Sincerely,
Honggang Wang.