Good references/examples for CUDA Fortran with MPI, please?

Hello everyone:

I would like to use CUDA Fortran together with MPI.

Could anyone suggest any good references or examples for this, please?

Thank you,
Sincerely,
Erin

Hi Erin,

Glad to hear from you!

Sorry, I don’t have anything offhand, but the two are largely separate so there’s nothing special to it. The only two things to watch for are CUDA-aware MPI and per-rank device assignment.

To use CUDA-aware MPI, simply pass the device pointers to the MPI calls. You’ll need an MPI library built with CUDA-aware support, but both the OpenMPI and HPC-X we ship have it enabled by default.
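
For example, here’s a minimal sketch of a CUDA-aware exchange (illustrative only and untested; the buffer name and message size are placeholders, and it assumes the MPI library was built with CUDA-aware support):

program cuda_aware_sketch
  use mpi
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real, device, allocatable :: d_buf(:)   ! device-resident buffer
  integer :: myrank, nranks, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  allocate(d_buf(n))
  d_buf = real(myrank)                    ! fill the buffer on the device

  ! The device pointer goes straight into the MPI call; no host staging
  ! is needed when the MPI library is CUDA-aware.
  if (myrank == 0 .and. nranks > 1) then
     call MPI_Send(d_buf, n, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (myrank == 1) then
     call MPI_Recv(d_buf, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  deallocate(d_buf)
  call MPI_Finalize(ierr)
end program cuda_aware_sketch

Without a CUDA-aware MPI, the same exchange would need explicit staging: copy the device array to a host buffer before the send and back to the device after the receive.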

There are a few ways to do rank-to-device assignment:

  1. Write a wrapper script which sets the environment variable CUDA_VISIBLE_DEVICES to the local rank id (for example, OpenMPI exports the local rank as OMPI_COMM_WORLD_LOCAL_RANK).

  2. Add a call to cudaSetDevice

Here’s an example of how to do the second option: I take the node-local rank id mod the number of devices to round-robin the assignment. For example:

% cat test.CUF

program test
  use mpi
  use cudafor
  implicit none
  integer  :: nranks, myrank
  integer :: dev, devNum, local_rank, local_comm, ierr

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nranks,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)

  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, local_comm,ierr)
  call MPI_Comm_rank(local_comm, local_rank,ierr)
  ierr = cudaGetDeviceCount(devNum)
  dev = mod(local_rank,devNum)
  ierr = cudaSetDevice(dev)

  if (local_rank .eq. 0) then
      print *, "Number of devices: ", devNum
  endif
  print *, "Rank #",myrank," using device ", dev
  call MPI_finalize(ierr)

end program test
% mpif90 test.CUF
% mpirun -np 4 a.out
 Rank #            3  using device             1
 Rank #            2  using device             0
 Number of devices:             2
 Rank #            0  using device             0
 Rank #            1  using device             1

Hope this helps,
Mat

This is great!

Thank you so much!

Sincerely,
Erin

Hi Mat,

Sorry for the silly question: why do we need MPI_Comm_split_type when we can do the same thing without it? See the following code, adapted from your code above:

program test
  use mpi
  use cudafor
  implicit none
  integer :: nranks, myrank
  integer :: dev, devNum, local_rank, local_comm, ierr

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nranks,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)

  !call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
  !     MPI_INFO_NULL, local_comm,ierr)
  !call MPI_Comm_rank(local_comm, local_rank,ierr)

  ierr = cudaGetDeviceCount(devNum)

  !dev = mod(local_rank,devNum)
  dev = mod(myrank,devNum)

  ierr = cudaSetDevice(dev)

  !if (local_rank .eq. 0) then
  if (myrank .eq. 0) then
      print *, "Number of devices: ", devNum
  endif
  print *, "Rank #",myrank," using device ", dev
  call MPI_finalize(ierr)

end program test

Thank you!

Sincerely,

Honggang Wang.

What you have is fine if you’re only running on a single node with one rank per GPU. For multi-node runs, the global rank id won’t map to the correct GPU id, in particular if the number of GPUs varies between nodes or if you’re oversubscribing the GPUs. For example, if ranks are placed round-robin across nodes, consecutive global ranks land on different nodes, so mod(myrank, devNum) can put several ranks on the same GPU of a node while leaving its other GPUs idle.

Feel free to use whatever method you prefer and works for your system. I just prefer using the local rank since it handles more cases.

Thank you Mat.

I just tested the two versions: one uses the local rank (mpi_LR), the other does not (mpi).

When I run them on 2x RTX 3060, I find mpi_LR is about 2 times faster than mpi (even with 1 process, mpi_LR is faster than mpi).

Do you know why this happens?

I read somewhere that mpi_LR (using MPI_Comm_split_type) will create shared memory between processes so that the simulation is more efficient, but I don’t know exactly how this is done.

Thanks.

Sincerely,

Honggang Wang.

In another case, when I run them on a single-GPU machine (RTX 4000), mpi_LR is slower.

Interesting, isn’t it?

Thanks.

I got this information from the "Multiple GPU programming with MPI" section of the "GPU programming: why, when and how?" documentation:

! Split the world communicator into subgroups, each of which contains
! processes that run on the same node and can create a shared memory
! region (via the type MPI_COMM_TYPE_SHARED).
! The call returns a new communicator "host_comm", which is created by
! each subgroup.
call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,&
     MPI_INFO_NULL, host_comm,ierr)
call MPI_COMM_RANK(host_comm, host_rank,ierr)

I’m not surprised to see very similar code given that this has been best practice for quite a while. I’ve been using it for probably 10+ years.

As for performance, are you doing a significant amount of MPI communication? Yes, using a local shared-memory buffer to transfer data between ranks on the same node can be faster, but 2x seems like a lot. Also, I’d expect both systems to see the same boost (not the opposite), so I’m skeptical that this is the root cause.

I’d double-check that the rank-to-GPU mapping is the same in both versions (a quick way to check it is sketched at the end of this post). Also, Nsight Systems can profile MPI communication in addition to GPU activity (via "nsys profile --trace=mpi,openacc mpirun <mpi_opts> a.out"), so you should profile each run to see where the difference is coming from.

I find it easier to review the results by looking at the timelines, which means viewing them in the GUI. For me this means running nsys from the command line and then copying the profile to my laptop.
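
For the mapping check, here’s a rough sketch along the lines of the test program above (MPI_Get_processor_name and cudaGetDevice are the only additions; the rest mirrors the earlier example):

program check_mapping
  use mpi
  use cudafor
  implicit none
  character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
  integer :: namelen, myrank, nranks, local_rank, local_comm
  integer :: dev, devNum, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)
  call MPI_Get_processor_name(hostname, namelen, ierr)

  ierr = cudaGetDeviceCount(devNum)
  ierr = cudaSetDevice(mod(local_rank, devNum))
  ierr = cudaGetDevice(dev)   ! the device this rank actually ended up on

  ! Print host name, global rank, node-local rank, and device id so the
  ! rank-to-GPU mapping of the two versions can be compared side by side.
  print *, hostname(1:namelen), " global rank ", myrank, &
           " local rank ", local_rank, " -> device ", dev

  call MPI_Finalize(ierr)
end program check_mapping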

Thank you Mat,

After recompiling and testing, I now get very similar performance from the two versions.

Thanks.

Sincerely,

Honggang Wang.