Hi garcfd,
Most multi-GPU OpenACC codes use MPI since it’s the most straightforward and simplest method. Basically, all you need to do is add OpenACC to your MPI-enabled code and then assign each rank to a device (one device per rank). To use more devices, launch more ranks.
The only MPI+OpenACC-specific “things” are the rank-to-device assignment and, optionally, using CUDA-aware MPI so data transfers are done directly between devices. Otherwise the two models don’t overlap and can be used together.
Here’s the boilerplate code I use for device assignment. It’s typically done right after MPI_Init is called:
#ifdef _OPENACC
use openacc
#endif
...
#ifdef _OPENACC
integer :: dev, devNum, local_rank, local_comm, ierr
integer(acc_device_kind) :: devtype
#endif
...
#ifdef _OPENACC
!
! ****** Set the Accelerator device number based on local rank
!
call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                         MPI_INFO_NULL, local_comm, ierr)
call MPI_Comm_rank(local_comm, local_rank, ierr)
devtype = acc_get_device_type()
devNum = acc_get_num_devices(devtype)
dev = mod(local_rank, devNum)
call acc_set_device_num(dev, devtype)
#endif
This does a round-robin assignment of the devices to the ranks on the local node. Another method folks use is to write a wrapper script which sets the environment variable CUDA_VISIBLE_DEVICES so each rank only “sees” a single device.
To use CUDA Aware MPI, put the MPI Send and Receive calls within an OpenACC host_data region so the device addresses are passed.
!$acc host_data use_device(sendbuf)
...
call mpi_send(....
...
!$acc end host_data
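Putting the pieces together, here’s a minimal (untested) sketch of a ring exchange showing how the device assignment and the host_data region fit into a full program. The buffer names, sizes, and the MPI_Sendrecv pattern are just placeholders for illustration; your code’s communication pattern will of course differ:
program ring_example
  use mpi
  use openacc
  implicit none
  integer, parameter :: n = 1024
  integer :: rank, nranks, left, right, ierr
  integer :: devNum, local_rank, local_comm
  integer(acc_device_kind) :: devtype
  real(8) :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

! Round-robin device assignment, same as the boilerplate above
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)
  devtype = acc_get_device_type()
  devNum = acc_get_num_devices(devtype)
  if (devNum > 0) call acc_set_device_num(mod(local_rank, devNum), devtype)

! Each rank sends to its right neighbor and receives from its left
  right = mod(rank + 1, nranks)
  left  = mod(rank - 1 + nranks, nranks)
  sendbuf = real(rank, 8)

  !$acc data copyin(sendbuf) copyout(recvbuf)
! host_data passes the device addresses to MPI (needs a CUDA-aware MPI)
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, left, 0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
  !$acc end data

  print *, 'rank', rank, 'got data from rank', int(recvbuf(1))
  call MPI_Finalize(ierr)
end program ring_example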
To compile, just add the “-acc” flag to your mpifort command. Though be sure to use an MPI that’s configured for use with nvfortran or another compiler that supports OpenACC.
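For example, assuming your mpifort wraps nvfortran (the file and executable names here are just placeholders), the build and launch would look something like:
mpifort -acc -Minfo=accel ring_example.f90 -o ring_example
mpirun -np 4 ./ring_example
The “-Minfo=accel” flag is optional but handy since it shows the compiler feedback for the OpenACC regions.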
I did write this MPI+OpenACC tutorial, but I haven’t updated it in about 10 years, which was pre-MPI-3 and CUDA-aware MPI, so it doesn’t have my updated device assignment nor use host_data. Though it might better clarify the process.
I was going to point you to Ron Caplan’s POT3D code as an example, but it looks like he’s no longer posting the OpenACC version, just the Fortran STDPAR (i.e. DO CONCURRENT) version. Though he has also posted a simple MPI+OpenACC example you might look at.
The SPEChpc 2021 benchmark suite is also a good example resource for MPI+OpenACC (as well as MPI+OpenMP). Though, while it’s free for academic and non-commercial use, your organization would need to apply for a license.
Hope this helps,
Mat