Running NVIDIA Fortran on multiple GPUs with MPI

I have some NVIDIA Fortran code that runs fine on a single GPU (OpenACC)

pgf90 -static-nvidia -m64 -acc -gpu=cc75

and I also have some Fortran that runs fine on CPU with MPI

mpif77 -Wall

but what I don’t get is how to run on multiple GPUs with MPI.

How do you compile such a code? And is there a demo
example that shows how to merge both types of code?

Thanks.

Hi garcfd,

Most multi-GPU OpenACC codes use MPI since it’s the simplest and most straightforward method. Basically, all you need to do is add OpenACC to your MPI-enabled code, then assign each rank to a device (one device per rank). For more devices, launch more ranks.

The only MPI+OpenACC-specific “things” are the rank-to-device assignment and, optionally, using CUDA-aware MPI so data transfers are done directly between devices. Otherwise the two models don’t overlap and can be used concurrently.

Here’s the boilerplate code I use for device assignment, typically done right after MPI_Init is called:

#ifdef _OPENACC
      use openacc
#endif
...
#ifdef _OPENACC
      integer :: dev, devNum, local_rank, local_comm, ierr
      integer(acc_device_kind) :: devtype
#endif
...
#ifdef _OPENACC
!
! ****** Set the Accelerator device number based on local rank
!
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm, ierr)
      call MPI_Comm_rank(local_comm, local_rank, ierr)
      devtype = acc_get_device_type()
      devNum  = acc_get_num_devices(devtype)
      dev     = mod(local_rank, devNum)
      call acc_set_device_num(dev, devtype)
#endif

This does a round-robin assignment of devices to the ranks on the local system. Another method folks use is to write a wrapper script which sets the environment variable CUDA_VISIBLE_DEVICES so each rank only “sees” a single device.
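
For the wrapper-script approach, here’s a minimal sketch (assuming Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK for each rank; other MPIs expose a different variable, so adjust accordingly):

#!/bin/bash
# map each local rank to one GPU by restricting device visibility
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"

You would then launch with something like “mpirun -np 4 ./wrapper.sh ./my.exe”.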

To use CUDA-aware MPI, put the MPI send and receive calls within an OpenACC host_data region so the device addresses are passed.

!$acc host_data use_device(sendbuf) 
...
   call mpi_send(....
...
!$acc end host_data
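
For a fuller picture, here is a minimal self-contained sketch of that pattern (the buffer name, count, and the rank-0-to-rank-1 exchange are illustrative, not from your code; run with at least two ranks):

      program cuda_aware_demo
      use mpi
      implicit none
      integer, parameter :: n = 1024
      integer :: rank, tag, ierr
      real :: buf(n)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      tag = 0
      buf = real(rank)

!$acc data copy(buf)
!     host_data passes the *device* address of buf to MPI, so a
!     CUDA-aware MPI can move the data directly GPU to GPU
!$acc host_data use_device(buf)
      if (rank == 0) then
         call MPI_Send(buf, n, MPI_REAL, 1, tag, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_Recv(buf, n, MPI_REAL, 0, tag, MPI_COMM_WORLD, &
                       MPI_STATUS_IGNORE, ierr)
      end if
!$acc end host_data
!$acc end data

      call MPI_Finalize(ierr)
      end program cuda_aware_demo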

To compile, just add the “-acc” flag to your mpifort command. Though be sure to use an MPI that’s configured for use with nvfortran or another compiler that supports OpenACC.
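
For example, reusing the flags from your original pgf90 line (the source and executable names here are just placeholders):

mpifort -acc -gpu=cc75 -o my.exe my_code.f90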

I did write this MPI+OpenACC tutorial, but I haven’t updated it in about 10 years. It predates MPI-3 and CUDA-aware MPI, so it doesn’t have my updated device assignment nor use host_data, though it might better clarify the overall process.

I was going to point you to Ron Caplan’s POT3D code as an example, but it looks like he’s no longer posting the OpenACC version, just the Fortran STDPAR (i.e. DO CONCURRENT) version. Though he has also posted a simple MPI+OpenACC example you might look at.

The SPEChpc 2021 benchmark suite is also a good example resource for MPI+OpenACC (as well as MPI+OpenMP). It’s free for academic and non-commercial use, but your organization would need to apply for a license.

Hope this helps,
Mat


Thanks Mat, some great advice there. I will come back with some other questions when I have implemented it, no doubt. Regards Giles.

When I run mpifort I get the following errors. Does this mean that the MPI is not correctly linked to the NVIDIA Fortran compiler?

gfortran: error: unrecognized command line option ‘-static-nvidia’
gfortran: error: unrecognized command line option ‘-acc=host’

Also, I’m getting this error message and I can’t figure out why it’s causing a problem…

108 | & MPI_REAL,xmax_rank,tag, cart_comm, MPI_STATUS_IGNORE, ierr)
Error: There is no specific subroutine for the generic ‘mpi_sendrecv’ at (1)

I also get this error message (when compiling with mpifort or mpif90):

18 | #ifdef _OPENACC
Warning: Illegal preprocessor directive

I just realised there are 2 completely different locations:

which pgf90
/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/bin/pgf90
which mpif90
/usr/bin/mpif90

Just realised that I need to modify the PATH to use the correct version of MPI:

PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/mpi/bin:$PATH; export PATH

Now, compiling with the correct mpifort, I am getting only the following errors:

NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 107)
NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 114)
NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 121)
NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 128)
NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 135)
NVFORTRAN-S-0155-Could not resolve generic procedure mpi_sendrecv (ufolbm-35.for: 142)
0 inform, 0 warnings, 6 severes, 0 fatal for mpi_bcs

Looks like I had forgotten to declare the tag variable as an integer…
I don’t know why mpifort doesn’t give a more precise error message, though.
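
For reference, a minimal sketch of an MPI_Sendrecv ring exchange with everything declared (variable names are illustrative, not from the actual code). With an undeclared tag, implicit typing makes it a real, so no specific interface of the generic matches:

      program sendrecv_demo
      use mpi
      implicit none
      integer, parameter :: n = 8
      integer :: tag, rank, nprocs, left, right, ierr
      real :: sendbuf(n), recvbuf(n)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

!     tag and the rank arguments must be default integers, otherwise
!     the compiler cannot resolve the mpi_sendrecv generic interface
      tag   = 0
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      sendbuf = real(rank)

      call MPI_Sendrecv(sendbuf, n, MPI_REAL, right, tag, &
                        recvbuf, n, MPI_REAL, left,  tag, &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

      call MPI_Finalize(ierr)
      end program sendrecv_demo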

Got the code running with mpirun -np 4,
but it is only showing the solution on a quarter of the mesh…!?

Yes, you want to use an MPI configured to use nvfortran. You could use gfortran if you like, but you’d need to change the flags to enable its OpenACC support.
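
For example (a rough sketch, not a tested command line: gfortran’s OpenACC flag is -fopenacc, and GPU offload also requires a GCC build configured with an offload target):

mpif90 -fopenacc -O2 -o my.exe my_code.f90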

but it is only showing the solution on a quarter of the mesh…!?

Sorry, no idea. Though does it work if you disable OpenACC?

No, it doesn’t work if OpenACC is disabled.

Ok, then it’s some issue with your code.

The code is seg faulting, so the best method here is to add “-g” to the compile flags to enable debugging information and then, using a single rank, run it through a debugger like gdb. Something like “mpirun -np 1 gdb my.exe”; then in gdb type “run args”, replacing “args” with the command line arguments for the application. When it hits the segv, type “where” to show where it occurs.
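
Putting those steps together (my.exe and args are placeholders, as above):

mpirun -np 1 gdb ./my.exe
(gdb) run args
(gdb) where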

Did you say there is an option to use -gpu=ccall to compile for all compute capabilities?

Yes, there is the “ccall” option, which will create binary code for all supported devices. It will make your executable a bit bigger, but that’s not a big worry. Though I wouldn’t use it if I were building for my own system, since it’s unnecessary there. It’s primarily used if you’re distributing your executable and want to ensure the code will run on an end user’s system when you don’t know what hardware they have.
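
For example (again, source and executable names are placeholders):

mpifort -acc -gpu=ccall -o my.exe my_code.f90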

Perfect, thanks - seems to work well.

I have got the MPI part working OK now, so next I can add in your boilerplate code to assign each rank to a device. But I had a problem running mpirun on the remote machine (which is a cloud machine from AWS)… So how do I ensure that the correct mpirun is installed on the cloud machine? Presumably I have to add some script to install it every time I run a new case. Also, what does #ifdef _OPENACC actually mean - is that looking for a command line argument?

Are you using containers? Are you running on a single node or multiple nodes?

For containers, we do publish one that you can use at: NVIDIA HPC SDK | NVIDIA NGC

With the documentation at: HPC Container Maker Version 24.11

That’s probably the best way since it ensures that you have all the components you need.

For multi-node runs using containers, I think there are extra steps you need to take to get the communication between nodes set up. Though I don’t know enough about it, so I would need to pull in someone else to help.