Using MPI for with do concurrent in Fortran

peterp2023 · April 4, 2023, 5:21pm

Hi,

I want to run this Fortran code

program main
use mpi
implicit none
integer, parameter :: dp=selected_real_kind(15,9)
integer,parameter :: n=10000000

integer                    :: i
real(kind=dp)              :: x
real(kind=dp)              :: partial

Integer :: nsize, rank
Integer :: ierr
Integer :: ini, ter, work

Call MPI_Init(ierr)

Call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
Call MPI_Comm_rank(MPI_Comm_WORLD, rank, ierr)

work = n/nsize
ini = work*rank
ter = work*(rank+1)
partial = 0.0_dp
do concurrent(i=ini:ter-1) reduce(+:partial)
   x = 1.0* (i + 0.5_dp)
   partial = partial + sin(x*x*x)
enddo

print *, "total = ", partial

Call MPI_Finalize(ierr)

end program main

with MPI such that it could make use of 2 ranks, one for each GPU card that is available on our cluster. I am running the Singularity image provided by envidia:

docker://nvcr.io/nvidia/nvhpc:23.1-devel-cuda_multi-ubuntu20.04

I selected one node on slurm which has 2 GPU cards. But I monitored the GPU usage and only 1 was used. I am running the Singularity container as follows:

singularity exec --nv nvhpc_23.1_devel.sif mpiexec -np 2 program.exe

Is it possible to use the 2 cards with this “do concurrent” workflow? And with Singularity? I saw that in cudafor there is an instructions cudasetdevice but this is for cuda fortran. Thanks.

mfatica · April 5, 2023, 3:34pm

You still need to map GPUs to MPI ranks.
The do concurrent backend for GPU is using openACC, so you can use the openACC syntax ( assignment in the attached code is very simple, it should be more robust). Also, your code is incorrect, you will need a MPI_REDUCE or ALLREDUCE to get the correct final sum ( each GPU is going to compute a partial one).

program main
use mpi
!@acc use openacc
implicit none
integer, parameter :: dp=selected_real_kind(15,9)
integer,parameter :: n=10000000

integer                    :: i
real(kind=dp)              :: x
real(kind=dp)              :: partial

Integer :: nsize, rank
Integer :: ierr
Integer :: ini, ter, work
integer(acc_device_kind) ::dev_type

Call MPI_Init(ierr)

Call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
Call MPI_Comm_rank(MPI_Comm_WORLD, rank, ierr)

!@acc  dev_type = acc_get_device_type()
!@acc  call acc_set_device_num(rank,dev_type)
!@acc  call acc_init(dev_type)

work = n/nsize
ini = work*rank
ter = work*(rank+1)
partial = 0.0_dp
do concurrent(i=ini:ter-1) reduce(+:partial)
   x = 1.0* (i + 0.5_dp)
   partial = partial + sin(x*x*x)
enddo

call MPI_ALLREDUCE(MPI_IN_PLACE,partial ,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr)
if (rank==0) print *, "total = ", partial

Call MPI_Finalize(ierr)
end program main

If you compile it

mpif90 -stdpar -Minfo=accel forum_mpi.f90

and run with

mpirun -np 2 ./a.out

you should see the correct answer and all the GPUs used.

I guarded the openacc directives/functions, so you can also compile the same file for CPU

mpif90 -stdpar=multicore -Minfo forum_mpi.f90

MatColgrove · April 5, 2023, 5:26pm

To add, while it’s preferred to add the device assignment to the code via OpenACC or CUDA Fortran, the other way is to use the environment variable CUDA_VISIBLE_DEVICES so each rank only sees one of the devices.

You’ll want to create a wrapper script to set CUDA_VISIBLE_DEVICES based on the local rank id. The exact method will depend on the MPI in use. This example is using OpenMPI with bash:

wrapper.sh

#!/bin/bash
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec $*

Then use the script in your mpirun command. Something like:
$ mpirun -np 2 ./wrapper.sh ./a.out

peterp2023 · April 9, 2023, 11:00am

I tried this alternative. I haven’t seen my 2 GPU cards fully busy up to now! Thanks for this recommendation.

Topic		Replies	Views
Good reference/examples for CUDA fortran with MPI, please? Legacy PGI Compilers	9	1301	April 5, 2024
Running CUDA-Fortran on multiple GPU nodes nvc, nvc++ and nvfortran	4	816	March 12, 2021
Compilation issues between Fortran with MPI and CUDA Fortran nvc, nvc++ and nvfortran	3	1541	March 2, 2021
Failure when using OpenACC after MPI_Init nvc, nvc++ and nvfortran	7	1586	April 23, 2021
Efficiency of MPI transfer(in cuda fortran + MPI code) Legacy PGI Compilers	0	8933	September 3, 2018
Using multiple GPUs Legacy PGI Compilers	7	22086	August 11, 2009
CUDA Fortran+Openmp problem Legacy PGI Compilers	9	1136	March 3, 2022
MPI+CUDA Fortran Legacy PGI Compilers	4	9006	May 16, 2017
Beginning Question about CUDA-aware MPI nvc, nvc++ and nvfortran	6	48	May 10, 2025
Unusually slow MPI communication between GPUs nvc, nvc++ and nvfortran	1	515	September 5, 2023

Using MPI for with do concurrent in Fortran

Related topics