Hi,

I want to run this Fortran code

program main

use mpi

implicit none

integer, parameter :: dp=selected_real_kind(15,9)

integer,parameter :: n=10000000

```
integer :: i
real(kind=dp) :: x
real(kind=dp) :: partial
Integer :: nsize, rank
Integer :: ierr
Integer :: ini, ter, work
Call MPI_Init(ierr)
Call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
Call MPI_Comm_rank(MPI_Comm_WORLD, rank, ierr)
work = n/nsize
ini = work*rank
ter = work*(rank+1)
partial = 0.0_dp
do concurrent(i=ini:ter-1) reduce(+:partial)
x = 1.0* (i + 0.5_dp)
partial = partial + sin(x*x*x)
enddo
print *, "total = ", partial
Call MPI_Finalize(ierr)
```

end program main

with MPI such that it could make use of 2 ranks, one for each GPU card that is available on our cluster. I am running the Singularity image provided by envidia:

docker://nvcr.io/nvidia/nvhpc:23.1-devel-cuda_multi-ubuntu20.04

I selected one node on slurm which has 2 GPU cards. But I monitored the GPU usage and only 1 was used. I am running the Singularity container as follows:

singularity exec --nv nvhpc_23.1_devel.sif mpiexec -np 2 program.exe

Is it possible to use the 2 cards with this â€śdo concurrentâ€ť workflow? And with Singularity? I saw that in cudafor there is an instructions cudasetdevice but this is for cuda fortran. Thanks.

You still need to map GPUs to MPI ranks.

The do concurrent backend for GPU is using openACC, so you can use the openACC syntax ( assignment in the attached code is very simple, it should be more robust). Also, your code is incorrect, you will need a MPI_REDUCE or ALLREDUCE to get the correct final sum ( each GPU is going to compute a partial one).

```
program main
use mpi
!@acc use openacc
implicit none
integer, parameter :: dp=selected_real_kind(15,9)
integer,parameter :: n=10000000
integer :: i
real(kind=dp) :: x
real(kind=dp) :: partial
Integer :: nsize, rank
Integer :: ierr
Integer :: ini, ter, work
integer(acc_device_kind) ::dev_type
Call MPI_Init(ierr)
Call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
Call MPI_Comm_rank(MPI_Comm_WORLD, rank, ierr)
!@acc dev_type = acc_get_device_type()
!@acc call acc_set_device_num(rank,dev_type)
!@acc call acc_init(dev_type)
work = n/nsize
ini = work*rank
ter = work*(rank+1)
partial = 0.0_dp
do concurrent(i=ini:ter-1) reduce(+:partial)
x = 1.0* (i + 0.5_dp)
partial = partial + sin(x*x*x)
enddo
call MPI_ALLREDUCE(MPI_IN_PLACE,partial ,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr)
if (rank==0) print *, "total = ", partial
Call MPI_Finalize(ierr)
end program main
```

If you compile it

mpif90 -stdpar -Minfo=accel forum_mpi.f90

and run with

mpirun -np 2 ./a.out

you should see the correct answer and all the GPUs used.

I guarded the openacc directives/functions, so you can also compile the same file for CPU

mpif90 -stdpar=multicore -Minfo forum_mpi.f90

To add, while itâ€™s preferred to add the device assignment to the code via OpenACC or CUDA Fortran, the other way is to use the environment variable CUDA_VISIBLE_DEVICES so each rank only sees one of the devices.

Youâ€™ll want to create a wrapper script to set CUDA_VISIBLE_DEVICES based on the local rank id. The exact method will depend on the MPI in use. This example is using OpenMPI with bash:

wrapper.sh

```
#!/bin/bash
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec $*
```

Then use the script in your mpirun command. Something like:

`$ mpirun -np 2 ./wrapper.sh ./a.out`

1 Like

I tried this alternative. I havenâ€™t seen my 2 GPU cards fully busy up to now! Thanks for this recommendation.