Hi,
I want to run this Fortran code
program main
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(15,9)
  integer, parameter :: n = 10000000
  integer :: i
  real(kind=dp) :: x
  real(kind=dp) :: partial
  integer :: nsize, rank
  integer :: ierr
  integer :: ini, ter, work
  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  work = n/nsize
  ini = work*rank
  ter = work*(rank+1)
  partial = 0.0_dp
  do concurrent (i = ini:ter-1) reduce(+:partial)
     x = 1.0*(i + 0.5_dp)
     partial = partial + sin(x*x*x)
  end do
  print *, "total = ", partial
  call MPI_Finalize(ierr)
end program main
with MPI so that it uses 2 ranks, one for each GPU card available on our cluster. I am running the Singularity image provided by NVIDIA:
docker://nvcr.io/nvidia/nvhpc:23.1-devel-cuda_multi-ubuntu20.04
I selected one node on Slurm which has 2 GPU cards, but when I monitored the GPU usage only one of them was used. I am running the Singularity container as follows:
singularity exec --nv nvhpc_23.1_devel.sif mpiexec -np 2 program.exe
Is it possible to use the 2 cards with this “do concurrent” workflow, and with Singularity? I saw that in cudafor there is a cudaSetDevice instruction, but that is for CUDA Fortran. Thanks.
You still need to map GPUs to MPI ranks.
The GPU backend for do concurrent uses OpenACC, so you can use the OpenACC runtime API (the device assignment in the attached code is very simple; it should be made more robust). Also, your code is incorrect: you will need an MPI_Reduce or MPI_Allreduce to get the correct final sum (each GPU is only going to compute a partial one).
program main
  use mpi
  !@acc use openacc
  implicit none
  integer, parameter :: dp = selected_real_kind(15,9)
  integer, parameter :: n = 10000000
  integer :: i
  real(kind=dp) :: x
  real(kind=dp) :: partial
  integer :: nsize, rank
  integer :: ierr
  integer :: ini, ter, work
  !@acc integer(acc_device_kind) :: dev_type

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Map each MPI rank to its own GPU (these lines are only compiled when OpenACC is enabled)
  !@acc dev_type = acc_get_device_type()
  !@acc call acc_set_device_num(rank, dev_type)
  !@acc call acc_init(dev_type)

  ! Each rank works on its own chunk of the iteration space
  work = n/nsize
  ini = work*rank
  ter = work*(rank+1)

  partial = 0.0_dp
  do concurrent (i = ini:ter-1) reduce(+:partial)
     x = 1.0_dp*(i + 0.5_dp)
     partial = partial + sin(x*x*x)
  end do

  ! Combine the per-rank partial sums into the final result
  call MPI_Allreduce(MPI_IN_PLACE, partial, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, "total = ", partial

  call MPI_Finalize(ierr)
end program main
If you compile it
mpif90 -stdpar -Minfo=accel forum_mpi.f90
and run with
mpirun -np 2 ./a.out
you should see the correct answer and all the GPUs in use.
I guarded the OpenACC directives/functions, so you can also compile the same file for the CPU:
mpif90 -stdpar=multicore -Minfo forum_mpi.f90
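If you want the device assignment to be more robust than this simple rank == device-number mapping (for example when there are more ranks than GPUs, or when running on several nodes), one option is to derive a node-local rank and take it modulo the number of devices. This is only a sketch of that idea (GPU-only, not guarded like the code above, and the subroutine name is just for illustration); it assumes an MPI-3 library for MPI_Comm_split_type:
subroutine assign_gpu_by_local_rank()
  use mpi
  use openacc
  implicit none
  integer :: ierr, local_comm, local_rank, ndev
  integer(acc_device_kind) :: dev_type

  ! Ranks that share a node end up in the same "shared" communicator,
  ! so local_rank counts ranks per node rather than globally.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  dev_type = acc_get_device_type()
  ndev = acc_get_num_devices(dev_type)

  ! Wrap around if there are more local ranks than GPUs on the node
  ! (device numbering follows the 0-based convention used above).
  call acc_set_device_num(mod(local_rank, ndev), dev_type)
  call acc_init(dev_type)

  call MPI_Comm_free(local_comm, ierr)
end subroutine assign_gpu_by_local_rank
You would call it right after MPI_Init, in place of the three !@acc assignment lines above.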
To add to this: while it’s preferable to put the device assignment in the code via OpenACC or CUDA Fortran, the other way is to use the environment variable CUDA_VISIBLE_DEVICES so that each rank only sees one of the devices.
You’ll want to create a wrapper script that sets CUDA_VISIBLE_DEVICES based on the local rank id. The exact method will depend on the MPI in use; this example uses Open MPI with bash:
wrapper.sh
#!/bin/bash
# Each rank only sees the GPU matching its node-local rank
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
exec "$@"
Then use the script in your mpirun command. Something like:
$ mpirun -np 2 ./wrapper.sh ./a.out
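For reference, since the question mentioned cudaSetDevice: the equivalent in-code assignment with CUDA Fortran is just a couple of calls. A minimal sketch (the subroutine name is made up; pass in the node-local rank if you ever run on more than one node):
subroutine assign_gpu_cuda_fortran(local_rank)
  use cudafor
  implicit none
  integer, intent(in) :: local_rank
  integer :: istat, ndev

  ! Number of CUDA devices visible to this process
  istat = cudaGetDeviceCount(ndev)
  ! Bind this rank to one of them (wraps around if ranks > GPUs)
  istat = cudaSetDevice(mod(local_rank, ndev))
end subroutine assign_gpu_cuda_fortran
This needs the cudafor module, so compile with CUDA Fortran enabled (e.g. -cuda with nvfortran).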
I tried this alternative. I had never seen my 2 GPU cards fully busy until now! Thanks for this recommendation.