Things go wrong after a "host_data use_device" section

Hi all,

I’m building a small test case to investigate (understand?) MPI communications from GPU to GPU, and I do not understand why it fails.
I launch 2 processes. They:

  • initialize an array on the GPU (lines 19 to 25)
  • offload a scalar on the GPU to store the max value (line 29)
  • calculate the max value on the GPU and put it in this offloaded scalar (lines 31 to 37)
  • run an MPI_Allreduce call on this scalar to compute a global maximum value, using a “host_data” section (lines 43 to 45)

After leaving this host_data region, I cannot update the scalar on the host (line 51, just to print it): it segfaults.
The only thing that works is an exit data directive on this scalar before the print (line 52).

If I remove the host_data section and the MPI_Allreduce call (lines 42 to 45), the update directive works (but then of course the value is not reduced on the GPU as I need).

Any idea of what I am doing wrong? The program is attached.

I’m using nvhpc-openmpi3/24.1 or nvhpc/24.1

mpifort -acc main.f90 -o main
mpirun --mca btl ^openib -n 2 ./main

Running on my laptop with only one T600 GPU.

Thanks for any advice.

main.f90.txt (1.8 KB)

I think you’ve stumbled onto what appears to be a subtle bug here. I’ll check with a few colleagues, and if it is a bug, I’ll report it to our engineering team. It has some very bug-like behaviors, both those you’ve seen and some I stumbled onto myself: if I add a print of “localmaxi” on the GPU right before the reduction, I can get the update to work properly. More interestingly for you, the following appears to work properly and gives the right answer:

program mpi_on_gpu

implicit none

include "mpif.h"

integer :: myrank, allranks, error, i, j, localmaxi, maxi
integer, parameter :: N=1000
integer, dimension(N,N) :: tab

call MPI_init(error)
call MPI_comm_size(MPI_COMM_WORLD, allranks, error)
call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)
write(6,'(2(a,i0))') "Hello from ",myrank," over ",allranks

tab(:,:) = 0
localmaxi = 0

! Initialization of the arrays only on GPU, keep 0 on host
!$acc enter data create(tab) copyin(localmaxi)
!$acc parallel loop collapse(2) present(tab)
do j=1,N
   do i=1,N
      tab(i,j) = (i+j)*(myrank+1)
   end do
end do

! I need the variable on the GPU to calculate the max of the array

! Run the reduction on the GPU
!$acc parallel loop collapse(2) present(tab,localmaxi) reduction(max:localmaxi)
do j=1,N
   do i=1,N
      if (localmaxi .lt. tab(i,j)) localmaxi = tab(i,j)
   end do
end do

! Check that on the Host the variable is still 0
write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host before reduce is ",localmaxi," (should be 0)"

! Now do a reduction between GPU values (so on GPU only). Each process should have its max on the GPU
!$acc host_data use_device (localmaxi)
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, error)
!$acc end host_data

! Check that on the Host the variable is still 0
write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host after reduce is ",localmaxi," (should be 0)"

! Bring back the information on the host (but keep it also on GPU)
!$acc update self(localmaxi)
! $acc exit data copyout(localmaxi)
write(6,'(3(a,i0),a)') "On ",myrank," the maximum on the host after update is ",localmaxi,"(should be ",(N+N)*allranks,")"

call MPI_finalize(error)

end program mpi_on_gpu

The only difference here is that when you’re creating the tab array on the GPU, I also put localmaxi there as well (on top of initializing it earlier). With that change the run didn’t segfault, and I got the following result (I also added some text to the later outputs to clarify exactly where they occur in the code):

Hello from 0 over 2
Hello from 1 over 2
On 0 the maximum on the host before reduce is 0 (should be 0)
On 1 the maximum on the host before reduce is 0 (should be 0)
On 0 the maximum on the host after reduce is 0 (should be 0)
On 0 the maximum on the host after update is 4000(should be 4000)
On 1 the maximum on the host after reduce is 0 (should be 0)
On 1 the maximum on the host after update is 4000(should be 4000)

Also please note that to properly use multiple GPUs in an MPI run, you’d likely need to call ‘acc_set_device_num’ so that each MPI rank is assigned to a unique GPU.
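For reference, here’s a minimal sketch of that binding, assuming the NVHPC OpenACC runtime; the modulo mapping on the world rank is only illustrative (in a multi-node job you’d typically derive a node-local rank first), and the program name is made up:

program bind_rank_to_gpu

use mpi
use openacc

implicit none

integer :: myrank, error, ngpus

call MPI_init(error)
call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)

! Query how many NVIDIA devices this process can see and pick one per rank
! (NVHPC numbers OpenACC devices from 0, matching the CUDA device IDs)
ngpus = acc_get_num_devices(acc_device_nvidia)
if (ngpus > 0) call acc_set_device_num(mod(myrank, ngpus), acc_device_nvidia)

call MPI_finalize(error)

end program bind_rank_to_gpu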

Hi Patrick,

What error are you seeing?

When I run the code, I’m seeing an intermittent seg fault that occurs with or without the host_data directive, due to the missing “comm” argument to MPI_Allreduce. Since you’re using the F77 MPI header file, the arguments aren’t checked; if you switch to the F90 “use mpi” module, you’ll get a syntax error.

To fix:

!call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, error)
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD, error)
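For reference, a minimal sketch of the module-based variant (assuming the Open MPI that ships with NVHPC); with the explicit interfaces from “use mpi”, dropping the communicator becomes a compile-time error rather than a run-time crash. I’ve used MPI_INTEGER, the Fortran-side datatype, and the program name is just for illustration:

program allreduce_check

use mpi   ! explicit interfaces, unlike include 'mpif.h'

implicit none

integer :: localmaxi, error

call MPI_init(error)
localmaxi = 0

! Full argument list: buffer, count, datatype, op, communicator, ierror.
! Omitting MPI_COMM_WORLD here is rejected at compile time.
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INTEGER, MPI_MAX, MPI_COMM_WORLD, error)

call MPI_finalize(error)

end program allreduce_check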

If you’re seeing something different, please let me know.

-Mat

Good catch, Mat! Strange that I was even able to get it to work without that. Maybe it’s something about the F77 MPI header.

Hi Mat, hi Scamp1,
I am so ashamed of this mistake! Focusing on the OpenACC directives to understand what could be wrong, I didn’t check my MPI call. Really sorry for the noise.

I was using the mpi module but at compile time I was getting the error:

NVFORTRAN-S-0155-Could not resolve generic procedure mpi_allreduce

and I thought it was related to the device status of the data argument, not to a missing communicator argument.

Really sorry

Patrick

Hi,

Seems everything has been resolved, but in any case, you may be interested in this code that does GPU MPI Fortran tests:

– Ron