I think you've stumbled onto what appears to be a subtle bug here. I'll check with a few colleagues, and if it is a bug I'll report it to our engineering team. It has some very bug-like behaviors, both the ones you've seen and a few I stumbled onto myself. For example, if I add a print of "localmaxi" on the GPU right before the reduction, I can get the update to work properly (a rough sketch of what I mean by that is shown after the output below). More interestingly for you, the following program appears to work properly and gives the right answer:
program mpi_on_gpu
  implicit none
  include "mpif.h"
  integer :: myrank, allranks, error, i, j, localmaxi, maxi
  integer, parameter :: N=1000
  integer, dimension(N,N) :: tab

  call MPI_init(error)
  call MPI_comm_size(MPI_COMM_WORLD, allranks, error)
  call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)
  write(6,'(2(a,i0))') "Hello from ",myrank," over ",allranks

  tab(:,:) = 0
  localmaxi = 0

  ! Initialization of the array only on the GPU, keep 0 on the host
  !$acc enter data create(tab) copyin(localmaxi)
  !$acc parallel loop collapse(2) present(tab)
  do j=1,N
    do i=1,N
      tab(i,j) = (i+j)*(myrank+1)
    end do
  end do

  ! I need the variable on the GPU to compute the max of the array,
  ! so run the reduction on the GPU
  !$acc parallel loop collapse(2) present(tab,localmaxi) reduction(max:localmaxi)
  do j=1,N
    do i=1,N
      if (localmaxi .lt. tab(i,j)) localmaxi = tab(i,j)
    end do
  end do

  ! Check that on the host the variable is still 0
  write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host before reduce is ",localmaxi," (should be 0)"

  ! Now do a reduction between GPU values (so on the GPU only).
  ! Each process should have its max on the GPU.
  !$acc host_data use_device(localmaxi)
  call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INTEGER, MPI_MAX, MPI_COMM_WORLD, error)
  !$acc end host_data

  ! Check that on the host the variable is still 0
  write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host after reduce is ",localmaxi," (should be 0)"

  ! Bring the result back to the host (but keep it also on the GPU)
  !$acc update self(localmaxi)
  ! $acc exit data copyout(localmaxi)
  write(6,'(3(a,i0),a)') "On ",myrank," the maximum on the host after update is ",localmaxi,"(should be ",(N+N)*allranks,")"

  call MPI_finalize(error)
end program mpi_on_gpu
The only difference here is that when creating the tab array on the GPU, I also put localmaxi on the device at the same point (in addition to initializing it earlier on the host). With this version the run didn't segfault, and I got the following output (I also added some text to the later writes to clarify exactly where in the code they occur):
Hello from 0 over 2
Hello from 1 over 2
On 0 the maximum on the host before reduce is 0 (should be 0)
On 1 the maximum on the host before reduce is 0 (should be 0)
On 0 the maximum on the host after reduce is 0 (should be 0)
On 0 the maximum on the host after update is 4000(should be 4000)
On 1 the maximum on the host after reduce is 0 (should be 0)
On 1 the maximum on the host after update is 4000(should be 4000)
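Going back to the device-side print I mentioned at the top: the sketch below shows the kind of thing I mean. It's only an illustration of the workaround, not part of the program above, and it assumes nvfortran's support for list-directed PRINT in device code. Placing something like this just before the reduction loop was enough to make the later update return the correct value in my tests.
!$acc serial present(localmaxi)
  ! Read localmaxi on the device right before the reduction (debug aid)
  print *, "device value of localmaxi before reduction: ", localmaxi
!$acc end serial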
Also please note that to properly use multiple GPUs in an MPI run, you'd likely also need to call acc_set_device_num so that each MPI rank is assigned to a unique GPU; a minimal sketch follows.
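This is only a rough sketch (not from the program above). It assumes NVHPC's zero-based device numbering, at least one visible GPU per node, and that ranks are spread evenly across nodes, so the global rank modulo the device count is a reasonable choice:
program assign_gpu_per_rank
  use openacc
  implicit none
  include "mpif.h"
  integer :: myrank, error, ngpus, mydev

  call MPI_init(error)
  call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)

  ! How many NVIDIA devices does the OpenACC runtime see on this node?
  ngpus = acc_get_num_devices(acc_device_nvidia)

  ! Round-robin the ranks over the visible GPUs (0-based device numbers)
  mydev = mod(myrank, ngpus)
  call acc_set_device_num(mydev, acc_device_nvidia)

  ! ... OpenACC data and compute regions placed after this point
  !     will use the selected device ...

  call MPI_finalize(error)
end program assign_gpu_per_rank
For multi-node jobs where ranks aren't distributed evenly, a node-local rank (for example, one obtained via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED) is a more robust index than the global rank.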