Things go wrong after a "host_data use_device" section

Hi all,

I’m building a small test case to investigate (understand?) MPI communications from GPU to GPU, and I do not understand why it fails.
I launch 2 processes. They:

  • initialize an array on the GPU (lines 19 to 25)
  • offload a scalar on the GPU to store the max value (line 29)
  • calculate the max value on the GPU and put it in this offloaded scalar (lines 31 to 37)
  • run an MPI_Allreduce call on this scalar to compute a global maximum value, using a “host_data” section (lines 43 to 45)

After leaving this host_data region, I cannot update the scalar on the host (line 51, just to print it): it segfaults.
The only thing that works is an exit data directive on this scalar before the print (line 52).

If I remove the host_data section and the MPI_Allreduce call (lines 42 to 45), the update directive works (but then of course the value is not reduced on the GPU as I need).

Any idea of what I am doing wrong? The program is attached.

I’m using nvhpc-openmpi3/24.1 or nvhpc/24.1

mpifort -acc main.f90 -o main
mpirun --mca btl ^openib -n 2 ./main

Running on my laptop with only one T600 GPU.

Thanks for any advice.

main.f90.txt (1.8 KB)

I think you’ve stumbled onto what appears to be a subtle bug here. I’ll check with a few colleagues, and if it is a bug, I’ll report it to our engineering team. It has some very bug-like behaviors, both those you’ve seen and some I stumbled onto myself: if I add a print of “localmaxi” on the GPU right before the reduction, I can get the update to work properly. More interestingly for you, the following appears to work properly and gives the right answer:

program mpi_on_gpu

implicit none

include "mpif.h"

integer :: myrank, allranks, error, i, j, localmaxi, maxi
integer, parameter :: N=1000
integer, dimension(N,N) :: tab

call MPI_init(error)
call MPI_comm_size(MPI_COMM_WORLD, allranks, error)
call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)
write(6,'(2(a,i0))') "Hello from ",myrank," over ",allranks

tab(:,:) = 0
localmaxi = 0

! Initialization of the arrays only on GPU, keep 0 on host
!$acc enter data create(tab) copyin(localmaxi)
!$acc parallel loop collapse(2) present(tab)
do j=1,N
   do i=1,N
      tab(i,j) = (i+j)*(myrank+1)
   end do
end do

! I need the variable on the GPU to calculate the max of the array

! Run the reduction on the GPU
!$acc parallel loop collapse(2) present(tab,localmaxi) reduction(max:localmaxi)
do j=1,N
   do i=1,N
      if (localmaxi .lt. tab(i,j)) localmaxi = tab(i,j)
   end do
end do

! Check that on the Host the variable is still 0
write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host before reduce is ",localmaxi," (should be 0)"

! Now do a reduction between GPU values (so on GPU only). Each process should have its max on the GPU
!$acc host_data use_device (localmaxi)
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, error)
!$acc end host_data

! Check that on the Host the variable is still 0
write(6,'(2(a,i0),a)') "On ",myrank," the maximum on the host after reduce is ",localmaxi," (should be 0)"

! Bring back the information on the host (but keep it also on GPU)
!$acc update self(localmaxi)
! $acc exit data copyout(localmaxi)
write(6,'(3(a,i0),a)') "On ",myrank," the maximum on the host after update is ",localmaxi,"(should be ",(N+N)*allranks,")"

call MPI_finalize(error)

end program mpi_on_gpu

The only difference here is that when you’re creating the tab array on the GPU, I also put localmaxi there as well (on top of initializing it earlier). With that change the run didn’t segfault, and I got the following result (I also added some text to the later outputs to clarify exactly where they occur in the code):

Hello from 0 over 2
Hello from 1 over 2
On 0 the maximum on the host before reduce is 0 (should be 0)
On 1 the maximum on the host before reduce is 0 (should be 0)
On 0 the maximum on the host after reduce is 0 (should be 0)
On 0 the maximum on the host after update is 4000(should be 4000)
On 1 the maximum on the host after reduce is 0 (should be 0)
On 1 the maximum on the host after update is 4000(should be 4000)

Also please note that to properly use multiple GPUs in an MPI run, you’d likely need to call ‘acc_set_device_num’ so that each MPI rank is assigned to a unique GPU.
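For reference, here’s a minimal sketch of that binding, assuming the NVHPC OpenACC runtime; the modulo mapping on the world rank is only illustrative (in a multi-node job you’d typically derive a node-local rank first), and the program name is made up:

program bind_rank_to_gpu

use mpi
use openacc

implicit none

integer :: myrank, error, ngpus

call MPI_init(error)
call MPI_comm_rank(MPI_COMM_WORLD, myrank, error)

! Query how many NVIDIA devices this process can see and pick one per rank
! (NVHPC numbers OpenACC devices from 0, matching the CUDA device IDs)
ngpus = acc_get_num_devices(acc_device_nvidia)
if (ngpus > 0) call acc_set_device_num(mod(myrank, ngpus), acc_device_nvidia)

call MPI_finalize(error)

end program bind_rank_to_gpu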

Hi Patrick,

What error are you seeing?

When I run the code, I’m seeing an intermittent seg fault that occurs with or without the host_data directive, due to the missing “comm” argument to MPI_Allreduce. Since you’re using the F77 MPI header file, the arguments aren’t checked; if you switch to the F90 “use mpi” module, you’ll get a syntax error.

To fix:

!call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, error)
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD, error)
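For reference, a minimal sketch of the module-based variant (assuming the Open MPI that ships with NVHPC); with the explicit interfaces from “use mpi”, dropping the communicator becomes a compile-time error rather than a run-time crash. I’ve used MPI_INTEGER, the Fortran-side datatype, and the program name is just for illustration:

program allreduce_check

use mpi   ! explicit interfaces, unlike include 'mpif.h'

implicit none

integer :: localmaxi, error

call MPI_init(error)
localmaxi = 0

! Full argument list: buffer, count, datatype, op, communicator, ierror.
! Omitting MPI_COMM_WORLD here is rejected at compile time.
call MPI_allreduce(MPI_IN_PLACE, localmaxi, 1, MPI_INTEGER, MPI_MAX, MPI_COMM_WORLD, error)

call MPI_finalize(error)

end program allreduce_check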

If you’re seeing something different, please let me know.

-Mat

Good catch, Mat! Strange that I was even able to get it to work without that. Maybe it’s something about the F77 MPI header.

Hi Mat, hi Scamp1,
I am so ashamed of this mistake! Focusing on the OpenACC directives to understand what could be wrong, I didn’t check my MPI call. Really sorry for the noise.

I was using the mpi module but at compile time I was getting the error:

NVFORTRAN-S-0155-Could not resolve generic procedure mpi_allreduce

and I thought it was related to the device status of the data argument, not to a missing communicator argument.

Really sorry

Patrick

Hi,

Seems everything has been resolved, but in any case, you may be interested in this code that does GPU MPI Fortran tests:

– Ron