Multi-GPU issue on triple ijk loops

Hi everyone,

I have a Fortran 90 MPI code that I compile with mpif90 under NVHPC 23.1. The first thing I do is the following:

integer :: LOCAL_COMM
integer :: num_gpus, my_gpu
integer(kind=acc_device_kind) :: device_type

call MPI_INIT(ierr)
call mpi_comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, LOCAL_COMM, ierr)
call MPI_COMM_RANK(LOCAL_COMM, my_id, ierr )
call MPI_COMM_SIZE(MPI_COMM_WORLD, num_proc, ierr)

! OpenACC set device
if (my_id == 0) print*, "Using Multi-GPU OpenACC"

device_type = acc_get_device_type()
num_gpus = acc_get_num_devices(device_type)

call acc_set_device_num(my_id, device_type)

my_gpu = acc_get_device_num(device_type)

write(*,*) "CPU Rank ",my_id,": using GPU",my_gpu," of type ",device_type

Either way, the point is to ensure a one-to-one mapping between MPI ranks and GPUs. This is done partly in the main code (main.f90) and partly in another subroutine used to initialize MPI. After that, in another .f90 file, I do the following:

SUBROUTINE kinetics
USE global_mod, ONLY: TFlame, NsMAX, num_zones, zones, MINi, MAXi, MINj, MAXj, MINk, MAXk
USE common_alloc
USE openacc
USE common_mpi, ONLY: my_gpu

INTEGER, VALUE :: B, i, j, k, l, ll, s, icomp
!$acc declare create(B, l, ll, s, icomp)
DOUBLE PRECISION, VALUE :: OM(NUMSP), Yi_ijk(NsMAX), Xir(NsMAX)
!$acc declare create(OM(:), Yi_ijk(:), Xir(:))
DOUBLE PRECISION, VALUE :: T_ijk, p_ijk, dens, Rgst,S_y, Wmix_ijk, D_hp,rho_ijk
!$acc declare create(T_ijk, p_ijk, dens, Rgst,S_y, Wmix_ijk, D_hp,rho_ijk)
DOUBLE PRECISION, VALUE :: VREAZ(1000)
!$acc declare create(VREAZ(:))
INTEGER, VALUE :: NsMAT
INTEGER, VALUE :: iR, CC, NUMDICOMP, ISPECIE
INTEGER, VALUE :: NUMDICOMPF, NUMDICOMPB, REV, TERZOCORPO
!$acc declare create(NsMAT, iR, CC, NUMDICOMP, ISPECIE, NUMDICOMPF, NUMDICOMPB, REV, TERZOCORPO)
DOUBLE PRECISION, VALUE :: Kchem, Kchem_B, PRODF, PRODB, KEQ
!$acc declare create(Kchem, Kchem_B, PRODF, PRODB, KEQ)
DOUBLE PRECISION, VALUE :: NIF, NIB, NII!, omF(NUMSP),omB(NUMSP)
!$acc declare create(NIF, NIB, NII)
DOUBLE PRECISION, VALUE :: SOMMAH,SOMMAS,TEM
DOUBLE PRECISION, VALUE :: app
!DOUBLE PRECISION, POINTER, CONTIGUOUS :: A_app(:,:)
DOUBLE PRECISION, VALUE :: Y(NsMAX)
!$acc declare create(SOMMAH,SOMMAS,TEM,app,Y(:))
DOUBLE PRECISION, VALUE :: tmpsum,tmpsum2
!$acc declare create(tmpsum,tmpsum2)

!$acc declare copyin(my_gpu)


!$acc update device(TFlame,NUMREAZ,NUMDISPREAZF,NUMDISPREAZB,FLAGREAZ,FLAGM,TAB1F,TAB3F,TAB1B,TAB3B,FORD3,MASS,RORD3,DHF,DSF,REACT_PLOG,TAB2,TAB2M,TROE,SRI,NUMDISPM,NUMSP,ALFAM)
!$acc parallel loop gang vector collapse(3) &
!$acc& private(i,j,k,s,iR,cc,dens,icomp,T_ijk,p_ijk,NsMAT,NUMDICOMPF,NUMDICOMPB,REV,TERZOCORPO,PRODF,PRODB,NUMDICOMP,ISPECIE,NIF,NIB,NII,tmpsum,tmpsum2) &
!$acc& reduction(+:S_y,tmpsum,tmpsum2) reduction(*:Yi_ijk(:),PRODF,PRODB,Kchem) &
!$acc& copyout(omega) copyin(p,T,NsMAX,Yi(:,:,:,:),num_zones,zones,MINi,MAXi,MINj,MAXj,MINk,MAXk,BBB,rho,my_gpu) 
 do k= MINk(BBB)-(Ghost-1), MAXk(BBB)+(Ghost-1)
  do j= MINj(BBB)-(Ghost-1), MAXj(BBB)+(Ghost-1)
   do i= MINi(BBB)-(Ghost-1), MAXi(BBB)+(Ghost-1)

          if(i==3.and.j==34.and.k==1) then
             if(my_gpu==1) then
                  print*, "I am GPU #",my_gpu,i,j,k
             elseif (my_gpu==0) then
                  print*, "I am GPU #",my_gpu,i,j,k
             else
                  print*, "NO GPU"
             endif
           endif

       end do ! Loop over i
      end do ! Loop over j
     end do ! Loop over k
     !$acc update self(omega)
     print*, "EXITING"

Basically, I do nothing inside the loops except for the print. The issue is that when I run the program with

mpirun -n 2 ./my_program

I found that, for the same values of i, j, k (the node indices of a CFD computational grid), both GPUs are working on the same node. This is strange because each CPU (MPI task) should work on its own subset of i, j, k, with the rest handled by the other task, and the same partitioning should carry over to the GPUs. However, when I run the code it prints:

 I am GPU #            1            3           34            1 
 EXITING
 I am GPU #            0            3           34            1 
 EXITING

which means that both GPUs are computing the CFD solution at i=3, j=34, k=1.

Thanks everybody in advance for any help or suggestions on how to solve this problem.

To me, it looks like your multi-GPU assignment is working properly, at least. However, your loop bounds could be getting assigned improperly, i.e. k = MINk(BBB) - (Ghost-1) to MAXk(BBB) + (Ghost-1) might get different values in a CPU-only run vs. a GPU run.

To help track this down, first work out what values you expect in a CPU-only run with 2 MPI tasks: the MIN arrays evaluated at BBB, the actual BBB value, the actual Ghost value, and the MAX counterparts of the MIN values. That tells you what the lower and upper ends of i, j, and k should be in the loop for each task. Once you know what to expect, compare against what you actually get in the GPU run for all of those values. For completeness, I'd print the MIN/MAX arrays at BBB, BBB itself, and Ghost both in a CPU section (so we know what the CPU thinks those values are) and in a GPU parallel section (so we know what the GPU thinks they are) before the loop, and I'd also explicitly print the i, j, k values inside the loop. Since there are two MPI tasks, make sure your print statements show which task is writing.
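
Something along these lines would do it (just a sketch using the names from your snippet; it assumes MINi/MAXi/MINj/MAXj/MINk/MAXk already have device copies, e.g. through a declare create or an enclosing data region, so that default(present) reports what is actually resident on the GPU instead of silently copying the host values in; if they are not resident, the present check failing is itself useful information). I'm writing my_id for the MPI rank; use whatever variable holds the rank in that file.

! What the CPU thinks the bounds are (tag every line with the MPI rank)
print*, "rank", my_id, " host  : BBB =", BBB, " Ghost =", Ghost
print*, "rank", my_id, " host  : i =", MINi(BBB), MAXi(BBB), &
        " j =", MINj(BBB), MAXj(BBB), " k =", MINk(BBB), MAXk(BBB)

! What the GPU thinks the bounds are; default(present) uses the array copies
! already resident on the device rather than creating new ones
!$acc serial default(present) copyin(my_id, BBB)
print*, "rank", my_id, " device: i =", MINi(BBB), MAXi(BBB), &
        " j =", MINj(BBB), MAXj(BBB), " k =", MINk(BBB), MAXk(BBB)
!$acc end serial

Inside the triple loop you can keep the print you already have, but add the MPI rank next to my_gpu so it's unambiguous which task produced each line.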

Presuming you're correct and only one MPI task should evaluate the i=3, j=34, k=1 entry, then a disagreement between the values that assign the loop bounds in the CPU-only run and those in the GPU run is the likely culprit. In the GPU run, it could be that you have assigned the important values on the GPU and not updated the CPU copy, or vice versa.
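
If that turns out to be the case, an explicit update in the right direction after the bounds are (re)computed usually sorts it out; for example (again just a sketch, assuming those arrays have device copies):

! Bounds computed on the host: push them to the device before launching the kernel
!$acc update device(MINi, MAXi, MINj, MAXj, MINk, MAXk)

! Bounds computed in a device region: pull them back to the host
!$acc update self(MINi, MAXi, MINj, MAXj, MINk, MAXk)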

If that doesn't help resolve your issue, try paring your code down to a minimal reproducing example that I can take and play with.

Solved in the way you said! Thank you again!

-Matteo
