Loop inside of acc routine seq leads to incorrect results

I’m using the Linux version of pgf90 15.7. Print statements inside a device routine show that a do-loop inside it is not executed correctly: the print inside the routine appears only once instead of 5 times, and printing the loop iterator after “end do” shows that the iterator has not been initialized. I’m afraid my example isn’t minimal yet, but maybe you’ve seen this behavior before. I remember being forced to manually inline such routines in the past to get them to work. Is there anything I’m doing wrong, or is this a compiler bug that still needs to be fixed? Thanks in advance. Let me know if you need a standalone minimal example to reproduce it.

…code

subroutine rad_jma1206_zenith_run(nx, ny, ishrt, clat, clon, totalsec, sindel, cosdel, etime, zmean, ztemp)
  use openacc
  use cudafor
  use rad_const, only: timestep
  use rad_parm, only: dtrads
  use pp_vardef
  integer(4), intent(in) :: nx, ny
  integer(4), intent(in) :: ishrt
  real(r_size), intent(in) :: clat(nx,ny)
  real(r_size), intent(in) :: clon(nx,ny)
  real(r_size), intent(in) :: totalsec
  real(r_size), intent(in) :: sindel
  real(r_size), intent(in) :: cosdel
  real(r_size), intent(in) :: etime
  real(r_size), intent(inout) :: zmean(nx,ny)
  real(r_size), intent(inout) :: ztemp(nx,ny)
  real(8) :: hf_output_temp
  integer(4) :: i, j

!$acc kernels present(zmean) present(clon) present(ztemp) present(clat)
!$acc loop independent vector(16)
  do j=1,ny
!$acc loop independent vector(16)
   do i=1,nx
    if (ishrt > 0) then
      if (i == 1 .and. j == 1) then
        print *, "rad_jma1206_zenith_run print", ishrt, totalsec, sindel, cosdel, etime
      end if
      call rad_zenith_update_zmean(i, j, cpie, dtrads, timestep, pai12, pai432, hour_ini, clat(i,j), clon(i,j), totalsec, sindel, cosdel, &
      & etime, zmean(i,j))
    end if
    call rad_zenith_everystep(cpie, pai12, pai432, hour_ini, clat(i,j), clon(i,j), totalsec, sindel, cosdel, etime, ztemp(i,j))
   end do
  end do
!$acc end kernels
end subroutine rad_jma1206_zenith_run

!$acc routine seq
 subroutine rad_zenith_update_zmean(i, j, cpie, dtrads, timestep, pai12, pai432, hour_ini, clat, clon, totalsec, sindel, cosdel, etime, &
  & zmean)
  use openacc
  use cudafor
  use pp_vardef
  implicit none
  integer(4), intent(in) :: i, j
  real(r_size), intent(in) :: cpie, dtrads, timestep, pai12, pai432
  integer(4), intent(in) :: hour_ini
  real(r_size), intent(in) :: clat
  real(r_size), intent(in) :: clon
  real(r_size), intent(in) :: totalsec
  real(r_size), intent(in) :: sindel
  real(r_size), intent(in) :: cosdel
  real(r_size), intent(in) :: etime
  real(r_size), intent(out) :: zmean
  integer(4) :: kt0
  integer(4) :: nrdstp
  integer(4) :: nstp
  real(r_size) :: cosclt
  real(r_size) :: sinclt
  real(r_size) :: sumn
  real(r_size) :: sumcos
  real(r_size) :: sc
  real(r_size) :: cs
  real(r_size) :: ctime
  real(r_size) :: btime
  real(r_size) :: atime
  real(r_size) :: tcosz


  cosclt = sin(clat * cpie)
  sinclt = cos(clat * cpie)
  kt0 = nint(totalsec / dtrads + 0.01)
  nrdstp = int((dtrads * (kt0 + 1) - totalsec) / timestep - 0.001) + 1
  sumn = 0.d0
  sumcos = 0.d0
  ctime = etime + pai12 * (hour_ini - 12) + pai432 * totalsec
  do nstp = 1, nrdstp
   btime = pai432 * timestep * float(nstp - 1)
   sc = sindel * cosclt
   cs = cosdel * sinclt
   atime = ctime + clon * cpie + btime
   tcosz = sc + cs * cos(atime)
   if (tcosz > 0.01) then
     sumcos = sumcos + tcosz
     sumn = sumn + 1.0
   end if
  end do
  if (i == 1 .and. j == 1) then
    print *, "rad_zenith_update_zmean print", sumcos, sumn, ctime, nrdstp, sc, cs, nstp
  end if
  if (sumn >= 1.0) then
    zmean = max(0.01_r_size, sumcos / sumn)
  else
    zmean = 0.0
  end if
  return
end subroutine rad_zenith_update_zmean

…compiling

..........compiling rad_zenith.f90 in /home0/usr4/mueller-m-ab/physlib/hybrid/pp/build/gpu/src
pgf90 -g -O0 -Mchkptr -Mbounds -Kieee -Minfo=accel,inline,ipa -Mneginfo -Minform=inform -Mmpi=mpich -acc -Mcuda=6.5,cc3x -ta=tesla:cc3x,keepgpu,keepbin,time -Minline=levels:5,reshape -DGPU -DGPU -c rad_zenith.f90 -o rad_zenith.o
pgf90-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
rad_zenith_run:
    142, rad_jma1206_zenith_run inlined, size=50, file rad_zenith.f90 (161)
         142, Loop is parallelizable
              Generating present(..inline(:,:))
              Accelerator kernel generated
              Generating Tesla code
             142, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
                  !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         191, rad_zenith_update_zmean inlined, size=38, file rad_zenith.f90 (201)
              142, Accelerator restriction: induction variable live-out from loop: ..inline
                   Scalar last value needed after loop for sindel*e,cosdel*e at line 142
         194, rad_zenith_everystep inlined, size=9, file rad_zenith.f90 (262)
    146, Generating update host(clon(:1,:1),zmean(:1,:1),clat(:1,:1),ztemp(:1,:1))
rad_jma1206_zenith_run:
    182, Generating present(zmean(:,:),clon(:,:),ztemp(:,:),clat(:,:))
    184, Loop is parallelizable
    186, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        184, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
        186, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
    191, rad_zenith_update_zmean inlined, size=38, file rad_zenith.f90 (201)
         191, Accelerator restriction: induction variable live-out from loop: ..inline
              Scalar last value needed after loop for sindel*e,cosdel*e at line 191
    194, rad_zenith_everystep inlined, size=9, file rad_zenith.f90 (262)

…running

rad_jma1206_zenith_run print            1    0.000000000000000      
 -0.3913847319351452        0.9202271413124341       -1.3131663878269148E-002
 rad_zenith_update_zmean print    14357494.04123566         42190967.00000000      
  -3.154724317468062                 5  -0.2244890597855700      
  0.7538059440162953         103660017

Please note the last printed variable, nstp: after the loop it should hold a well-defined value (nrdstp + 1, i.e. 6, by Fortran's do-loop rules), but it looks uninitialized. The sums sumcos and sumn are implausibly large as well. On the CPU the same code works fine.

One more thing I tried: adding !$acc loop seq to the do-loop inside the routine. This changes the compiler output to

191, rad_zenith_update_zmean inlined, size=38, file rad_zenith.f90 (201)
191, Scalar last value needed after loop for sindel1,:1),cosdel1,:1) at line 191

… but the output is still the same.
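
For reference, a standalone reproducer for this pattern could be as small as the sketch below. It is hypothetical: the names repro_mod, count_steps, total, and last_iter are made up for illustration, and it has not been confirmed to trigger the problem with pgf90 15.7.

```fortran
module repro_mod
  implicit none
contains
  ! Hypothetical device routine: a sequential loop whose iterator
  ! is read after "end do", as in rad_zenith_update_zmean.
  subroutine count_steps(nsteps, total, last_iter)
    !$acc routine seq
    integer, intent(in) :: nsteps
    real(8), intent(out) :: total
    integer, intent(out) :: last_iter
    integer :: n

    total = 0.0d0
    do n = 1, nsteps
      total = total + 1.0d0
    end do
    ! After normal loop completion, n holds nsteps + 1.
    last_iter = n
  end subroutine count_steps
end module repro_mod

program repro
  use repro_mod
  implicit none
  integer, parameter :: nx = 4, ny = 4, nsteps = 5
  real(8) :: total(nx, ny)
  integer :: last_iter(nx, ny)
  integer :: i, j

  !$acc kernels copyout(total, last_iter)
  !$acc loop independent
  do j = 1, ny
    !$acc loop independent
    do i = 1, nx
      call count_steps(nsteps, total(i, j), last_iter(i, j))
    end do
  end do
  !$acc end kernels

  ! Compare device results against the values the loop must produce.
  print *, "total(1,1)     =", total(1, 1), " expected", real(nsteps, 8)
  print *, "last_iter(1,1) =", last_iter(1, 1), " expected", nsteps + 1
end program repro
```

A correct build should report total(1,1) = 5 and last_iter(1,1) = 6. Comparing a build with -Minline against one without it would show whether inlining changes the result.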

Hi Michel,

Can you try compiling without inlining? (i.e. remove -Mipa=inline).

142, Accelerator restriction: induction variable live-out from loop: ..inline

This might be the issue. We lose scoping information when inlining, so the compiler may be producing incorrect code by not privatizing the temp “..inline” variable (this is a local variable declared in rad_jma1206_zenith_run and renamed to a temp name to avoid name collisions).

Note that we are actively working on improving scoping and might have some improvements in 15.9 with further improvements later.

If it still fails without inlining, I may ask for a reproducer.

- Mat

Thanks Mat. By inlining it manually it does indeed work, so I can confirm that the problem must be somewhere in the inlining. Looking forward to these improvements.
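
The manual inlining that worked can be sketched roughly as follows. This is an illustrative reduction, not the actual rad_zenith code: acc_sum, total, and nsteps are placeholder names, and the explicit private clause is an assumption about how to make the scoping unambiguous rather than something confirmed in this thread.

```fortran
program manual_inline_sketch
  implicit none
  integer, parameter :: nx = 4, ny = 4, nsteps = 5
  real(8) :: total(nx, ny)
  real(8) :: acc_sum
  integer :: i, j, n

  !$acc kernels copyout(total)
  !$acc loop independent
  do j = 1, ny
    ! acc_sum is a per-iteration temporary, so it is listed as private
    ! explicitly instead of relying on the compiler's scoping analysis.
    !$acc loop independent private(acc_sum)
    do i = 1, nx
      acc_sum = 0.0d0
      ! The body of the former device routine, written out by hand.
      ! n is the iterator of a loop carrying a loop directive, which
      ! OpenACC treats as private to the executing thread.
      !$acc loop seq
      do n = 1, nsteps
        acc_sum = acc_sum + 1.0d0
      end do
      total(i, j) = acc_sum
    end do
  end do
  !$acc end kernels

  print *, "total(1,1) =", total(1, 1)
end program manual_inline_sketch
```

Writing the loop body directly into the kernels region keeps every temporary visible in the caller's scope, so the compiler does not have to reconstruct scoping for renamed inline temporaries.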