Reduction variables take wrong values inside a loop

Hi, the following code reproduces the problem I am facing. I want to get two reduction variables, nmax and ierr, out of a GPU-parallelized region. One of them, ierr, is an error flag checked inside the kernel region: it is set to 1 if i.gt.ndim, which should never happen in this case, and stays 0 otherwise.

program test
  implicit none

  integer,parameter:: ndim = 10
  integer:: i,ierr,nmax
  real(8):: arr(ndim)

  nmax= 0
  ierr = 0
!$acc data copy(nmax,ierr)
!$acc kernels
!$acc loop reduction(max:nmax,ierr)
  do i=1,ndim
     if( i.gt.ndim ) ierr = 1
     print *,'i,ierr,nmax=',i,ierr,nmax
     if( ierr.gt.0 ) cycle
     nmax = max(nmax,i)
  enddo
!$acc end kernels
!$acc end data
print *,'nmax=',nmax

end program test

I expected the final value of nmax to be 10, but it is 0. The ierr and nmax values printed inside the loop are also wrong, as follows.

 i,ierr,nmax=            1  -2147483648  -2147483648
 i,ierr,nmax=            2  -2147483648  -2147483648
 i,ierr,nmax=            3  -2147483648  -2147483648
 i,ierr,nmax=            4  -2147483648  -2147483648
 i,ierr,nmax=            5  -2147483648  -2147483648
 i,ierr,nmax=            6  -2147483648  -2147483648
 i,ierr,nmax=            7  -2147483648  -2147483648
 i,ierr,nmax=            8  -2147483648  -2147483648
 i,ierr,nmax=            9  -2147483648  -2147483648
 i,ierr,nmax=           10  -2147483648  -2147483648
 nmax=            0

What is wrong with the above code?

The compilation message:

$ nvfortran -acc -Minfo=accel test.F90
test:
     10, Generating copy(ierr,nmax) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating reduction(max:nmax,ierr)

Environment:
  • CentOS 7
  • Quadro RTX 5000
  • nvfortran 20.7-0 LLVM 64-bit target on x86-64 Linux -tp skylake
  • Driver Version: 450.57 CUDA Version: 11.0

Thanks in advance.

Hi kobayashi.ryo,

The core problem here is that you can’t use the intermediate values of a reduction variable within the loop itself. Each gang/vector gets its own private copy of the reduction variable, computes a partial reduction, and the partial results are combined in a final reduction at the end of the loop. If you really do need the intermediate value, you’ll need to switch to using an atomic capture, but the order of the updates will be non-deterministic.

Here’s a simplified working version. Given “i” will never be greater than “ndim”, I removed “ierr”. Though in your real code, will such functionality be necessary?

% cat test.f90
program test
  implicit none

  integer,parameter:: ndim = 10
  integer:: i,ierr,nmax
  real(8):: arr(ndim)

  nmax= 0
  ierr = 0
!$acc kernels
!$acc loop reduction(max:nmax)
  do i=1,ndim
     nmax = max(nmax,i)
  enddo
!$acc end kernels
print *,'nmax=',nmax

end program test
% nvfortran -acc -Minfo=accel test.f90
test:
     10, Generating implicit copy(nmax) [if not already present]
     12, Loop is parallelizable
         Generating NVIDIA GPU code
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating reduction(max:nmax)
% a.out
 nmax=           10
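
For completeness, the atomic capture alternative mentioned above might look roughly like the following. This is only a minimal sketch under assumptions, not a tested replacement for the real code: the variable name seen and the choice of a parallel loop with copy(nmax) are made up for illustration. Each iteration updates the shared nmax atomically and captures the value it observed right after its own update.

```fortran
program atomic_capture_sketch
  implicit none

  integer,parameter:: ndim = 10
  integer:: i,nmax,seen   ! "seen" is a hypothetical name for the captured value

  nmax = 0
  ! nmax is shared (copy), so it is updated atomically instead of reduced.
  ! seen is private: each iteration holds its own captured value.
!$acc parallel loop copy(nmax) private(seen)
  do i=1,ndim
     ! Atomically update the running maximum and capture the result.
!$acc atomic capture
     nmax = max(nmax,i)
     seen = nmax
!$acc end atomic
     ! The order of these lines is not deterministic on the GPU.
     print *,'i,running max=',i,seen
  enddo
  print *,'nmax=',nmax

end program atomic_capture_sketch
```

The final nmax should still be 10, but the running values printed inside the loop depend on the order in which the iterations happen to execute. Atomics also serialize the updates, so a reduction is normally the better choice when only the final value is needed.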

-Mat

Hi MatColgrove, thank you for your very quick reply. Your code works fine on my side as well. But ierr is necessary in my real code, and as soon as I add a line such as if( ... ) ierr = 1, the code no longer works as I expect. I thought this code should work (returning nmax=10) even if ierr is independent across worker/vector…
Could you provide an example with if(i.gt.ndim) ierr=1 and if( ierr.gt.0) cycle lines?

% cat test.F90
program test
  implicit none

  integer,parameter:: ndim = 10
  integer:: i,ierr,nmax
  real(8):: arr(ndim)

  nmax= 0
  ierr = 0
!$acc kernels
!$acc loop reduction(max:nmax,ierr)
  do i=1,ndim
     if( i.gt.ndim ) ierr = 1
     print *,'i,ierr,nmax=',i,ierr,nmax
     if( ierr.gt.0 ) cycle
     nmax = max(nmax,i)
  enddo
!$acc end kernels
print *,'nmax,ierr=',nmax,ierr

end program test
% nvfortran -acc -Minfo=accel -gpu=cc75 test.F90
test:
     10, Generating implicit copy(ierr,nmax) [if not already present]
     12, Loop is parallelizable
         Generating Tesla code
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating reduction(max:nmax,ierr)
% ./a.out
 i,ierr,nmax=            1  -2147483648  -2147483648
 i,ierr,nmax=            2  -2147483648  -2147483648
 i,ierr,nmax=            3  -2147483648  -2147483648
 i,ierr,nmax=            4  -2147483648  -2147483648
 i,ierr,nmax=            5  -2147483648  -2147483648
 i,ierr,nmax=            6  -2147483648  -2147483648
 i,ierr,nmax=            7  -2147483648  -2147483648
 i,ierr,nmax=            8  -2147483648  -2147483648
 i,ierr,nmax=            9  -2147483648  -2147483648
 i,ierr,nmax=           10  -2147483648  -2147483648
 nmax,ierr=            0            0

Try adding a local private variable to hold the error condition of each iteration, and a separate reduction variable to collect the maximum error code across all iterations.

Something like the following. Given “i” can’t be greater than “ndim”, I changed the error condition to be “i.gt.(ndim-2)” to show the code correctly detects the error.

% cat test.f90
program test
  implicit none

  integer,parameter:: ndim = 10
  integer:: i,ierr,nmax,ierr_max
  real(8):: arr(ndim)

  nmax= 0
  ierr = 0
  ierr_max = 0
!$acc kernels
!$acc loop independent reduction(max:nmax,ierr_max)
  do i=1,ndim
     ierr = 0
     if( i.gt.ndim-2 ) ierr = 1
     if( ierr.gt.0 ) then
        print *, "ERROR in iteration: ", i, ierr
        ierr_max = max(ierr_max,ierr)
        cycle
     endif
     nmax = max(nmax,i)
  enddo
!$acc end kernels
print *,'nmax,ierr=',nmax,ierr_max

end program test


% nvfortran -acc test.f90 -Minfo=accel ; a.out
test:
     11, Generating implicit copy(ierr_max,nmax) [if not already present]
     13, Loop is parallelizable
         Generating implicit private(ierr)
         Generating NVIDIA GPU code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating reduction(max:nmax,ierr_max)
 ERROR in iteration:             9            1
 ERROR in iteration:            10            1
 nmax,ierr=            8            1

Thank you so much. Now I understand a bit more about the behavior of local and reduction variables.