Reduction in Nested Loops

I am trying to understand how reductions work in nested loops.
Consider a very simple program that counts the total number of inner-loop iterations.

PROGRAM main
  integer N, i,j
  integer*8 ans1
  N = 10000
  ans1 = 0

  !$acc parallel copyin(N) reduction(+:ans1)
  do i = 1, N
    do j = 1, N
      ans1 = ans1 + 1
    enddo
  enddo
  !$acc end parallel

  write(*,*) 'ans1 = ', ans1

END PROGRAM main

(Q-1) Why does this give the correct answer (100000000), when the OpenACC manual says: “If a variable is involved in a reduction that spans multiple nested loops where two or more of those loops have associated loop directives, a reduction clause containing that variable must appear on each of those loop directives.”

Now consider a very small change that also counts the number of outer-loop iterations (by introducing the variable ans2).

PROGRAM main
  integer N, i,j
  integer*8 ans1,ans2
  N = 10000
  ans1 = 0
  ans2 = 0

  !$acc parallel copyin(N) reduction(+:ans1,ans2)
  do i = 1, N
    ans2 = ans2 + 1
    do j = 1, N
      ans1 = ans1 + 1
    enddo
  enddo
  !$acc end parallel

  write(*,*) 'ans1 = ', ans1
  write(*,*) 'ans2 = ', ans2

END PROGRAM main

(Q-2) Why does this give the wrong answer (ans1 = 780000, ans2 = 10000)? What is surprising is that ans1, which was correct earlier, has now gone wrong!

If I add a loop construct before the j-loop with a reduction on ans1, it works. That is fine, since it matches what the manual says.
But then Q-1 remains.

Could someone please clarify?

Thanks,
Arun

Correct. To comply with the standard, you should technically put the reduction clause on both loops, though the compiler can often add the reduction implicitly when its analysis has not been overridden by the user (i.e. “auto” is used when the user has not explicitly added a “loop” directive, or when using “kernels”).

Note that you aren’t using a “loop” directive here, so the reduction is applied to the parallel region rather than to a loop, and is therefore only applied to the gang loop.

Looking at the compiler feedback messages, the compiler is not parallelizing the outer loop and is only applying the reduction to the inner loop:

% nvfortran -fast -acc test.F90 -Minfo=accel -V20.5 ; a.out
main:
      7, Generating copyin(n) [if not already present]
         Generating implicit copy(ans1) [if not already present]
         Generating Tesla code
          8, !$acc loop seq
          9, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:ans1)
      8, Loop is parallelizable
      9, Loop is parallelizable
 ans1 =                 100000000
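
For comparison, here is a minimal sketch of the first example written so that it does not depend on the compiler's implicit analysis: both loops get an explicit “loop” directive, and the reduction clause is repeated on each. The gang/vector schedule is my assumption for illustration (the feedback above only shows what the compiler chose on its own), and I have dropped copyin(N) since the scalar is handled implicitly.

```fortran
program main
  implicit none
  integer :: N, i, j
  integer(8) :: ans1

  N = 10000
  ans1 = 0

  ! Outer loop: explicit loop directive, reduction across gangs.
  ! (gang/vector split is an assumed schedule, not taken from the post)
  !$acc parallel loop gang reduction(+:ans1)
  do i = 1, N
    ! Inner loop: the same variable must appear in a reduction here too,
    ! because the accumulation spans both loop directives.
    !$acc loop vector reduction(+:ans1)
    do j = 1, N
      ans1 = ans1 + 1
    enddo
  enddo

  write(*,*) 'ans1 = ', ans1
end program main
```

With both loops annotated, the count should come out as N*N (100000000 for N = 10000), and the -Minfo=accel output should report the outer loop as a gang loop rather than “loop seq”.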

As of release 20.1, the second example gets the expected answer. However, without the “loop” directive, the code still relies on the compiler's analysis to apply the loop schedules, so it only parallelizes the inner loop and applies an implicit reduction:

% pgfortran -ta=tesla -Minfo=accel test2.F90 -V20.1 ; a.out
main:
      8, Generating copyin(n) [if not already present]
         Generating implicit copy(ans2) [if not already present]
         Generating Tesla code
          9, !$acc loop seq
             Generating reduction(+:ans1,ans2)
         11, !$acc loop vector(128) ! threadidx%x
             Generating implicit reduction(+:ans1)
      8, Generating implicit copy(ans1) [if not already present]
      9, Loop is parallelizable
     11, Loop is parallelizable
 ans1 =                 100000000
 ans2 =                     10000
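
To get the outer loop parallelized as well, and to avoid relying on the implicit reduction, the second example could be sketched along these lines. This is only an illustration under my own assumptions about the schedule (gang on the i-loop, vector on the j-loop). The key points are that ans1 appears in a reduction clause on both loop directives, while ans2, which is only updated in the outer loop, needs it only there.

```fortran
program main
  implicit none
  integer :: N, i, j
  integer(8) :: ans1, ans2

  N = 10000
  ans1 = 0
  ans2 = 0

  ! Outer loop: both counters are reduced across gangs.
  ! (gang/vector split is an assumed schedule, not taken from the post)
  !$acc parallel loop gang reduction(+:ans1,ans2)
  do i = 1, N
    ans2 = ans2 + 1
    ! Inner loop: only ans1 is accumulated here, so only ans1 is listed.
    !$acc loop vector reduction(+:ans1)
    do j = 1, N
      ans1 = ans1 + 1
    enddo
  enddo

  write(*,*) 'ans1 = ', ans1
  write(*,*) 'ans2 = ', ans2
end program main
```

Written this way, the result should no longer depend on which compiler release you use or on what its auto-analysis decides: ans1 should be 100000000 and ans2 should be 10000.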