Reduction in Nested Loops contd

I am trying to understand the reduction in nested loops.
This is a follow-up question to: Reduction in Nested Loops

The OpenACC manual says: “If a variable is involved in a reduction that spans multiple nested loops where two or more of those loops have associated loop directives, a reduction clause containing that variable must appear on each of those loop directives.”

Please find a simple program below:

  • ans1 counts the total iterations of inner loop. (100000000)
  • ans2 counts the total iterations of outer loop. (10000)

Now, this program gives a different (wrong) answer on every run!

After the program, I list 3 changes, each of which gives the correct answer (but I am not able to understand WHY).

PROGRAM main
integer N, i,j
integer*8 ans1,ans2
N = 10000
ans1 = 0
ans2 = 0

!$acc parallel copyin(N) copy(ans1,ans2)

!$acc loop reduction(+:ans1,ans2)
    do i = 1, N
        ans2 = ans2 + 1

    !$acc loop reduction(+:ans1)
        do j = 1, N
            ans1 = ans1 + 1
        enddo
    !$acc end loop

    enddo
!$acc end loop

!$acc end parallel

write(*,*) 'ans1 = ', ans1
write(*,*) 'ans2 = ', ans2

END PROGRAM main

If I make any one of these changes, I get the correct answer:
(1) Replicate the 1st reduction (+:ans1,ans2) on the parallel-construct ALSO.
(2) Merge the 1st loop+reduction with the parallel-construct.
(3) Remove the initializations: ans1=0, ans2=0, AND make them copyout (instead of copy).
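
For concreteness, change (2) can be sketched as follows. This is only an illustration of merging the outer loop directive into the parallel construct; it assumes the rest of the program stays as posted above, and the `!$acc end loop` lines are left out of this sketch.

```fortran
PROGRAM main
integer N, i, j
integer*8 ans1, ans2
N = 10000
ans1 = 0
ans2 = 0

! Change (2): the outer loop directive and its reduction are merged
! into the parallel construct as a combined "parallel loop".
!$acc parallel loop copyin(N) copy(ans1,ans2) reduction(+:ans1,ans2)
    do i = 1, N
        ans2 = ans2 + 1

        ! The inner loop keeps its own reduction clause for ans1.
        !$acc loop reduction(+:ans1)
        do j = 1, N
            ans1 = ans1 + 1
        enddo

    enddo
!$acc end parallel loop

write(*,*) 'ans1 = ', ans1
write(*,*) 'ans2 = ', ans2

END PROGRAM main
```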

My question is:
(Q-1) Why is the original program NOT working?
(Q-2) Why is any of the (1),(2),(3) giving correct output?

Please help.

Thanks,
Arun

Hi Arun,

As with your previous post, the test does get correct answers with 20.1 or later:

% pgfortran -ta=tesla -Minfo=accel test3.F90 -V20.1 ; a.out
main:
      8, Generating copy(ans2) [if not already present]
         Generating copyin(n) [if not already present]
         Generating copy(ans1) [if not already present]
         Generating Tesla code
         10, !$acc loop seq
             Generating reduction(+:ans2,ans1)
         13, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             Generating reduction(+:ans1)
     11, Accelerator restriction: induction variable live-out from loop: ans2
 ans1 =                 100000000
 ans2 =                     10000

Though as before, since you haven’t explicitly set the loop schedules, the compiler implicitly makes the outer loop seq and the inner loop gang vector. The compiler typically does a good job of finding the optimal schedule, but not in this particular case. Hence, I would suggest explicitly setting the schedule to gang on the outer loop and vector on the inner loop. The change also helps the 19.10 compiler get the correct answer.

% diff -u test3.org.F90 test3.F90
--- test3.org.F90       2020-07-16 10:48:57.829103000 -0700
+++ test3.F90   2020-07-16 10:49:09.280638000 -0700
@@ -6,10 +6,10 @@
 ans2 = 0

 !$acc parallel copyin(N) copy(ans1,ans2)
-!$acc loop reduction(+:ans1,ans2)
+!$acc loop gang reduction(+:ans1,ans2)
     do i = 1, N
         ans2 = ans2 + 1
-    !$acc loop reduction(+:ans1)
+    !$acc loop vector reduction(+:ans1)
         do j = 1, N
             ans1 = ans1 + 1
         enddo
% pgfortran -ta=tesla -Minfo=accel test3.F90 -V19.10 ; a.out
main:
      8, Generating copy(ans2) [if not already present]
         Generating copyin(n) [if not already present]
         Generating copy(ans1) [if not already present]
         Generating Tesla code
         10, !$acc loop gang ! blockidx%x
             Generating reduction(+:ans2,ans1)
         13, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:ans1)
     13, Loop is parallelizable
 ans1 =                 100000000
 ans2 =                     10000
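
Applied to the originally posted program, the diff above gives roughly the following source. This is a sketch reconstructed from the diff rather than the exact test3.F90 file, and the `!$acc end loop` lines from the original post are omitted here.

```fortran
PROGRAM main
integer N, i, j
integer*8 ans1, ans2
N = 10000
ans1 = 0
ans2 = 0

!$acc parallel copyin(N) copy(ans1,ans2)

! Outer loop: explicitly scheduled across gangs.
!$acc loop gang reduction(+:ans1,ans2)
    do i = 1, N
        ans2 = ans2 + 1

        ! Inner loop: explicitly scheduled across vector lanes.
        !$acc loop vector reduction(+:ans1)
        do j = 1, N
            ans1 = ans1 + 1
        enddo

    enddo

!$acc end parallel

write(*,*) 'ans1 = ', ans1
write(*,*) 'ans2 = ', ans2

END PROGRAM main
```

With the schedules spelled out, the -Minfo messages should report gang on the outer loop and vector on the inner loop, as in the 19.10 output above.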

Hope this helps,
Mat

Thank you Mat for both the replies related to Reduction in nested loops.
Yes, I am using PGI-19.10.

A few questions:
(Q1) I see that the latest 20.1 is only available as a professional version. Any idea when the Community Edition will be released?

(Q2) I see that you are also using nvfortran. Is this the same as pgfortran, or was it developed independently? Would you recommend it (instead of pgfortran)?

(Q3) Finally, would CUDA programming be better, given the OpenACC standard related issues?

arun

PGI 20.4 was the last PGI branded release and no additional Community Editions will be made available. PGI has been re-branded as the NVIDIA HPC Compiler and is part of NVIDIA’s HPC SDK (https://developer.nvidia.com/hpc-sdk). “nvfortran” is the re-branded “pgfortran”.

The NVIDIA HPC SDK is available at no cost for all releases but does require registration as an NVIDIA developer. Version 20.5 is currently available as an early access release, so you’ll need to apply for access, but the application usually only takes a few days for approval. The first official release will be available in a few weeks.

Similar to the PGI Community Edition, we plan on making a release available without registration, twice a year.

(Q3) Finally, would CUDA programming be better, given the OpenACC standard related issues?

I’ll respectfully disagree that there’s a standard issue here. But to answer your question, no, I don’t believe moving to CUDA would help in this case. Writing efficient reduction code in CUDA can be quite cumbersome while OpenACC can implicitly generate optimized reductions for you.

We actually created the CUF kernels directive in CUDA Fortran, which is similar to OpenACC, mostly because writing reductions in CUDA is such a pain. Much easier to let the compiler do it for you.
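
As an illustration of the CUF kernels directive mentioned above, here is a minimal sketch of a sum reduction in CUDA Fortran. The program name `cuf_sum`, the device array `a_d`, and the size `n` are made up for this example, and it assumes the file is compiled as CUDA Fortran (for example with a `.cuf` suffix or nvfortran's `-cuda` flag).

```fortran
program cuf_sum
  use cudafor
  implicit none
  integer, parameter :: n = 10000   ! hypothetical problem size
  integer, device :: a_d(n)         ! hypothetical array in device memory
  integer(8) :: total
  integer :: i

  a_d = 1      ! host-side assignment that fills the device array
  total = 0

  ! The compiler generates the kernel from this loop; <<<*,*>>> lets it
  ! pick the launch configuration. The scalar accumulation into total is
  ! expected to be recognized as a reduction, so no hand-written
  ! reduction code is needed.
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
     total = total + a_d(i)
  end do

  write(*,*) 'total = ', total
end program cuf_sum
```

The loop body looks the same as the serial version, which is the main appeal compared with writing a reduction kernel by hand.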

Thank you Mat.
I will check out NVIDIA’s HPC SDK.