reduction clause

I wonder whether a parallel reduction loop was generated…

Here is part of my code:

 655 !$acc loop independent gang vector(16)
 656    do i=its,ite
 657 !$acc loop independent gang vector(16)
 658    do j=jts,jte
 659          LT = 0
 660 !$acc loop reduction(ior:LT)
 661          do K=KTS,KTE
 662                LT_ = 1
...
1764  200 CONTINUE
1765          LT = IOR(LT,LT_)
1766         END DO
1767         LTRUE(j,i) = LT
1768    end do
1769    end do

With line 660 commented out, PGI reports:

    656, Loop is parallelizable
    658, Loop is parallelizable
         Accelerator kernel generated
        656, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
        658, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
    661, Scalar last value needed after loop for 'lt' at line 1767
         Accelerator restriction: scalar variable live-out from loop: lt
         Inner sequential loop scheduled on accelerator

That’s understandable.

With line 660 uncommented, PGI reports:

    656, Loop is parallelizable
    658, Loop is parallelizable
         Accelerator kernel generated
        656, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
        658, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
    661, Loop is parallelizable

No other information about the nested loop is reported.

Well, the result is correct, but:

  • the kernel’s execution time is unchanged
  • PGI_ACC_DEBUG shows that no reduction is present

  Function 0 = 0x2a1f040 = morr_two_moment_micro_658_gpu
            658 = lineno
        16x16x1 = block size
        -1x-2x1 = grid size
          1x1x1 = unroll
              0 = shared memory
              0 = reduction shared memory
              0 = reduction arg
              0 = reduction bytes
            840 = argument bytes
              0 = max argument bytes
              2 = size arguments

So why did PGI fail to generate the reduction operation? Are there any restrictions?

Alexey

Hi Alexey,

In order to perform the inner loop reduction, the loop needs to be scheduled using a “vector”, but you’ve used “vector” on the outer two loops. Since the inner loop isn’t tightly nested with the outer two, there’s no way to apply an additional “vector” schedule. What happens if you use:

 655 !$acc loop independent gang worker collapse(2)
 656    do i=its,ite 
 657 
 658    do j=jts,jte 
 659          LT = 0 
 660 !$acc loop vector reduction(ior:LT) 
 661          do K=KTS,KTE
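
Filling in the rest of the nest from your original code, that schedule would look roughly like this (just a sketch; the body of the “k” loop is elided, as in your excerpt):

 !$acc loop independent gang worker collapse(2)
    do i=its,ite
    do j=jts,jte
          LT = 0
 !$acc loop vector reduction(ior:LT)
          do K=KTS,KTE
                LT_ = 1
 ! ... original computation of LT_ ...
                LT = IOR(LT,LT_)
          END DO
          LTRUE(j,i) = LT
    end do
    end do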

This schedule will be better in cases where the “i” and “j” loops have short trip counts. However, if “i” and “j” are long, then the tiled gang vector schedule may be better.

The “k” loop should be considered as well. If it has a short trip count or doesn’t do much computation, it’s better to let it run sequentially. Though if most of the computation is in this loop (which, given the line numbers, seems to be true), you might be better off collapsing the outer two loops with just a gang schedule so that the inner vector loop can use a long vector length (256 or 512).
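
As a rough sketch, that variant would be something along these lines (the vector length of 256 is just an example value):

 !$acc loop independent gang collapse(2)
    do i=its,ite
    do j=jts,jte
          LT = 0
 !$acc loop vector(256) reduction(ior:LT)
          do K=KTS,KTE
 ! ... original computation of LT_ ...
                LT = IOR(LT,LT_)
          end do
          LTRUE(j,i) = LT
    end do
    end do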

If the “LTRUE” assignment is the only computation in the “j” loop, you might consider adding a three-dimensional LTRUE temp array so that the “k” loop can be made tightly nested, then adding a second set of loops to perform the 3D-to-2D reduction. This uses more memory, but will most likely give the best overall performance.
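
A rough sketch of that two-pass approach, where “LT3D” is a hypothetical temporary dimensioned (KTS:KTE,jts:jte,its:ite):

 ! Pass 1: tightly nested and fully parallel; store the per-k flags
 ! (LT3D is a hypothetical 3D temporary)
 !$acc loop independent gang vector collapse(3)
    do i=its,ite
    do j=jts,jte
    do K=KTS,KTE
 ! ... original computation of LT_ ...
          LT3D(K,j,i) = LT_
    end do
    end do
    end do

 ! Pass 2: reduce away the k dimension
 !$acc loop independent gang collapse(2)
    do i=its,ite
    do j=jts,jte
          LT = 0
 !$acc loop vector reduction(ior:LT)
          do K=KTS,KTE
                LT = IOR(LT,LT3D(K,j,i))
          end do
          LTRUE(j,i) = LT
    end do
    end do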

Also, you’ll need to look at your arrays’ data access patterns. You want to make sure the stride-1 dimension is indexed by a loop scheduled as a worker or vector.
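
For example, since Fortran arrays are column-major, the first subscript is the stride-1 one. In a hypothetical access like “A(K,j,i)”, you’d want the “K” loop scheduled as vector so consecutive threads touch contiguous memory:

 !$acc loop independent gang collapse(2)
    do i=its,ite
    do j=jts,jte
 !$acc loop vector
          do K=KTS,KTE
 ! K is the stride-1 subscript here (A is a hypothetical array),
 ! so the vector lanes access contiguous memory
                A(K,j,i) = 2.0*A(K,j,i)
          end do
    end do
    end do

If instead your arrays are dimensioned with “i” first, e.g. “A(i,j,K)”, then keeping “i” on a vector schedule, as in your original code, is what gives coalesced access.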

Granted, these are all just suggestions. You’ll need to experiment to see which schedule works best for this compute region.

  • Mat

Thank you, Mat!