Clarification of reduction variables with implicit copy

Hi,

I noticed in the changelog of PGI 20.1:

Changed copy behavior for OpenACC reductions to adhere to the OpenACC specification. The compilers will now follow the defined behavior for copy and not copy the reduction variable to the device if it is already present. To enable the compiler’s previous behavior, use an update device/host if_present directive.

I was hoping for some clarification on this.

If I have “sum=0” followed by a reduction loop with OpenACC on it, as far as I understood, “sum” with its 0 value is copied to the device, and the resulting sum is stored in “sum” and is available on the CPU.
On the second call, “sum” is again copied to the device with a “0” value.

When I compile my code I see:

Generating implicit copy(sum1) [if not already present]

Does this mean that “sum1” is copied to the device with 0-value the first time, but on the second call it retains its previous final value and is not reset to 0 on the device?

If so, this is a major change and will break all my codes unless I update all collectives to manage the “sum” scalars and update them manually.

Is this correct?

I had thought that scalars are treated differently from arrays when it comes to copy, update, etc., in that they are all handled implicitly to do what the programmer expects (i.e., one does not have to add every scalar to GPU memory manually, copy it, reset it in a kernel, etc.).
Is this no longer true in general or just for collectives?

  • Ron

Update:

It seems the code is working correctly.

Does this mean an implicit copy of the reduction variable is taken off the device at the end of the ACC loop?

Why is the update suggestion in the documentation needed then?

  • Ron

Hi Ron,

The change only applies when the reduction variable is already present on the device. So if you had this situation:

% cat test.f90

program foo
   integer :: sum, i
   sum = 0
!$acc enter data copyin(sum)
   sum = 1
!$acc parallel loop reduction(+:sum)
   do i=1,1000
      sum = sum+1
   enddo
   print *, sum
!$acc exit data copyout(sum)
   print *, sum
end program foo

% pgfortran -ta=tesla test.f90 -V19.10; a.out
         1001
         1001
% pgfortran -ta=tesla test.f90 -V20.1 ; a.out
            1
         1000

With 20.1, “sum” is no longer copied if it’s already present. If it’s not present, there’s no change in behavior.
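For the already-present case, the changelog suggests an update directive with if_present to get the previous behavior back. Here is a minimal sketch of what that might look like for the example above. The program name foo_update and the exact placement of the two update directives are assumptions for illustration, not something confirmed against a particular release:

```fortran
program foo_update
   implicit none
   integer :: sum, i
   sum = 0
!$acc enter data copyin(sum)
   sum = 1
   ! Push the current host value to the device copy, if one exists.
!$acc update device(sum) if_present
!$acc parallel loop reduction(+:sum)
   do i=1,1000
      sum = sum+1
   enddo
   ! Pull the reduced value back to the host, if a device copy exists.
!$acc update self(sum) if_present
   print *, sum
!$acc exit data copyout(sum)
   print *, sum
end program foo_update
```

The intent is that the host and device copies of “sum” agree before and after the loop, as they did when the compiler copied the reduction variable unconditionally.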

Hope this helps,
Mat

Hi,

In your example, you used an “enter data” unstructured data clause which keeps the data on the device globally/forever.

If one uses a simple “copy” on the loop itself (or implicit copy from the compiler) are you saying that the “sum” variable gets removed from the device after the loop so on next call it is copied again (and hence, no change in behavior from before)?

  • Ron

Hi Ron,

“you used an ‘enter data’ unstructured data clause which keeps the data on the device globally/forever.”

Not quite. The data on the device is deallocated at the “exit data” directive which I have after the loop.

“If one uses a simple ‘copy’ on the loop itself (or implicit copy from the compiler) are you saying that the ‘sum’ variable gets removed from the device after the loop so on next call it is copied again (and hence, no change in behavior from before)?”

The scope of a copy(sum) is the loop, so “sum” would be removed from the device at the end of the loop and copied again on the next loop. There’s no change in behavior.
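As an illustration of that case, here is a small sketch in which the reduction scalar is never placed on the device ahead of time. The names repeat_reduction, count_to and total are made up for this example. Each call resets the scalar on the host, and the implicit copy on the loop carries that value to the device and the result back:

```fortran
program repeat_reduction
   implicit none
   integer :: pass, res
   do pass = 1, 2
      call count_to(1000, res)
      print *, res
   enddo
contains
   subroutine count_to(n, res)
      integer, intent(in)  :: n
      integer, intent(out) :: res
      integer :: total, i
      ! Reset on the host before every call.
      total = 0
      ! "total" is not present on the device here, so the compiler
      ! generates an implicit copy whose scope is this loop only.
!$acc parallel loop reduction(+:total)
      do i = 1, n
         total = total + 1
      enddo
      res = total
   end subroutine count_to
end program repeat_reduction
```

Because no enter data or enclosing data region mentions “total”, nothing stays on the device between calls, which is the situation where 19.10 and 20.1 should behave the same.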

-Mat

Hi,

I am getting a different result in a reduction summation in my code with PGI 20.1. I am not sure if it has to do with the new behavior or if there has been a major change in the summation implementation.

If I run the code on the CPU using GNU and on the GPU using PGI 19.10 I get identical results.
If I run it on the GPU with PGI 20.1, the results of the summations are off by a percent or so (much larger than floating-point differences should be).

My code is as follows.
Can you spot any issues that might be causing the problem? Does it have to do with this change in copy?
In the code, x, y, dp, and source are already on the device from enter data statements earlier.

      integer :: j,k
      real(r_typ) :: fn2_fn1,fs2_fs1
      real(r_typ), dimension(ntm1,npm1) :: x,y
c
      fn2_fn1=0.
      fs2_fs1=0.
c
!$acc parallel loop collapse(2) default(present) async(1)
      do k=2,npm2
        do j=2,ntm2
          y(j,k)=source(j,k)
     &          +coef(j,k,1)*x(j,  k-1)
     &          +coef(j,k,2)*x(j-1,k  )
     &          +coef(j,k,3)*x(j  ,k  )
     &          +coef(j,k,4)*x(j+1,k  )
     &          +coef(j,k,5)*x(j,  k+1)
        enddo
      enddo
c
!$acc parallel loop default(present) reduction(+:fn2_fn1,fs2_fs1)
!$acc& async(2)
      do k=1,npm2
        fn2_fn1=fn2_fn1+(x(2   ,k)-x(1   ,k))*dp(k)
        fs2_fs1=fs2_fs1+(x(ntm2,k)-x(ntm1,k))*dp(k)
      enddo
c
!$acc parallel loop default(present) async(3)
       do j=2,ntm2
         y(j,1)=source(j,1)
     &         +coef(j,1,1)*x(j  ,npm2)
     &         +coef(j,1,2)*x(j-1,1)
     &         +coef(j,1,3)*x(j  ,1)
     &         +coef(j,1,4)*x(j+1,1)
     &         +coef(j,1,5)*x(j  ,2)
c
         y(j,npm1)=y(j,1)
       enddo
c
!$acc parallel loop default(present) async(2)
       do k=1,npm1
         y(   1,k)=source(   1,k)+d2t_j1*fn2_fn1
         y(ntm1,k)=source(ntm1,k)+d2t_jntm1*fs2_fs1
       enddo
c
!$acc wait

Hi Ron,

This is related. The intended behavior of a parallel loop with both a reduction and an async clause is to delay the copy of the reduction value to the host until the wait is encountered. However, with the old behavior, a loop with a reduction would block while waiting to copy back the reduction value, which inhibited the async. The fix gets us back to the intended non-blocking behavior.

The problem here is that fn2_fn1 and fs2_fs1 have not yet been updated on the host, since the wait hasn’t been encountered, so the old values of “0” are used in the second async(2) loop, which reads them.

The easiest fix is to put these variables in a data region that starts before the first loop and ends after the wait directive. That way, the device values of fn2_fn1 and fs2_fs1 are used in the second loop rather than being copied from the host.
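A rough sketch of that data-region pattern is shown below. It is a stripped-down stand-in, not your actual routine: the program name, the arrays x and y, the size n and the scalar total (playing the role of fn2_fn1 and fs2_fs1) are all hypothetical, and the explicit present clause on the second loop is added only to make the intent obvious:

```fortran
program reduction_data_region
   implicit none
   integer, parameter :: n = 1000
   integer :: k
   real(8) :: x(n), y(n), total

   x = 1.0d0
   total = 0.0d0

   ! Keep the arrays and the reduction scalar on the device
   ! from before the first loop until after the wait.
!$acc data copyin(x) copyout(y) copy(total)

   ! "total" is already present, so the reduction result is
   ! left in the device copy and no host copy is needed here.
!$acc parallel loop default(present) reduction(+:total) async(1)
   do k = 1, n
      total = total + x(k)
   enddo

   ! Same queue, so this loop is ordered after the reduction.
   ! It reads the device copy of "total", not the host value.
!$acc parallel loop present(x,y,total) async(1)
   do k = 1, n
      y(k) = x(k) + total
   enddo

!$acc wait
!$acc end data

   ! The host copies are refreshed when the data region ends.
   print *, total, y(1)
end program reduction_data_region
```

In your code the equivalent would be a data region around the four loops that names fn2_fn1 and fs2_fs1, with the consuming loop kept on the same async queue as the reduction, as it already is.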

Hope this helps,
Mat

“The problem here is that fn2_fn1 and fs2_fs1 have not yet been updated on the host”

So even though they are on the device in the first reduction loop, they are recopied to and from the host for the second loop, but in 20.1, the first loop does not copy to the host yet, and the second loop doesn’t know that the scalars are on the device because of scoping?

Time to sift through my codes… :)

  • Ron