Problem with negative increment using PGI 12.3

Hi,

I am seeing differences beyond what can be expected from rounding error in an acc code. I have been able to isolate one of the kernels and reproduce the problem in a small test code:

program main
  implicit none
  integer*4 :: N,ke_soil,i,istarts,iends,kso
  real*8, allocatable :: zagc(:,:), zagd(:,:), zage(:,:), t_so_new(:,:)
  logical, allocatable :: llandmask(:)

  N=1E3
  ke_soil=4
  istarts=1
  iends=N
  allocate(llandmask(N), zagc(N,0:ke_soil), zagd(N,0:ke_soil), zage(N,0:ke_soil+1))
  allocate(t_so_new(N,0:ke_soil))

  ! init
  DO i=istarts, iends
     llandmask(i)=.TRUE.
     zage(i,ke_soil+1)=290.0*(1.0 + 1.0/(i))
  END DO

  DO kso=0,ke_soil
     DO i=istarts, iends
        zage(i,kso)=(1.0 + 1.0/(i+kso))
        zagd(i,kso)=290.0*(1.0 + 1.0/(i+kso))
        zagc(i,kso)=(1.0 + 1.0/(i+kso))/288.0
     END DO
  END DO


  
  !$acc region do seq
  DO kso = ke_soil,0,-1
     !$acc do kernel
      DO i = istarts, iends
        IF (llandmask(i)) THEN          ! land-points only
          zage(i,kso)     = zagd(i,kso) - zagc(i,kso)*zage(i,kso+1)
        ! The surface temperature computed by t_so(i,0,nnew)=zage(i,0) is
        ! presently unused
          t_so_new(i,kso) = zage(i,kso)
        END IF          ! land-points only
      END DO
  END DO                ! soil layers
  !$acc end region



  DO kso=0,ke_soil
     print*, 'kso, mean(t_so_new)', kso, sum(t_so_new(:,kso))/N
  END DO


end program main

If I compile with:
-ta=host

I get:

aprun -n 1 ./test_k_neg_loop
kso, mean(t_so_new) 0 291.1523301069119
kso, mean(t_so_new) 1 290.8646212574333
kso, mean(t_so_new) 2 290.7209103217591
kso, mean(t_so_new) 3 290.6252217333774
kso, mean(t_so_new) 4 290.5473285022244

With -ta=nvidia I get:

aprun -N 1 ./test_k_neg_loop
kso, mean(t_so_new) 0 292.1672628741607
kso, mean(t_so_new) 1 291.8775595150368
kso, mean(t_so_new) 2 291.7328524090394
kso, mean(t_so_new) 3 291.6364771839715
kso, mean(t_so_new) 4 290.5473285022244


I actually don’t know whether the issue is coming from the negative increment in the k loop or from the arrays starting at index 0.

Thanks for your help,

Xavier

Hi Xavier,

It looks like the compiler is caching zage in [256x2] blocks, which is most likely causing a data synchronization issue. The workaround would be either to disable caching with “-ta=nvidia,nocache” or to invert your loops:

  !$acc region do kernel
  DO i = istarts, iends
    DO kso = ke_soil,0,-1
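
For completeness, a minimal sketch of the fully inverted region, keeping the original loop body and directives (an untested illustration rather than a verified fix):

  !$acc region do kernel
  DO i = istarts, iends
    DO kso = ke_soil,0,-1
      IF (llandmask(i)) THEN          ! land-points only
        ! the recurrence in kso (zage(i,kso) depends on zage(i,kso+1))
        ! now runs sequentially within each thread
        zage(i,kso)     = zagd(i,kso) - zagc(i,kso)*zage(i,kso+1)
        t_so_new(i,kso) = zage(i,kso)
      END IF
    END DO
  END DO
  !$acc end region

This keeps the parallel work on the i loop while each thread walks its own column of zage from ke_soil down to 0, so no block of zage needs to be shared across threads.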



% pgf90 -ta=nvidia,time neg.f90 -V12.4 -Minfo
main:
     32, Generating copyin(zagd(1:1000,0:4))
         Generating copy(zage(1:1000,0:5))
         Generating copyin(zagc(1:1000,0:4))
         Generating copy(t_so_new(1:1000,0:4))
         Generating copyin(llandmask(1:1000))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     33, Loop carried dependence of 'zage' prevents parallelization
         Loop carried backward dependence of 'zage' prevents vectorization
     35, Loop is parallelizable
         Accelerator kernel generated
         33, Cached references to size [256x2] block of 'zage'
         35, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 10 registers; 4172 shared, 16 constant, 0 local memory bytes; 75% occupancy
             CC 2.0 : 15 registers; 4116 shared, 72 constant, 0 local memory bytes; 100% occupancy
     49, sum reduction inlined
% a.out
 kso, mean(t_so_new)            0    292.1672628741607     
 kso, mean(t_so_new)            1    291.8775595150368     
 kso, mean(t_so_new)            2    291.7328524090394     
 kso, mean(t_so_new)            3    291.6364771839715     
 kso, mean(t_so_new)            4    290.5473285022244     

% pgf90 -ta=nvidia,nocache,time neg.f90 -V12.4 -Minfo
main:
     32, Generating copyin(zagd(1:1000,0:4))
         Generating copy(zage(1:1000,0:5))
         Generating copyin(zagc(1:1000,0:4))
         Generating copy(t_so_new(1:1000,0:4))
         Generating copyin(llandmask(1:1000))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     33, Loop carried dependence of 'zage' prevents parallelization
         Loop carried backward dependence of 'zage' prevents vectorization
     35, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc do seq
         35, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 11 registers; 60 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 20 registers; 4 shared, 72 constant, 0 local memory bytes; 100% occupancy
     49, sum reduction inlined
% a.out
 kso, mean(t_so_new)            0    291.1523301069119     
 kso, mean(t_so_new)            1    290.8646212574333     
 kso, mean(t_so_new)            2    290.7209103217591     
 kso, mean(t_so_new)            3    290.6252217333774     
 kso, mean(t_so_new)            4    290.5473285022244

I’ve added TPR#18671 to track this issue.

Best Regards,
Mat