!$ACC routine seq computes summation incorrectly

Hello there,
my OpenACC code computes the summation incorrectly when the -acc -ta=tesla:cc35,cc50,cc60 flags are enabled. I am using PGI 19.7. How can I fix this problem? Please see below:

[ilkhom@a083 MCCC]$ module load pgi/19.7
[ilkhom@a083 MCCC]$ cat sumcheck.f90 
program sumcheck 
!$acc routine (get_vmat)
 implicit none
 integer::i
 real*8, dimension(1:2,1:2,1:2) :: vmat

!$acc enter data create(vmat)
!$acc parallel
!$acc loop independent 
do i=1,2
 call get_vmat(i,vmat(i,1:2,1:2))
enddo
!$acc end parallel
!$acc exit data copyout(vmat)
print*,vmat(:,:,:)

end program sumcheck

subroutine get_vmat(i,vmat)
!$acc routine seq
implicit none
integer,intent(in)::i
real*8,intent(out),dimension(1:2,1:2)::vmat
real*8::t
integer::j,k,l

vmat(1:2,1:2)=0.d0
do l=1,2
 do j=1,2
  do k=1,2
   vmat(j,k)=vmat(j,k)+dble(j*j+k*k+l*l+i)
  enddo
 enddo
enddo
end subroutine get_vmat
[ilkhom@a083 MCCC]$ pgf90 -acc -ta=tesla:cc35,cc50,cc60 sumcheck.f90 && srun -N1 -n1 ./a.out 
  12.00000000000000     12.00000000000000     17.00000000000000    
  17.00000000000000     17.00000000000000     17.00000000000000    
  23.00000000000000     23.00000000000000   
[ilkhom@a083 MCCC]$ pgf90 sumcheck.f90 && srun -N1 -n1 ./a.out 
  11.00000000000000     13.00000000000000     17.00000000000000    
  19.00000000000000     17.00000000000000     19.00000000000000    
  23.00000000000000     25.00000000000000

Hi ilkhom,

Try updating your compiler to a more recent version: NVIDIA HPC SDK Current Release Downloads | NVIDIA Developer

I can recreate the problem in 19.7, but it seems to have been a known issue that was fixed sometime in early 2020.

If for some reason you can’t upgrade, I was also able to work around the issue by changing get_vmat to a vector routine:

% cat test.F90

subroutine get_vmat(i,vmat)
!$acc routine vector
implicit none
integer,intent(in)::i
real*8,intent(out),dimension(1:2,1:2)::vmat
real*8::t
integer::j,k,l

vmat(1:2,1:2)=0.d0
!$acc loop vector collapse(2)
do j=1,2
 do k=1,2
  do l=1,2
   vmat(j,k)=vmat(j,k)+dble(j*j+k*k+l*l+i)
  enddo
 enddo
enddo
end subroutine get_vmat

program sumcheck
 implicit none
!$acc routine(get_vmat) vector
 integer::i
 real*8, dimension(1:2,1:2,1:2) :: vmat

!$acc enter data create(vmat)
!$acc parallel loop gang
do i=1,2
 call get_vmat(i,vmat(i,1:2,1:2))
enddo
!$acc exit data copyout(vmat)
print*,vmat(:,:,:)

end program sumcheck

% pgf90 test.F90 -acc -V19.7; a.out
    11.00000000000000         13.00000000000000         17.00000000000000
    19.00000000000000         17.00000000000000         19.00000000000000
    23.00000000000000         25.00000000000000

Hope this helps,
Mat

Hi Mat,
thanks a lot for suggesting a workaround. How does !$acc routine seq compare to !$acc routine vector performance-wise?

In general, you want to expose as much parallelism as you can, but which approach is best depends heavily on the code. The one caveat to using vector routines is that, in order to support reductions inside routines, the compiler limits the vector length to 32 when routine vector is used. So if your main parallel loop is large, it may be better to keep the vector loop at that level. Also, you'll want the vector loop to correspond to the stride-1 dimension of the arrays (i.e. the first dimension in Fortran, since Fortran is column-major). So if that access happens inside the routine, it may be better to make it a vector routine.
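
For instance, here's a minimal sketch of the other schedule (assuming a compiler release where the routine seq issue above is fixed): the vector parallelism stays on the caller's "i" loop, which indexes the stride-1 first dimension of vmat, so the routine stays sequential.

program sumcheck
!$acc routine(get_vmat) seq
 implicit none
 integer::i
 real*8, dimension(1:2,1:2,1:2) :: vmat

!$acc enter data create(vmat)
! "i" indexes the first (stride-1) dimension of vmat, so putting
! the vector parallelism here gives coalesced memory access
!$acc parallel loop gang vector
do i=1,2
 call get_vmat(i,vmat(i,1:2,1:2))
enddo
!$acc exit data copyout(vmat)
print*,vmat(:,:,:)

end program sumcheck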

It’s difficult to give good advice on this example given that it’s trivial and the trip counts are so small. But assuming it’s representative of your real code, you want to balance exposing more parallelism (i.e. routine vector) against stride-1 access in the “i” loop (i.e. routine seq). Since it’s very easy in OpenACC to change schedules, the best thing to do is try different schedules and determine which is better. Use a profiler such as Nsight Systems, Nsight Compute, or the compiler’s runtime profiler, enabled via the environment variable NV_ACC_TIME=1, to record the kernel times for each schedule you try.
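
For example, to enable the runtime profiler for one run (note: on the 19.x PGI releases the variable is spelled PGI_ACC_TIME rather than NV_ACC_TIME):

% NV_ACC_TIME=1 ./a.out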

-Mat

Thank you for shedding some light. I will keep this in mind.