Hello there,
my OpenACC code computes a summation incorrectly when the -acc -ta=tesla:cc35,cc50,cc60 flags are on. I am using PGI 19.7. How can I fix this problem? Please see below:
[ilkhom@a083 MCCC]$ module load pgi/19.7
[ilkhom@a083 MCCC]$ cat sumcheck.f90
program sumcheck
!$acc routine (get_vmat)
  implicit none
  integer :: i
  real*8, dimension(1:2,1:2,1:2) :: vmat
!$acc enter data create(vmat)
!$acc parallel
!$acc loop independent
  do i = 1, 2
    call get_vmat(i, vmat(i,1:2,1:2))
  enddo
!$acc end parallel
!$acc exit data copyout(vmat)
  print *, vmat(:,:,:)
end program sumcheck
subroutine get_vmat(i, vmat)
!$acc routine seq
  implicit none
  integer, intent(in) :: i
  real*8, intent(out), dimension(1:2,1:2) :: vmat
  real*8 :: t
  integer :: j, k, l
  vmat(1:2,1:2) = 0.d0
  do l = 1, 2
    do j = 1, 2
      do k = 1, 2
        vmat(j,k) = vmat(j,k) + dble(j*j + k*k + l*l + i)
      enddo
    enddo
  enddo
end subroutine get_vmat
[ilkhom@a083 MCCC]$ pgf90 -acc -ta=tesla:cc35,cc50,cc60 sumcheck.f90 && srun -N1 -n1 ./a.out
12.00000000000000 12.00000000000000 17.00000000000000
17.00000000000000 17.00000000000000 17.00000000000000
23.00000000000000 23.00000000000000
[ilkhom@a083 MCCC]$ pgf90 sumcheck.f90 && srun -N1 -n1 ./a.out
11.00000000000000 13.00000000000000 17.00000000000000
19.00000000000000 17.00000000000000 19.00000000000000
23.00000000000000 25.00000000000000
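The likely culprit is the call passing the non-contiguous array section vmat(i,1:2,1:2). To make the dummy argument contiguous, the compiler has to create a temporary and copy it back after the call, and inside a device parallel region that temporary is not safe across threads, which corrupts the sums. A minimal sketch of a workaround, under that assumption (this diagnosis is my reading, not something confirmed by compiler feedback shown here): pass the whole array plus the loop index and do the indexing inside the routine, so no temporary is needed.

program sumcheck
!$acc routine (get_vmat)
  implicit none
  integer :: i
  real*8, dimension(1:2,1:2,1:2) :: vmat
!$acc enter data create(vmat)
!$acc parallel
!$acc loop independent
  do i = 1, 2
    call get_vmat(i, vmat)   ! pass the full array: no section temporary
  enddo
!$acc end parallel
!$acc exit data copyout(vmat)
  print *, vmat(:,:,:)
end program sumcheck

subroutine get_vmat(i, vmat)
!$acc routine seq
  implicit none
  integer, intent(in) :: i
  real*8, intent(inout), dimension(1:2,1:2,1:2) :: vmat
  integer :: j, k, l
  vmat(i,:,:) = 0.d0
  do l = 1, 2
    do j = 1, 2
      do k = 1, 2
        ! each thread updates only its own slice vmat(i,:,:), so no race
        vmat(i,j,k) = vmat(i,j,k) + dble(j*j + k*k + l*l + i)
      enddo
    enddo
  enddo
end subroutine get_vmat

With each thread writing only its own vmat(i,:,:) slice, the GPU and host results should then agree.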
In general, you want to expose as much parallelism as you can, but the right choice depends heavily on the code. The one caveat with vector routines is that, in order to support reductions in routines, the compiler limits the vector length to 32 when routine vector is used. So if your main parallel loop is large, it may be better to have the vector loop at that level. Also, you’ll want the vector loop to correspond to the stride-1 dimension of the arrays (i.e. the first, leftmost dimension in Fortran, since arrays are column-major). So if that access is inside the routine, it may be better to make it a vector routine.
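For illustration, a routine vector variant of get_vmat could look like the sketch below. It is my sketch, not code from this thread, and it keeps the rank-2 interface, so it assumes the argument-passing issue above has been solved some other way (e.g. by reordering vmat so that i is the last index, making vmat(:,:,i) a contiguous section):

subroutine get_vmat(i, vmat)
!$acc routine vector
  implicit none
  integer, intent(in) :: i
  real*8, intent(out), dimension(1:2,1:2) :: vmat
  integer :: j, k, l
  ! vector loop over j, the stride-1 dimension of vmat;
  ! note the compiler caps the vector length at 32 for vector routines
!$acc loop vector
  do j = 1, 2
    do k = 1, 2
      vmat(j,k) = 0.d0
      do l = 1, 2
        vmat(j,k) = vmat(j,k) + dble(j*j + k*k + l*l + i)
      enddo
    enddo
  enddo
end subroutine get_vmat

The caller’s declaration then becomes !$acc routine (get_vmat) vector, and the "i" loop is scheduled as a gang loop.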
It’s difficult to give good advice on this example given that it’s trivial and the trip counts are so small. But assuming it’s representative of your real code, you want to balance exposing more parallelism (i.e. routine vector) against keeping stride-1 access in the “i” loop (i.e. routine seq). Since it’s very easy in OpenACC to change schedules, the best thing to do is to try different schedules and measure which is better. Use a profiler such as Nsight Systems, Nsight Compute, or the compiler’s runtime profiler, enabled via the environment variable PGI_ACC_TIME=1 (NV_ACC_TIME=1 in the newer NVHPC compilers), to record the kernel times for each schedule you try.
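For example, a profiling run of one schedule could look like this (a sketch; -Minfo=accel reports the schedule the compiler actually generated for each loop, and the profiler output is omitted here):

$ pgf90 -acc -ta=tesla:cc35,cc50,cc60 -Minfo=accel sumcheck.f90
$ export PGI_ACC_TIME=1
$ srun -N1 -n1 ./a.out

Comparing the per-kernel times reported for each variant tells you which schedule wins on your actual problem sizes.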