Hi all!
In my code I have four nested loops. To avoid a reduction, I marked the two innermost loops as sequential. According to the compiler output, however, PGI still seems to try to parallelize the inner loops.
code:
!$acc data copyout(scf) copyin(Dlocal,Clocal,endpht,lstpht,ilocal)
!$acc kernels
! Loop over sub-points
!$acc loop independent
do ispin = 1,nspin ! <-- 321
  !$acc loop independent private(ijl)
  do isp = 1,nsp ! <-- 325
    !$acc loop seq
    do ic = 1,nc ! <-- 328
      imp = endpht(ip-1) + ic
      i = lstpht(imp)
      il = ilocal(i)
      !$acc loop seq
      do jc = 1,ic ! <-- 335
        jl = ilocal(lstpht(endpht(ip-1) + jc)) !ilc(jc)
        if (il.gt.jl) then
          ijl = il*(il+1)/2 + jl + 1
        else
          ijl = jl*(jl+1)/2 + il + 1
        endif
        if (ic .eq. jc) then
          Dij = Dlocal(ijl,ispin)
        else
          Dij = 2*Dlocal(ijl,ispin)
        endif
        scf(isp,ip,ispin) = scf(isp,ip,ispin) + &
                            Dij*Clocal(isp,ic) * Clocal(isp,jc) !Cij(isp)
      enddo
    enddo
  enddo
enddo
!$acc end kernels
!$acc end data
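For comparison, here is a minimal standalone sketch of a variant that might be worth trying. It is untested on the accelerator. The array sizes, the test data and the accumulator name `acc_sum` are made up for illustration and are not from my real code. It differs from the loop above in two ways: the sum is kept in a private scalar and stored to `scf` once per (isp,ispin), and `scf` is placed in `copy` rather than `copyout`. The loop reads `scf` before writing it, and `copyout` does not transfer the host values to the device, which may be related to the differing results.

```fortran
! Sketch only: sizes, data and acc_sum are hypothetical.
! Without -acc the directives are plain comments, so this builds with any Fortran compiler.
program scf_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nspin = 2, nsp = 4, nc = 3, norb = 3, np = 1
  integer, parameter :: maxijl = norb*(norb+1)/2 + norb + 1
  integer  :: endpht(0:np), lstpht(nc), ilocal(norb)
  real(dp) :: Dlocal(maxijl,nspin), Clocal(nsp,nc), scf(nsp,np,nspin)
  integer  :: ip, ispin, isp, ic, jc, il, jl, ijl
  real(dp) :: Dij, acc_sum

  ! Made-up test data: one point, identity index maps, all weights 1
  endpht = (/ 0, nc /)
  lstpht = (/ 1, 2, 3 /)
  ilocal = (/ 1, 2, 3 /)
  Dlocal = 1.0_dp
  Clocal = 1.0_dp
  scf    = 0.0_dp
  ip     = 1

  ! copy (not copyout): scf is read before it is written
  !$acc data copy(scf) copyin(Dlocal,Clocal,endpht,lstpht,ilocal)
  !$acc kernels
  !$acc loop independent
  do ispin = 1, nspin
    !$acc loop independent private(il,jl,ijl,Dij,acc_sum)
    do isp = 1, nsp
      acc_sum = 0.0_dp              ! per-iteration accumulator
      !$acc loop seq
      do ic = 1, nc
        il = ilocal(lstpht(endpht(ip-1) + ic))
        !$acc loop seq
        do jc = 1, ic
          jl = ilocal(lstpht(endpht(ip-1) + jc))
          if (il .gt. jl) then
            ijl = il*(il+1)/2 + jl + 1
          else
            ijl = jl*(jl+1)/2 + il + 1
          endif
          if (ic .eq. jc) then
            Dij = Dlocal(ijl,ispin)
          else
            Dij = 2*Dlocal(ijl,ispin)
          endif
          acc_sum = acc_sum + Dij*Clocal(isp,ic)*Clocal(isp,jc)
        enddo
      enddo
      ! single store per (isp,ispin) instead of one per inner iteration
      scf(isp,ip,ispin) = scf(isp,ip,ispin) + acc_sum
    enddo
  enddo
  !$acc end kernels
  !$acc end data

  ! With all-ones data each entry is nc + 2*nc*(nc-1)/2 = nc**2
  if (any(abs(scf - real(nc*nc, dp)) > 1.0e-12_dp)) stop 1
  print *, scf(:,ip,1)
end program scf_sketch
```

Building this once with and once without `-acc` and comparing the printed values would show whether the two builds still disagree on this reduced case.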
output:
pgfortran -c -acc -ta=nvidia:4.0 -g -Minfo `FoX/FoX-config --fcflags` scf.f90
rhoofd:
     94, maxval reduction inlined
    134, Possible copy in and copy out of dscfl in call to matdot
    202, Invariant if transformation
    304, sum reduction inlined
    319, Generating copyout(scf(:,:,:))
         Generating copyin(ilocal(:))
         Generating copyin(lstpht(:))
         Generating copyin(endpht(:))
         Generating copyin(clocal(:,:))
         Generating copyin(dlocal(:,:))
    320, Generating copyin(endpht(:))
         Generating copyin(dlocal(:,:))
         Generating copyin(lstpht(:))
         Generating copyin(ilocal(:))
         Generating copyout(scf(:,:,:))
         Generating copyin(clocal(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    323, Loop is parallelizable
    325, Loop is parallelizable
    328, Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization
         Accelerator kernel generated
        323, !$acc loop gang ! blockidx%y
        325, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        328, CC 1.3 : 27 registers; 224 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 27 registers; 0 shared, 240 constant, 0 local memory bytes
    335, Complex loop carried dependence of 'scf' prevents parallelization
         Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization
Once again, the messages for line 320 are confusing: they repeat the data clauses already generated for line 319.
BTW, this piece of code produces different results when compiled with and without '-acc'. Any ideas?