PGI attempts to parallelize sequential loop

Hi all!

  1. In my code I have four nested loops. To avoid a reduction I mark the two innermost loops as sequential. According to the compiler output, PGI tries to parallelize the inner loops.

code:

!$acc data copyout(scf) copyin(Dlocal,Clocal,endpht,lstpht,ilocal)
!$acc kernels
!  Loop over sub-points
!$acc loop independent
        do ispin = 1,nspin  ! <-- 323
!$acc loop independent private(ijl)
           do isp = 1,nsp    ! <-- 325
!$acc loop seq
              do ic = 1,nc     ! <-- 328
                 imp = endpht(ip-1) + ic
                 i = lstpht(imp)
                 il = ilocal(i)
!$acc loop seq
                 do jc = 1,ic   ! <-- 335
                    jl = ilocal(lstpht(endpht(ip-1) + jc)) !ilc(jc)

                    if (il.gt.jl) then
                       ijl = il*(il+1)/2 + jl + 1
                    else
                       ijl = jl*(jl+1)/2 + il + 1
                    endif
                    if (ic .eq. jc) then
                       Dij = Dlocal(ijl,ispin)
                    else
                       Dij = 2*Dlocal(ijl,ispin)
                    endif

                    scf(isp,ip,ispin) = scf(isp,ip,ispin) + &
                        Dij*Clocal(isp,ic) * Clocal(isp,jc)    !Cij(isp)
                 enddo
              enddo
           enddo
        enddo
!$acc end kernels
!$acc end data

output:

pgfortran -c -acc -ta=nvidia:4.0 -g -Minfo   `FoX/FoX-config --fcflags`   scf.f90
rhoofd:
     94, maxval reduction inlined
    134, Possible copy in and copy out of dscfl in call to matdot
    202, Invariant if transformation
    304, sum reduction inlined
    319, Generating copyout(scf(:,:,:))
         Generating copyin(ilocal(:))
         Generating copyin(lstpht(:))
         Generating copyin(endpht(:))
         Generating copyin(clocal(:,:))
         Generating copyin(dlocal(:,:))
    320, Generating copyin(endpht(:))
         Generating copyin(dlocal(:,:))
         Generating copyin(lstpht(:))
         Generating copyin(ilocal(:))
         Generating copyout(scf(:,:,:))
         Generating copyin(clocal(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    323, Loop is parallelizable
    325, Loop is parallelizable
    328, Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization
         Accelerator kernel generated
        323, !$acc loop gang ! blockidx%y
        325, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        328, CC 1.3 : 27 registers; 224 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 27 registers; 0 shared, 240 constant, 0 local memory bytes
    335, Complex loop carried dependence of 'scf' prevents parallelization
         Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization
  1. Once again, about the confusing messages on line 320.

  2. BTW, this piece of code produces different results when compiled with and without ‘-acc’. Any ideas?

Hi Alexey,

  1. In my code I have four nested loops. To avoid a reduction I mark the two innermost loops as sequential. According to the compiler output, PGI tries to parallelize the inner loops.

The compiler is just printing out the analysis information, but isn’t actually parallelizing the inner two loops.
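
For reference, here is a sketch of the schedule the -Minfo output above reports, written out as explicit clauses. Only the directive skeleton is shown, and I'm assuming the clauses map one-to-one to the reported schedule; the loop bodies are exactly as posted:

!$acc kernels
!$acc loop independent gang                            ! 323: blockidx%y
      do ispin = 1,nspin
!$acc loop independent gang vector(128) private(ijl)   ! 325: blockidx%x, threadidx%x
         do isp = 1,nsp
!$acc loop seq                                         ! 328: left sequential
            do ic = 1,nc
!$acc loop seq                                         ! 335: left sequential
               do jc = 1,ic
                  ! ... body exactly as posted above ...
               enddo
            enddo
         enddo
      enddo
!$acc end kernels

The inner two loops get no gang or vector clause at all, which is why they still run sequentially within each thread.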

  1. Once again, about the confusing messages on line 320.

Yep. These are actually “present” checks to allow for things like pointer swapping within data regions. The issue is being tracked as TPR#18858.
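
If you want to make the intent explicit in the meantime, one option is to assert presence on the compute construct so the enclosing data region is clearly the one responsible for the transfers. A sketch of the directive change only (the loop nest is unchanged, and whether it quiets those particular messages depends on the compiler version):

!$acc data copyout(scf) copyin(Dlocal,Clocal,endpht,lstpht,ilocal)
!$acc kernels present(scf,Dlocal,Clocal,endpht,lstpht,ilocal)
!     ... the four nested loops exactly as posted ...
!$acc end kernels
!$acc end data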

  1. BTW, this piece of code produces different results when compiled with and without ‘-acc’. Any ideas?

I’d need a reproducing example to tell. Though, I’d start by simplifying things: remove the data region and the loop clauses, then add them back one by one, starting with the outer loop and finishing with the data region.
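
A sketch of that progression (only the directives change between steps; the loop nest itself stays exactly as posted):

! Step 1: bare compute region only -- no data region, no loop directives.
!$acc kernels
!        (four nested loops exactly as posted)
!$acc end kernels

! Step 2: re-add "!$acc loop independent" on the ispin loop, rebuild, and
!         compare the results against the host (no -acc) build.
! Step 3: re-add "!$acc loop independent private(ijl)" on the isp loop and
!         the two "!$acc loop seq" directives, rebuild, compare again.
! Step 4: finally wrap everything back in the "!$acc data" region and compare.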

Hope this helps,
Mat

  1. In my code I have four nested loops. To avoid a reduction I mark the two innermost loops as sequential. According to the compiler output, PGI tries to parallelize the inner loops.

The compiler is just printing out the analysis information, but isn’t actually parallelizing the inner two loops.

In this case I’d consider these messages confusing: I marked those loops as seq explicitly, so I don’t want to see any parallelization info about them.

I’ve complained about this as well. The problem is that the analysis is done before the directives are applied. Though, I’ll pass this along since customer complaints tend to get higher priority than when I complain ;).

- Mat