Manually collapsed loops give wrong answers

Because earlier versions of PGI Fortran (<=13.9) ignore the “collapse” clause on the “loop” directive, I have been using manually collapsed loops to emulate it, which generally gives much better performance than leaving the loops uncollapsed.

Recently I started using PGI v15.5 and found that my code with manually collapsed loops now fails. The link below contains a sample code and test data:

http://www4.ncsu.edu/~lluo6/tmp/ev_b.tar.bz2

For convenience I also include the loop section here:

is=iend-istr+1; js=jend-jstr+1
!$acc loop independent
do ijknn=0,(kend-kstr+1)*js*is*6-1
  ijk=ijknn/6; m=ijknn-ijk*6+1
  k=ijk/js/is; j=ijk/is-k*js; i=ijk-k*js*is-j*is+istr; j=j+jstr; k=k+kstr
  b(m,i,j,k) = ev(m,1,i-1,j,k) - ev(m,1,i,j,k)  &
             + ev(m,2,i,j-1,k) - ev(m,2,i,j,k)  &
             + ev(m,3,i,j,k-1) - ev(m,3,i,j,k)
end do

ev is the input data read from fort.80. b is the output.

To compile and run on the GPU (which produces output fort.81):
pgf90 -Mpreprocess -acc ev_b.f90 && ./a.out
To compile and run on the CPU (which produces output fort.82):
pgf90 -Mpreprocess ev_b.f90 && ./a.out

They should produce the same output. However, the GPU version always produces zeros. There is a commented section using standard loops, which always works correctly on both CPU and GPU.
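
For reference, the standard (uncollapsed) form looks roughly like this (a sketch; the exact commented code is in the tarball):

!$acc loop independent
do k=kstr,kend
  do j=jstr,jend
    do i=istr,iend
      do m=1,6
        b(m,i,j,k) = ev(m,1,i-1,j,k) - ev(m,1,i,j,k)  &
                   + ev(m,2,i,j-1,k) - ev(m,2,i,j,k)  &
                   + ev(m,3,i,j,k-1) - ev(m,3,i,j,k)
      end do
    end do
  end do
end do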

As far as I know, manually collapsed loops are a very common way to write code, and there is no non-standard Fortran syntax here. In older PGI versions manually collapsed loops worked perfectly. Does anyone know why they fail now?

Hi rikisyo,

Try moving the computation of “is” and “js” before the “kernels” region.

is=iend-istr+1; js=jend-jstr+1
!$acc kernels copyin(ev) copyout(b)

! Manually collapsed loops
!$acc loop independent
do ijknn=0,(kend-kstr+1)*js*is*6-1
  ijk=ijknn/6; m=ijknn-ijk*6+1
  k=ijk/js/is; j=ijk/is-k*js; i=ijk-k*js*is-j*is+istr; j=j+jstr; k=k+kstr
  b(m,i,j,k) = ev(m,1,i-1,j,k) - ev(m,1,i,j,k)  &
             + ev(m,2,i,j-1,k) - ev(m,2,i,j,k)  &
             + ev(m,3,i,j,k-1) - ev(m,3,i,j,k)
end do

This works for me.

  • Mat

That works! Thanks Mat!

This is really odd, though. There is no indication at all in the compiler feedback (-Minfo=accel) that the placement of a CPU scalar assignment would affect the GPU kernels.

My takeaway is that, just to be safe, I should wrap every GPU loop inside its own “kernels” construct.

why the placement of a CPU scalar assignment would affect the GPU kernels

This is one change in behavior between older compilers and newer ones. Previously, we didn’t handle sequential code in compute regions very well, so we would have hoisted the scalar assignments out of the compute region and performed them on the host. Later, when we added better support for sequential code, the scalar assignments started being performed on the device copies of these variables.

These variables are also used in the loop’s kernel launch configuration, which is evaluated on the host. Since the host copies of “is” and “js” are never updated, the values used there are uninitialized (they appear to happen to be zero, so the loop bounds come out as 0 to -1 and the kernel runs zero iterations).
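
In other words, the failing layout effectively looks like this (a sketch reconstructed from the description above, so the details may differ from the actual tarball code):

!$acc kernels copyin(ev) copyout(b)
! Sequential code inside the region: these assignments update the DEVICE copies of "is"/"js"
is=iend-istr+1; js=jend-jstr+1
!$acc loop independent
! The trip count below is evaluated on the HOST, where "is"/"js" are still uninitialized
do ijknn=0,(kend-kstr+1)*js*is*6-1
  ijk=ijknn/6; m=ijknn-ijk*6+1
  k=ijk/js/is; j=ijk/is-k*js; i=ijk-k*js*is-j*is+istr; j=j+jstr; k=k+kstr
  b(m,i,j,k) = ev(m,1,i-1,j,k) - ev(m,1,i,j,k)  &
             + ev(m,2,i,j-1,k) - ev(m,2,i,j,k)  &
             + ev(m,3,i,j,k-1) - ev(m,3,i,j,k)
end do
!$acc end kernels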

Hope this helps explain the issue.

  • Mat

So I just stumbled upon one of the more subtle implementation pitfalls. Thanks for the very helpful explanation! You have saved me a lot of future debugging effort.

After migrating all my previously hand-collapsed loops to the “collapse” clause in the correct way, most of them behave identically and the results are correct. However, one kernel seems to run about 10x slower than the manually collapsed version. The code listings are given below:

Manually collapsed version:

!$acc kernels loop independent
Do ip=0,ipoint(ipl)*nv*nv-1
  ie=ip/nv/nv
  ia=(ip-ie*nv*nv)/nv
  ib=ip-ie*nv*nv-ia*nv+1
  ie=ie+1
  ia=ia+1
  i=ijkmap(1,ie,ipl);j=ijkmap(2,ie,ipl);k=ijkmap(3,ie,ipl)
  dd(ib,ia,i,j,k) = dd(ib,ia,i,j,k) &
     - merge(0.,dot_product(ts(1,ib,:,ie),am(:,ia,i+1,j,k)),i==ii) &
     - merge(0.,dot_product(ts(2,ib,:,ie),bm(:,ia,i,j+1,k)),j==jj) &
     - merge(0.,dot_product(ts(3,ib,:,ie),cm(:,ia,i,j,k+1)),k==kk)
End Do
!$acc end kernels

Automatically collapsed version:

!$acc kernels loop independent collapse(3)
Do ie=1,ipoint(ipl); Do ia=1,nv; Do ib=1,nv
  i = ijkmap(1,ie,ipl); j = ijkmap(2,ie,ipl); k = ijkmap(3,ie,ipl)
  dd(ib,ia,i,j,k) = dd(ib,ia,i,j,k) &
     - merge(0.,dot_product(ts(1,ib,:,ie),am(:,ia,i+1,j,k)),i==ii) &
     - merge(0.,dot_product(ts(2,ib,:,ie),bm(:,ia,i,j+1,k)),j==jj) &
     - merge(0.,dot_product(ts(3,ib,:,ie),cm(:,ia,i,j,k+1)),k==kk)
End Do; End Do; End Do
!$acc end kernels

I checked the GPU code (keepgpu). It seems that the automatic version generates 3x more branching operations than the manual version.

It is interesting that all my other collapsed loops run very efficiently after the migration, including another one with almost the same structure.

I checked the GPU code (keepgpu). It seems that the automatic version generates 3x more branching operations than the manual version.

My guess is that the extra branches are just bounds checking and that the computational portion of the code doesn’t contain them. I’d be more inclined to look at the data access pattern of the arrays. In the auto-collapse case, the compiler may be generating inner strip-mine loops, which may not be beneficial when accessing the non-stride-1 arrays. Of course, I’m just guessing, so I’d recommend you use NVVP or nvprof to see if the profiler can identify the performance issue.

What does the compiler feedback message (-Minfo=accel) say about how the loop is being scheduled?
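
For example, something along these lines (substitute your actual source file name; the flags are the same ones used earlier in the thread):

pgf90 -Mpreprocess -acc -Minfo=accel my_code.f90
nvprof ./a.out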

  • Mat

Mat, you are right. There is something fishy going on. For the automatic collapse version, I see this output:

    162, Loop is parallelizable
         Accelerator kernel generated
        162, !$acc loop gang collapse(3) ! blockidx%x
               ! blockidx%x collapsed
        164, !$acc loop vector(128) ! threadidx%x
             Sum reduction generated for ts$r,ts$r48,ts$r49

ts$r, ts$r48, and ts$r49 are not part of my original code. They look like compiler-generated internal scalars to me.

“ts$r, ts$r48, ts$r49” are compiler-generated temporaries associated with the “ts” array and used in the dot products.

What the feedback message is telling me is that, since the compiler wants to expose as much parallelism as it can, it is applying a vector schedule to the dot-product loops (the dot products get inlined). This in turn introduces an inner-loop reduction, with all the associated overhead (temporary arrays, thread synchronization, set-up, a final reduction loop, etc.). In the vast majority of cases the compiler does a good job of finding the best schedule; unfortunately, this is not one of those cases.

What I would do is override the compiler’s default by explicitly setting the collapsed outer loops to a “gang vector” schedule. You might also need to switch from “kernels” to “parallel” here: with “kernels” the compiler is allowed more freedom, while with “parallel” it tries to honor exactly what you express.

!$acc kernels loop independent collapse(3) gang vector
Do ie=1,ipoint(ipl); Do ia=1,nv; Do ib=1,nv
  i = ijkmap(1,ie,ipl); j = ijkmap(2,ie,ipl); k = ijkmap(3,ie,ipl)
  ! ... same loop body as above ...
End Do; End Do; End Do
!$acc end kernels
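
And here is roughly what the “parallel” form would look like (a sketch, untested; I have added private(i,j,k) just to be explicit about the per-iteration scalars):

!$acc parallel loop collapse(3) gang vector private(i,j,k)
Do ie=1,ipoint(ipl); Do ia=1,nv; Do ib=1,nv
  i = ijkmap(1,ie,ipl); j = ijkmap(2,ie,ipl); k = ijkmap(3,ie,ipl)
  ! ... same loop body as above ...
End Do; End Do; End Do
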
  • Mat

Adding “gang vector” (actually, “vector” alone is enough) solves the problem. I think I will put this on every collapsed loop, just to be safe.
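
For the record, the directive I ended up with looks like this (just the extra clause added to the collapsed loop above):

!$acc kernels loop independent collapse(3) vector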

Thanks again Mat!