help understanding compiler information

Hi, I am trying to optimize a few routines in a Fortran code, but I am having trouble getting parallelization in a couple of them.

In one routine, grid, I have put compute directives around the main computational loop nest. Here is the compiler feedback:

grid:
    550, Loop interchange produces reordered loop nest: 551,550
         Loop unrolled 33 times (completely unrolled)
    551, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 1 times (completely unrolled)
    561, Generating copyin(rwx(:lr))
         Generating copyin(x3(:))
         Generating copyin(rwy(:lr))
         Generating copyin(y3(:))
         Generating copy(den(:,:))
         Generating copyin(w3(:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    562, Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
         Accelerator kernel generated
        562, !$acc do seq
             Non-stride-1 accesses for array 'y3'
             Non-stride-1 accesses for array 'x3'
             Non-stride-1 accesses for array 'mu'
             Non-stride-1 accesses for array 'w3'
             CC 1.0 : 34 registers; 112 shared, 48 constant, 0 local memory bytes; 12% occupancy
             CC 2.0 : 26 registers; 0 shared, 148 constant, 0 local memory bytes; 16% occupancy
    569, Complex loop carried dependence of 'den' prevents parallelization
         Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
    604, Loop unrolled 33 times (completely unrolled)
    609, Loop unrolled 33 times (completely unrolled)
    614, Loop interchange produces reordered loop nest: 615,614
         Loop unrolled 33 times (completely unrolled)
    615, Loop not vectorized: may not be beneficial

I am particularly interested in what “exposed use of den(:,:)” and “complex loop carried dependence of den” mean, as I expect these are the more important messages. Is it possible to parallelize my code, and do you have any advice on how to do so?


The second routine yields the following compiler info:

cpush:
    385, Generating copy(w2(:))
         Generating copy(y2(:))
         Generating copy(x2(:))
         Generating copy(nos(n))
         Generating copy(ke(n))
         Generating copy(efl(n))
         Generating copy(pfl(n))
         Generating copy(w1(:))
         Generating copy(w3(:))
         Generating copy(y1(:))
         Generating copy(x1(:))
         Generating copy(u2(:))
         Generating copyout(u3(:))
         Generating copy(u1(:))
         Generating copyin(rwx(:lr))
         Generating copy(x3(:))
         Generating copyin(rwy(:lr))
         Generating copy(y3(:))
         Generating copyin(ex(:,:))
         Generating copyin(ey(:,:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    386, Complex loop carried dependence of 'pfl' prevents parallelization
         Loop carried dependence due to exposed use of 'pfl(n)' prevents parallelization
         Complex loop carried dependence of 'efl' prevents parallelization
         Loop carried dependence due to exposed use of 'efl(n)' prevents parallelization
         Complex loop carried dependence of 'ke' prevents parallelization
         Loop carried dependence due to exposed use of 'ke(n)' prevents parallelization
         Complex loop carried dependence of 'nos' prevents parallelization
         Loop carried dependence due to exposed use of 'nos(n)' prevents parallelization
         Accelerator kernel generated
        386, !$acc do seq
             Non-stride-1 accesses for array 'w2'
             Non-stride-1 accesses for array 'y2'
             Non-stride-1 accesses for array 'x2'
             Non-stride-1 accesses for array 'w1'
             Non-stride-1 accesses for array 'w3'
             Non-stride-1 accesses for array 'y3'
             Non-stride-1 accesses for array 'y1'
             Non-stride-1 accesses for array 'x3'
             Non-stride-1 accesses for array 'x1'
             Non-stride-1 accesses for array 'u2'
             Non-stride-1 accesses for array 'u3'
             Non-stride-1 accesses for array 'u1'
             Non-stride-1 accesses for array 'mu'
             CC 1.0 : 43 registers; 256 shared, 44 constant, 0 local memory bytes; 8% occupancy
             CC 2.0 : 37 registers; 0 shared, 292 constant, 0 local memory bytes; 16% occupancy
    393, Loop is parallelizable

I could also post the subroutines if that would be helpful.

Thanks for any advice, or for documentation I should read,
Ben

Hi Ben,

I am particularly interested in what “exposed use of den(:,:)” and “complex loop carried dependence of den” mean, as I expect these are the more important messages. Is it possible to parallelize my code, and do you have any advice on how to do so?

By default, arrays are shared by all threads. Hence, to avoid race conditions, each thread may only write to unique elements of the array. The loop carried dependence message means that multiple threads could be writing to the same elements of “den”: for example, two different loop iterations could compute the same index and both try to update the same element.

I’d need to see the code to know exactly what the dependency is. However, if you are using a computed index or an index lookup table, the compiler can’t always tell that every index is unique. If you know that all the computed indices are unique, then you can use the loop “independent” clause to assert that the loop is indeed independent.
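For example, a minimal sketch of the “independent” clause (untested; “a”, “idx”, “w”, and “n” are hypothetical names, not from your code):

c$acc region
c$acc do independent
      do 10 m=1,n
         a(idx(m)) = a(idx(m)) + w(m)
 10   continue
c$acc end region

Keep in mind the clause is an assertion, not a request: if two iterations ever produce the same idx(m), you’ll silently get wrong answers.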

Another possibility is that “den” is a temporary array used to hold intermediate values within the body of the loop. In that case, you would need to privatize the array (via the “private” clause) so that the compiler creates a private copy of the array for every thread.
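As a sketch of the “private” clause (again untested; “tmp”, “a”, “b”, “n”, and “k” are hypothetical):

c$acc region
c$acc do private(tmp)
      do 10 i=1,n
c        each thread gets its own copy of the scratch array tmp
         do 20 j=1,k
            tmp(j) = 2.0*a(i,j)
 20      continue
         do 30 j=1,k
            b(i) = b(i) + tmp(j)
 30      continue
 10   continue
c$acc end region

Each i iteration now writes to its own tmp, so the iterations no longer conflict through it.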

If this is C, then you may need to add the C99 “restrict” keyword to your pointers, or use the flag “-Msafeptr”, to assert to the compiler that the memory your pointers point to doesn’t overlap with memory referenced by other pointers.

Finally, “den” could have a genuine loop-carried dependency, rendering the loop non-parallelizable. In that case you would need to rewrite your code to be parallel.

I could also post the subroutines if that would be helpful.

If the above doesn’t help, please do post an example.

  • Mat

Thanks Mat, that was very helpful.

I’m not exactly sure what the situation is. I have tried both the “independent” and “private” clauses, but I get either bad output or the same or slower times, so I’m thinking there is truly a dependency. If you wouldn’t mind, please take a look at grid, which corresponds to the first compiler info block in my original post (I added a compute directive around the main loop):

         subroutine grid

c    source quantities are calculated: n_i
c    right now only ion quantities are calculated...
c
         include  'slab.h'
c
         integer m,i,j,l
         real wx0,wx1,wy0,wy1
         real wght
         real xt,yt
         real rhog
c
c     here we set the rho and den equal to zero.
c

c

c
           do 50 i=0,im
              do 60 j=0,jm
                    den(i,j)=0.
 60           continue
 50        continue
c
c     note: that rho and density will be set the same at the end
c           density is a more accurate description of the 
c                   quantity than rho.
c
            dv=(dx*dy)
c$acc region
             do 100 m=1,mm
c
                wght=w3(m)/dv/float(lr)
	        rhog=sqrt(mu(m))/mims
c
c    continue with calculating grid quantities...
c
	do l=1,lr
                xt=x3(m)+rwx(l)*rhog
                yt=y3(m)+rwy(l)*rhog
c
                if(xt.gt.lx)  xt=2.*lx-xt
                if(xt.lt.0.)  xt=-xt
                if(xt.eq.lx)  xt=0.9999*lx
                if(yt.ge.ly)  yt=yt-ly
                if(yt.lt.0.)  yt=yt+ly

	if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                den(i,j)=den(i,j)+wght
c     
	else
                i=int(xt/dx)
                j=int(yt/dy)
c     
                wx0=float(i+1)-xt/dx 
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
                den(i,j)      =den(i,j)      +wght*wx0*wy0
                den(i+1,j)    =den(i+1,j)    +wght*wx1*wy0
                den(i,j+1)    =den(i,j+1)    +wght*wx0*wy1
                den(i+1,j+1)  =den(i+1,j+1)  +wght*wx1*wy1
	endif
c
	enddo ! end loop over 4pt avg
c
  100         continue
c$acc end region
c
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue
c
        return
        end
c

Hi Ben,

The code is calculating the i and j indices, so you would need to use the “independent” clause to get the code to accelerate. However, as you note, you’ll most likely get wrong answers if not all (i,j) pairs are unique. Worse, you also need the i,i+1 and j,j+1 pairs to be unique because of the forward references, i.e. “den(i+1,j)”, “den(i,j+1)”, and “den(i+1,j+1)”, and they need to be unique across every iteration of the m loop.

One thing you could do with this loop is to manually privatize “den” by adding a dimension that holds the “m” index, i.e. “denTmp(m,i,j)”, and then perform a reduction from the 3D array back to the 2D array after the m loop. However, only the m loop would run in parallel, so its loop bound, mm, would need to be sufficiently large for the parallelization to be profitable.

Note that if you do decide to create a denTmp array, be sure to put it in a “local” clause so it doesn’t get copied to/from the device. You’ll then need to put the initialization and reduction on the device as well, but that should be straightforward.

Below is the basic outline of what I’m thinking. The caveat is that I didn’t test this, so it could have some typos, but hopefully it illustrates what I have in mind.

  • Mat
         subroutine grid

c    source quantities are calculated: n_i
c    right now only ion quantities are calculated...
c
         include  'slab.h'
c
         integer m,i,j,l
         real wx0,wx1,wy0,wy1
         real wght
         real xt,yt
         real rhog
         real denTmp(0:mm,0:im,0:jm)  
c
c     here we set the rho and den equal to zero.
c

c

c
 !$acc data region copyout(den) local(denTmp)

!$acc region
       do 44 m=0,mm
           do 55 i=0,im
              do 66 j=0,jm
                    denTmp(m,i,j)=0.
 66          continue
 55        continue
 44        continue
!$acc end region


c
c     note: that rho and density will be set the same at the end
c           density is a more accurate description of the
c                   quantity than rho.
c
            dv=(dx*dy)
c$acc region
             do 100 m=1,mm
c
                wght=w3(m)/dv/float(lr)
           rhog=sqrt(mu(m))/mims
c
c    continue with calculating grid quantities...
c
        do l=1,lr
                xt=x3(m)+rwx(l)*rhog
                yt=y3(m)+rwy(l)*rhog
c
                if(xt.gt.lx)  xt=2.*lx-xt
                if(xt.lt.0.)  xt=-xt
                if(xt.eq.lx)  xt=0.9999*lx
                if(yt.ge.ly)  yt=yt-ly
                if(yt.lt.0.)  yt=yt+ly

        if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                denTmp(m,i,j)=denTmp(m,i,j)+wght
c     
        else
                i=int(xt/dx)
                j=int(yt/dy)
c     
                wx0=float(i+1)-xt/dx
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
                denTmp(m,i,j)      =denTmp(m,i,j)      +wght*wx0*wy0
                denTmp(m,i+1,j)    =denTmp(m,i+1,j)    +wght*wx1*wy0
                denTmp(m,i,j+1)    =denTmp(m,i,j+1)    +wght*wx0*wy1
                denTmp(m,i+1,j+1)  =denTmp(m,i+1,j+1)  +wght*wx1*wy1
        endif
c
        enddo ! end loop over 4pt avg
c
  100         continue
c$acc end region
c
   
! perform the reduction in parallel
!$acc region
          do 166 j=0,jm
             do 155 i=0,im
                  den(i,j) = 0.
                  do 144 m=0,mm
                    den(i,j) = den(i,j) + denTmp(m,i,j)
 144          continue
 155        continue
 166        continue
!$acc end region


!$acc end data region
! copy back den here

! you might be able to get the next loops to accelerate as well
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue

c
        return
        end
c

Hi again. I know it’s been a while, but I have another related question.

Above, we manually privatized den as you described (your solution works, by the way). What’s the difference between what we did and using the do “private” clause? Why doesn’t the do private directive work in this case?

Thanks,
Ben

Hi Ben,

Why doesn’t the do private directive work in this case?

If you privatize an array, every thread gets its own copy, whose lifetime is the same as the compute region’s. If you privatized “den”, each thread would accumulate into its own copy, and the contents would then be destroyed at the end of the region.

What you’d really want here is to be able to use the reduction clause. However, the reduction clause only works on scalars, hence you need to add the reduction code yourself.
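For contrast, here is the kind of scalar sum the compiler can reduce for you (a minimal sketch, untested; “total” is a hypothetical scalar):

      total = 0.
c$acc region
      do 10 m=1,mm
         total = total + w3(m)
 10   continue
c$acc end region

The compiler recognizes the accumulation into the scalar “total” and generates a parallel sum reduction. There is no equivalent for accumulating into computed array elements like den(i,j), which is why the manual denTmp reduction is needed.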

  • Mat

Thanks, I have another quick question:

My understanding is that “local” puts x in GPU memory and leaves it there for the duration of the data region (no copying back and forth), while “private” puts n copies of x on the GPU, where n is the number of threads, and these are destroyed when the compute region ends.

So why doesn’t it let me put a do private(x) inside a data region local(x)? Is it just because private(x) would be redundant, since x was already put on the GPU once?

PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable dentmp.

Thanks,
Ben

So why doesn’t it let me put a do private(x) inside a data region local(x)? Is it just because private(x) would be redundant, since x was already put on the GPU once?

With “local(x)”, you’ve created a single array in global memory shared by all threads. With “private(x)”, you’ve created N copies of x in global memory, where N is the number of threads, and each thread gets its own x.

Now when each thread references x, which x does it get? The shared one or the private one?

This is why x can’t be used in both a local and a private clause: the compiler can’t tell which x you want.
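In directive form, the rejected combination looks like this sketch (it should trigger the PGF90-S-0155 error you quoted; “x”, “y”, and “n” are hypothetical):

c$acc data region local(x)
c$acc region
c$acc do private(x)
      do 10 i=1,n
         x(1) = float(i)
         y(i) = x(1)
 10   continue
c$acc end region
c$acc end data region

Remove x from one of the two clauses, depending on whether you want the single shared copy or the per-thread copies.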

Hope this helps,
Mat