Loop reordering with compiler directives

Hi,

I am currently porting a large code to the acc directives. It consists of many parts that typically have the following structure:

Version A:
!$acc region
!init  
do k=1,nlev
  do i=1,N
    a(i,k)=0.0D0
  end do
end do
 ! first layer
do i=1,N
  a(i,1)=0.1D0
end do
! vertical computation
do k=2,nlev
  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region

Of course, to get better performance one needs to pull the i loop outermost here:

Version B:
!$acc region 
   do i=1,N
      !init
      do k=1,nlev   
         a(i,k)=0.0D0 
      end do      
      ! first layer
      a(i,1)=0.1D0      
      ! vertical computation
      do k=2,nlev
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
      end do
      
   end do
!$acc end region

For most of the code, this loop reordering alone is enough to get good performance (for the most intensive parts we do have to introduce some additional optimizations, like replacing local matrices with scalars …; a sketch follows below).
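To make the "replacing local matrices with scalars" remark concrete, here is a minimal sketch on the toy loops of Version B; 'prev' is an illustrative real*8 scalar, not a name from our actual code:

!$acc region
do i=1,N
   ! init (unchanged)
   do k=1,nlev
      a(i,k)=0.0D0
   end do
   ! first layer, kept in a scalar for the recurrence below
   prev=0.1D0
   a(i,1)=prev
   ! vertical computation: a(i,k-1) is carried in 'prev'
   ! instead of being re-read from the array
   do k=2,nlev
      prev=0.95D0*prev+exp(-2*a(i,k)**2)*a(i,k)
      a(i,k)=prev
   end do
end do
!$acc end region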

For the vast majority of the code, where this loop reordering alone is enough, we would like to keep the same source code for the CPU and GPU versions.

As mentioned in one of the PGI Insider articles, we could in principle keep the code in version B form; the compiler should be able to transform it into version A when compiling for a CPU. However, we have seen that this is not always the case with all compilers, and we would like to keep the source code in CPU-optimized form (i.e. version A), as most users are targeting that architecture.

For the moment we are considering writing our own preprocessing scripts to do this loop reordering. I was just wondering whether you were considering adding such functionality to the acc directives? (Something like !$acc swap(i,k) …)
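To illustrate the kind of directive I have in mind, the vertical computation could stay in its CPU-friendly order (version A) and just be annotated. This is purely hypothetical syntax that no compiler accepts today:

!$acc region
! hypothetical: ask the compiler to interchange the loops for the GPU
!$acc swap(i,k)
do k=2,nlev
   do i=1,N
      a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
   end do
end do
!$acc end region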

Thanks,

Xavier

Xavier: I’ve created a dummy version of a loop like yours, but I’m not sure what performance differences you’re seeing for the two versions. Can you tell me representative sizes for ‘n’ and ‘nlev’? I will do some experiments to see how to get the performance you want without modifying the loop structure.
-mw

To continue my last note, what I’m interested in is whether the performance impact comes from treating the whole region as a single kernel, or from the kernel schedules used when they are three separate kernels.

-mw

Hi Michael,

Indeed, I forgot to give the size of the problem: N=10000 and nlev=60.

I also think that treating the whole region as a single kernel is probably what makes version B much faster. (I am seeing a difference of about a factor of 2.7.)

In particular, there is a k dependence in the last loop nest (labelled "vertical computation" in the example). In version A, I think the kernel generated for the i loop:

  do i=1,N 
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k) 
  end do

is launched nlev-1 times, which is probably not very efficient.
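In other words, my mental model of what happens in version A is roughly the following (a sketch of my understanding, not actual compiler output): the k loop effectively stays sequential and each iteration launches the i-loop kernel.

do k=2,nlev
   ! one kernel launch per k iteration
   !$acc region
   do i=1,N
      a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
   end do
   !$acc end region
end do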

-Xavier

Hi

I am trying to revive this post. I am still interested in knowing the best suggested approach, given a code in form A, to have it run efficiently on both CPU and GPU.

To determine whether the difference came from having everything in one region, here are two modified examples:

Version A2

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region
   do k=1,nlev
      do i=1,N
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region
   do k=2,nlev
      do i=1,N
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 ! elapsed wall-clock time in milliseconds
 t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*60*dt1(5))
 t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*60*dt2(5))
 rt = (t2 - t1)/1000.
write(*,"(A,I0,A,I0)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)") 'time per step (us) =', rt/nt * 1E6 
print*, 'sum(a)=',sum(a)
end program main

and

Version B2

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region do kernel
   do i=1,N
      do k=1,nlev
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region do kernel
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region do kernel
   do i=1,N
      do k=2,nlev
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region 


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 ! elapsed wall-clock time in milliseconds
 t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*60*dt1(5))
 t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*60*dt2(5))
 rt = (t2 - t1)/1000.
write(*,"(A,I0,A,I0)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)")  'time per step (us) =', rt/nt * 1E6
print*, 'sum(a)=',sum(a)
end program main

Version B2 is about twice as fast as A2 on the GPU, but on the CPU A2 is faster. Note that for the CPU I compiled only with -O3 optimization; I am assuming here that not all compilers will do the loop transformation from GPU-optimal to CPU-optimal order.

Thanks,

Xavier

Hi Xavier,

Note that for the CPU I compiled only with -O3 optimization; I am assuming here that not all compilers will do the loop transformation from GPU-optimal to CPU-optimal order.

Use -fast instead. -fast will perform a loop interchange and vectorize the code. On my system both A and B take 8 seconds with -fast, versus 10 and 13 seconds with -O3.

As for the GPU versions, in both cases the k loop can't be parallelized, so it is made sequential in the kernel. The difference is that in the A version the compiler uses a 256x2 shared-memory cache, while in the B version it uses no caching. If I use the flag "-ta=nvidia,nocache" on A and adjust the schedule to use vector(256), I get the same time as B (a rough sketch follows below). My assumption is that the overhead of the cache outweighs its gains in this example.
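Roughly, the adjusted schedule on the vertical computation of A2 looks like this (the exact directive placement is illustrative; the compile line in the comment assumes the pgfortran driver and a file called a2.f90):

! build with something like: pgfortran -fast -ta=nvidia,nocache a2.f90
!$acc region
!$acc do seq
do k=2,nlev
   !$acc do parallel vector(256)
   do i=1,N
      a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
   end do
end do
!$acc end region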

I’ll ask Michael to take a look once he’s back in the office next week.

Best Regards,
Mat

As Mat points out, the compiler is trying to use the GPU's cache (shared memory), and in this case that is a bad decision. We'll work on that decision process. I created a version for the GPU that runs as fast as b.f90 and still retains the same host performance.

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region do seq
   do k=1,nlev
      !$acc do parallel vector(256)
      do i=1,N
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region do kernel parallel vector(256)
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region do seq
   do k=2,nlev
      !$acc do parallel vector(256)
      do i=1,N
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 ! elapsed wall-clock time in milliseconds
 t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*60*dt1(5))
 t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*60*dt2(5))
 rt = (t2 - t1)/1000.
write(*,"(A,I0,A,I0)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)")  'time per step (us) =', rt/nt * 1E6
print*, 'sum(a)=',sum(a)
end program main

Note the 'parallel vector(256)' clauses on the 'i' loops, which tell the compiler to focus only on those loops and to run them in parallel in vector blocks of 256.

Hi,

Thanks, this is really what I was looking for!

Yesterday I also tried adding an explicit 'do seq' (for k) and 'parallel vector(256)' (for i), and found the same performance for case A as for case B, provided that I compile with -ta=nvidia,nocache.

Xavier

Hi,

I have been exploring the nocache option further. One sees that whether it is beneficial or not depends on the kernel. Is there any thought of adding a nocache(list) clause (like the cache clause) to the loop mapping directive? I think it could be very useful; a sketch of the syntax I have in mind follows below.
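Purely hypothetical, of course; no compiler accepts a per-loop nocache clause today:

!$acc region
!$acc do seq
do k=2,nlev
   ! hypothetical nocache(a) clause, analogous to the existing cache clause
   !$acc do parallel vector(256) nocache(a)
   do i=1,N
      a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
   end do
end do
!$acc end region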

Thanks,

Xavier