Optimizing 3D stencil loops in OpenACC

Hi,

45% of my code’s runtime is spent in an “A*x” sparse matrix-vector product.

I have spent a fair amount of time playing with loop clauses (tile, cache, etc.), and so far the fastest solution has been to just use “loop, loop, loop” as shown below.

I am wondering if there is an example 3D stencil code where the optimal clauses were found.

Any other ways to optimize the code below?

(I am using Titan XP, P100, and V100 GPUs and PGI 17.9)

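c
c ****** Get the r component.
c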
!$acc parallel default(present) present(ps) private(ii)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm-1
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r( 1,i,j,k)*ps%r(i  ,j  ,k-1)
     &           +a_r( 2,i,j,k)*ps%r(i  ,j-1,k  )
     &           +a_r( 3,i,j,k)*ps%r(i-1,j  ,k  )
     &           +a_r( 4,i,j,k)*ps%r(i  ,j  ,k  )
     &           +a_r( 5,i,j,k)*ps%r(i+1,j  ,k  )
     &           +a_r( 6,i,j,k)*ps%r(i  ,j+1,k  )
     &           +a_r( 7,i,j,k)*ps%r(i  ,j  ,k+1)
     &           +a_r( 8,i,j,k)*ps%t(i  ,j-1,k  )
     &           +a_r( 9,i,j,k)*ps%t(i+1,j-1,k  )
     &           +a_r(10,i,j,k)*ps%t(i  ,j  ,k  )
     &           +a_r(11,i,j,k)*ps%t(i+1,j  ,k  )
     &           +a_r(12,i,j,k)*ps%p(i  ,j  ,k-1)
     &           +a_r(13,i,j,k)*ps%p(i+1,j  ,k-1)
     &           +a_r(14,i,j,k)*ps%p(i  ,j  ,k  )
     &           +a_r(15,i,j,k)*ps%p(i+1,j  ,k  )
          enddo
        enddo
      enddo
c
c ****** Get the t component.
c
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm-1
!$acc loop
          do i=2,nrm1
            ii=(npm2*ntm2*(nrm-2))
     &         +(ntm-2)*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &           a_t( 1,i,j,k)*ps%r(i-1,j  ,k  )
     &          +a_t( 2,i,j,k)*ps%r(i  ,j  ,k  )
     &          +a_t( 3,i,j,k)*ps%r(i-1,j+1,k  )
     &          +a_t( 4,i,j,k)*ps%r(i  ,j+1,k  )
     &          +a_t( 5,i,j,k)*ps%t(i  ,j  ,k-1)
     &          +a_t( 6,i,j,k)*ps%t(i  ,j-1,k  )
     &          +a_t( 7,i,j,k)*ps%t(i-1,j  ,k  )
     &          +a_t( 8,i,j,k)*ps%t(i  ,j  ,k  )
     &          +a_t( 9,i,j,k)*ps%t(i+1,j  ,k  )
     &          +a_t(10,i,j,k)*ps%t(i  ,j+1,k  )
     &          +a_t(11,i,j,k)*ps%t(i  ,j  ,k+1)
     &          +a_t(12,i,j,k)*ps%p(i  ,j  ,k-1)
     &          +a_t(13,i,j,k)*ps%p(i  ,j+1,k-1)
     &          +a_t(14,i,j,k)*ps%p(i  ,j  ,k  )
     &          +a_t(15,i,j,k)*ps%p(i  ,j+1,k  )
          enddo
        enddo
      enddo
c
c ****** Get the p component.
c
!$acc loop
      do k=2,npm-1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm1
            ii=(npm2*ntm2*(nrm-2))+(npm2*(ntm-2)*nrm2)
     &         +ntm2*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &            a_p( 1,i,j,k)*ps%r(i-1,j  ,k  )
     &           +a_p( 2,i,j,k)*ps%r(i  ,j  ,k  )
     &           +a_p( 3,i,j,k)*ps%r(i-1,j  ,k+1)
     &           +a_p( 4,i,j,k)*ps%r(i  ,j  ,k+1)
     &           +a_p( 5,i,j,k)*ps%t(i  ,j-1,k  )
     &           +a_p( 6,i,j,k)*ps%t(i  ,j  ,k  )
     &           +a_p( 7,i,j,k)*ps%t(i  ,j-1,k+1)
     &           +a_p( 8,i,j,k)*ps%t(i  ,j  ,k+1)
     &           +a_p( 9,i,j,k)*ps%p(i  ,j  ,k-1)
     &           +a_p(10,i,j,k)*ps%p(i  ,j-1,k  )
     &           +a_p(11,i,j,k)*ps%p(i-1,j  ,k  )
     &           +a_p(12,i,j,k)*ps%p(i  ,j  ,k  )
     &           +a_p(13,i,j,k)*ps%p(i+1,j  ,k  )
     &           +a_p(14,i,j,k)*ps%p(i  ,j+1,k  )
     &           +a_p(15,i,j,k)*ps%p(i  ,j  ,k+1)
          enddo
        enddo
      enddo
!$acc end parallel

Hi Ron,

From the compiler feedback messages, how is the compiler scheduling these loops? My guess is gang on the outer loop, seq on the middle, and vector on the inner.
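
If you don’t have the feedback handy, it is printed at compile time when you add “-Minfo=accel”, e.g. something like (“mysolver.f” is just a stand-in for your source file):

   pgfortran -acc -ta=tesla -Minfo=accel -c mysolver.f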

I’d recommend trying to collapse the two outer loops using gang and putting vector on the inner loop.
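
For example, on your first loop nest that would look something like the sketch below (the body is unchanged; I have only moved the directives and dropped “private(ii)”, per the note on “ii” further down):

!$acc parallel default(present) present(ps)
!$acc loop collapse(2) gang
      do k=2,npm1
        do j=2,ntm1
!$acc loop vector
          do i=2,nrm-1
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r(4,i,j,k)*ps%r(i,j,k) ! + the other 14 terms
          enddo
        enddo
      enddo
!$acc end parallel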

If “nrm” is small (say < 128), then you might try collapsing the two inner loops, or scheduling the inner loop as “worker”.
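
For the small-“nrm” case, something along these lines (again just a sketch of the first nest, with the body elided):

!$acc loop gang
      do k=2,npm1
!$acc loop vector collapse(2)
        do j=2,ntm1
          do i=2,nrm-1
c           ...same index computation and body as above...
          enddo
        enddo
      enddo

For the “worker” variant you would instead put “!$acc loop worker” on the “j” loop and “!$acc loop vector” on the “i” loop.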

How hard would it be for you to change the layout of the “a_r|t|p” arrays? You might get better data access if you move the first dimension to the last, e.g. “a_r(i,j,k,1)”, since the vectors can then access the data contiguously.
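
In other words, something along these lines (I’m guessing at the kind and extents since the declarations aren’t shown):

c     current layout: coefficient index first
      real(8), dimension(15,nrm,ntm,npm) :: a_r
c     accessed as
      q(ii)=a_r(1,i,j,k)*ps%r(i,j,k-1) ! etc.
c
c     suggested layout: coefficient index last, so "i" is stride-1
      real(8), dimension(nrm,ntm,npm,15) :: a_r
c     accessed as
      q(ii)=a_r(i,j,k,1)*ps%r(i,j,k-1) ! etc.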

Also, unless “ii” is a module variable, you shouldn’t need to privatize it. Putting a scalar in a private clause will create an array of the variable which each thread will need to fetch from global memory. Without private, it should be a local variable and is more likely to be placed in a register.
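
i.e., assuming “ii” is just a local scalar of this routine, the opening directive can simply be

!$acc parallel default(present) present(ps)

rather than

!$acc parallel default(present) present(ps) private(ii)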

-Mat

Hi,

“change the layout of the “a_r|t|p” arrays”

I am in Fortran so the first index is contiguous.

I have tried collapsing the two loops as suggested but the result is nearly the same.

Thanks for the tip about the private scalar!

When IS there a need for private?

“I am in Fortran so the first index is contiguous.”

Yes, but you want it contiguous across the vector loop dimension which is “i”.

“I have tried collapsing the two loops as suggested but the result is nearly the same.”

You might be memory bound, in which case changing the schedule probably won’t matter much. Fixing the “a_” arrays so that “i” is the contiguous dimension would be what I’d try next.

“When IS there a need for private?”

There are a few scenarios:

  • When the scalar has global storage, such as a Fortran module variable or a variable with the “SAVE” attribute (see the sketch below).
  • When the scalar is passed by reference to a device subroutine (the default in Fortran). The preferred fix here is to give the dummy argument the F2003 “value” attribute so the scalar is passed by value.
  • When you have a “live-out”, i.e. the scalar’s value is used on the host after the compute region.
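
A minimal sketch of the first case (the module and routine names here are made up just for illustration):

      module work_mod
        integer :: idx   ! module variable => global storage
      end module

      subroutine scale_q(q,n)
        use work_mod
        integer :: n, i
        real(8) :: q(n)
c       "idx" must be privatized since it is not a local scalar
!$acc parallel loop copy(q) private(idx)
        do i=1,n
          idx=i
          q(idx)=2.0d0*q(idx)
        enddo
      end subroutine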

-Mat