Performance of PGI OpenACC for a matrix-matrix multiplication

Hi,

This question is related to two older questions:

https://forums.developer.nvidia.com/t/better-performance-on-kepler-k20-than-fermi-nvidia-tesla/133514/1
https://forums.developer.nvidia.com/t/should-use-to-acc-reduction-in-an-inner-loop/133339/1

The matrix-matrix multiplication looks like this:

do e=1,nel
   do k=1,n
   do j=1,n
   do i=1,n
       tmp = 0.
       do l=1,n
          tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
       enddo
       m(i,j,k,e) = tmp
   enddo
   enddo
   enddo
enddo

where typically n = 4-16 and nel = 10-1000. With the PGI compilers on a Tesla K20X, the directives below reach at most about 20 GFLOPS for n=16 and nel=400.
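For reference, the operation count behind such a GFLOPS figure can be sketched as below. This assumes every multiply and every add counts as one floating-point operation, so each pass of the l loop costs 6 operations; the thread does not say how the figures were actually measured, so treat the counting convention as an assumption.

```fortran
program flop_count
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)
  ! Assumption: 3 multiplies + 3 adds per pass of the l loop.
  integer(i8), parameter :: flops_per_l = 6_i8
  integer     :: n, nel
  integer(i8) :: flops
  real(dp)    :: ms_at_20gflops

  n   = 16
  nel = 400

  ! The loop nest runs nel * n**3 outer iterations, each with n passes of l.
  flops = flops_per_l * int(n, i8)**4 * int(nel, i8)

  ! Time one call would take at a sustained 20 GFLOPS, in milliseconds.
  ms_at_20gflops = real(flops, dp) / 20.0e9_dp * 1.0e3_dp

  print '(a,i0)',      'FLOPs per call: ', flops
  print '(a,f6.2,a)',  'Time per call at 20 GFLOPS: ', ms_at_20gflops, ' ms'
end program flop_count
```

For n=16 and nel=400 this gives 157,286,400 operations per call, i.e. roughly 7.86 ms per call at 20 GFLOPS.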

!$acc kernels
!$acc loop gang
do e=1,nel                  ! line 671
!$acc loop gang vector
   do k=1,n                 ! line 673
!$acc loop vector
   do j=1,n                 ! line 675
   do i=1,n                 ! line 677
      tmp = 0.
      do l=1,n              ! line 679
         tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
      enddo
      m(i,j,k,e) = tmp
   enddo
   enddo
   enddo
enddo
!$acc end kernels

1) In that case, should “reduction” be added on the inner loop?

!$ACC LOOP REDUCTION(+:tmp)
      do l=1,n  ! serial loop, no reduction

Even with the reduction added, the loop “do l=1,n” is still parallelized:

671, Loop is parallelizable
673, Loop is parallelizable
675, Loop is parallelizable
677, Loop is parallelizable

679, Loop is parallelizable

(but performance dropped from 20 GFLOPS to 13 GFLOPS)


2) On a Fermi GPU, the kernels reached about 23 GFLOPS. Is there any way to improve the performance on the K20X? I have used the compiler flag (-ta=nvidia,5.0), and other OpenACC formulations have been tested, such as

 !$acc kernels
    do ;
      do; do;
...
  !$acc parallel loop collapse(4) 
     do;
       do, do

but I could not obtain better performance.

Thanks for your assistance.

/Jing

Hi Jing,

For #1, don’t confuse the message “Loop is parallelizable” with what’s actually scheduled. This message comes from the analysis stage and just means that the loop could be parallelized. Look at the loop schedule messages to see what was actually parallelized.

Given that the values of “n” are so small, I would collapse the k, j, and i loops together and just perform the reduction loop sequentially in the kernel. In the just-released PGI 14.4, we’ve made considerable improvements to loop collapsing, so please consider using this new release.

!$acc kernels
!$acc loop gang
do e=1,nel                  ! line 671
!$acc loop vector(512) collapse(3)
   do k=1,n                 ! line 673
   do j=1,n                 ! line 675
   do i=1,n                 ! line 677
      tmp = 0.
      do l=1,n              ! line 679
         tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
      enddo
      m(i,j,k,e) = tmp
   enddo
   enddo
   enddo
enddo
!$acc end kernels

For #2, try targeting compute capability 3.5 (i.e. -ta=tesla:cc35) and use the “INTENT(IN)” attribute on your read-only arrays. In these cases, we attempt to utilize texture memory, which can help considerably for random-access memory patterns such as how you’re using the “D” array.
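As a rough sketch of how the suggestions so far fit together, the loop nest can live in a subroutine whose read-only arrays are declared INTENT(IN), with the l loop marked explicitly sequential. The names (ax_sketch_mod, ax_sketch), the small problem sizes, the data clauses, and the explicit “loop seq” are assumptions made for illustration and are not taken from the code in this thread. Without OpenACC enabled the directives are plain comments, so the host check at the end still runs.

```fortran
module ax_sketch_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical wrapper: D, u, v, w are read-only, m is the result.
  subroutine ax_sketch(n, nel, D, u, v, w, m)
    integer,  intent(in)  :: n, nel
    real(dp), intent(in)  :: D(n,n)
    real(dp), intent(in)  :: u(n,n,n,nel), v(n,n,n,nel), w(n,n,n,nel)
    real(dp), intent(out) :: m(n,n,n,nel)
    real(dp) :: tmp
    integer  :: e, i, j, k, l

    !$acc kernels copyin(D, u, v, w) copyout(m)
    !$acc loop gang
    do e = 1, nel
       !$acc loop vector(512) collapse(3)
       do k = 1, n
          do j = 1, n
             do i = 1, n
                tmp = 0.0_dp
                ! Assumption: state explicitly that l runs inside one thread.
                !$acc loop seq
                do l = 1, n
                   tmp = tmp + D(i,l)*u(l,j,k,e) + D(j,l)*v(i,l,k,e) &
                             + D(k,l)*w(i,j,l,e)
                end do
                m(i,j,k,e) = tmp
             end do
          end do
       end do
    end do
    !$acc end kernels
  end subroutine ax_sketch
end module ax_sketch_mod

program ax_sketch_test
  use ax_sketch_mod
  implicit none
  integer, parameter :: n = 8, nel = 4     ! small, hypothetical test sizes
  real(dp) :: D(n,n)
  real(dp), dimension(n,n,n,nel) :: u, v, w, m
  real(dp) :: ref, maxerr
  integer  :: e, i, j, k

  call random_number(D)
  call random_number(u)
  call random_number(v)
  call random_number(w)

  call ax_sketch(n, nel, D, u, v, w, m)

  ! Host reference: the same three contractions written with dot_product.
  maxerr = 0.0_dp
  do e = 1, nel
     do k = 1, n
        do j = 1, n
           do i = 1, n
              ref = dot_product(D(i,:), u(:,j,k,e)) &
                  + dot_product(D(j,:), v(i,:,k,e)) &
                  + dot_product(D(k,:), w(i,j,:,e))
              maxerr = max(maxerr, abs(m(i,j,k,e) - ref))
           end do
        end do
     end do
  end do

  print *, 'max abs difference vs. host reference:', maxerr
  if (maxerr > 1.0e-10_dp) error stop 'result mismatch'
end program ax_sketch_test
```

Built with something like “pgfortran -acc -ta=tesla:cc35 -Minfo=accel ax_sketch.f90” (the file name is hypothetical), the -Minfo output lists the schedule that was actually chosen for each loop. Whether the loads of D end up going through the texture path is the compiler’s decision and is not guaranteed by this sketch.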

Some other general performance ideas:

Given the schedule above, you could also help your memory access a bit by using “j” or “k” as the leading dimension of the “u” array instead of “l”.

Other things to try are disabling RDC (-ta=tesla:nordc), assuming you don’t use the “routine” directive. In 14.4, we’ve added an “unroll” option (enabled by default with -O3), which helps a few codes but can slow down others. Worth a try though.

Finally, look at the PTX information for the register usage (-ta=tesla:ptxinfo) and use this information to determine the occupancy. You can then adjust the vector width higher or lower, or use “-ta=tesla:maxregcount:xx” to adjust the total register usage per gang, to see how it affects occupancy and performance.
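A back-of-the-envelope version of that occupancy estimate might look like the following. It assumes the compute capability 3.5 per-multiprocessor limits of 65536 registers, 2048 resident threads, and 16 resident blocks, and it assumes that one gang maps to one thread block whose size equals the vector width. The register count of 48 is a made-up placeholder for whatever ptxinfo reports, and the sketch ignores register allocation granularity and shared memory, so it is only a first approximation.

```fortran
program occupancy_sketch
  implicit none
  ! Per-multiprocessor limits assumed for compute capability 3.5.
  integer, parameter :: regs_per_sm     = 65536
  integer, parameter :: threads_per_sm  = 2048
  integer, parameter :: blocks_per_sm   = 16
  ! Hypothetical value; replace with the count reported by ptxinfo.
  integer, parameter :: regs_per_thread = 48
  integer, parameter :: widths(3) = [128, 256, 512]
  integer :: i

  do i = 1, size(widths)
     print '(a,i4,a,f6.1,a)', 'vector width ', widths(i), ': about ', &
          100.0 * occupancy(regs_per_thread, widths(i)), ' % occupancy'
  end do

contains

  ! Fraction of the multiprocessor's thread slots that can be resident
  ! at once, limited by registers, thread slots, and block slots.
  real function occupancy(regs, width)
    integer, intent(in) :: regs, width
    integer :: blocks
    blocks = min(regs_per_sm / (regs * width), &
                 threads_per_sm / width,       &
                 blocks_per_sm)
    occupancy = real(blocks * width) / real(threads_per_sm)
  end function occupancy
end program occupancy_sketch
```

Lowering the register count (for example via maxregcount) or changing the vector width changes which of the three limits binds, which is the trade-off to explore against measured performance.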

- Mat

Hi Mat,

Thanks for your valuable input.

Now the code is changed to

!$acc kernels
!$acc loop gang
do e=1,nel                  ! line 671
!$acc loop vector(16) collapse(3)
do ...

With PGI 14.2 (13.10 was used previously) and the flag -ta=tesla:cc35, 40 GFLOPS can be obtained for nel=400 and n=16, which is indeed a big improvement.

I will try the other suggestions, but “-ta=tesla:nordc” does not seem to work for this code. It fails with the error

Accelerator Fatal Error: No CUDA device code available
File: math.f
Function: zero:1388

But the “zero” subroutine is just

      subroutine zero(a,n)
      DIMENSION  A(n)
!$ACC DATA PRESENT(A(1:n))
!$ACC PARALLEL LOOP
      DO I = 1, N
         A(I) = 0.0
      ENDDO
!$ACC END DATA
      return
      END

Thanks again.

/Jing