Performance of PGI OpenACC for a matrix-matrix multiplication

gongjing · April 30, 2014, 1:43pm

Hi,

This question is related to two older questions.

The matrix-matrix multiplication looks like this:

do e=1,nel
   do k=1,n
      do j=1,n
         do i=1,n
            tmp = 0.
            do l=1,n
               tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
            enddo
            m(i,j,k,e) = tmp
         enddo
      enddo
   enddo
enddo

where typically n=4-16 and nel=10-1000. We can only obtain a maximum of around 20 GFLOPS using the PGI compilers on a Tesla K20X for n=16 and nel=400.
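As a rough sanity check on the GFLOPS figures, the operation count can be estimated as follows. This assumes every multiply and every add counts as one floating-point operation, which the post does not state explicitly; F and t are symbols introduced here only for illustration.

```latex
% Each m(i,j,k,e) needs n iterations of the l loop,
% with 3 multiplies and 3 adds per iteration.
\begin{align*}
F &= 6\,n \cdot n^{3} \cdot \mathrm{nel} = 6\,n^{4}\,\mathrm{nel} \\
  &= 6 \cdot 16^{4} \cdot 400 = 157{,}286{,}400 \approx 1.57 \times 10^{8}
     \quad (n = 16,\ \mathrm{nel} = 400) \\
t &\approx \frac{1.57 \times 10^{8}}{20 \times 10^{9}\ \mathrm{s^{-1}}}
   \approx 7.9\ \mathrm{ms} \quad \text{per call at 20 GFLOPS}
\end{align*}
```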

!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop gang vector
673:    do k=1,n
!$acc loop vector
675:       do j=1,n
677:          do i=1,n
                 tmp = 0.
679:             do l=1,n
                    tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
                 enddo
                 m(i,j,k,e) = tmp
              enddo
           enddo
        enddo
     enddo
!$acc end kernels

1) In that case, should a “reduction” clause be added to the inner loop?

 !$ACC LOOP REDUCTION(+:tmp)
            do l=1,n  ! serial loop, no reduction

Even with the reduction added, the loop “do l=1,n” is still parallelized:
…
671, Loop is parallelizable
673, Loop is parallelizable
675, Loop is parallelizable
677, Loop is parallelizable

679, Loop is parallelizable
…
(but performance dropped from 20 to 13 GFLOPS)

2) On a Fermi GPU, the kernel reached around 23 GFLOPS. Is there any way to improve the performance on the K20X? I have used the compiler flag (-ta=nvidia,5.0) and tested other OpenACC implementations, such as

 !$acc kernels
    do ;
      do; do;
...
  !$acc parallel loop collapse(4) 
     do;
       do, do

but could not obtain better performance.

Thanks for your assistance.

/Jing

MatColgrove · May 1, 2014, 5:30pm

Hi Jing,

For #1, don’t confuse the message “Loop is parallelizable” with what’s actually scheduled. This message comes from the analysis stage and just means that the loop could be parallelized. Look at the loop schedule messages to see what was actually parallelized.

Given that the values of “n” are so small, I would collapse the k, j, and i loops together and just perform the reduction loop sequentially in the kernel. In the just-released PGI 14.4, we’ve made considerable improvements to loop collapsing, so please consider using this new release.

!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop vector(512) collapse(3)
673:    do k=1,n
675:       do j=1,n
677:          do i=1,n
                 tmp = 0.
679:             do l=1,n
                    tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
                 enddo
                 m(i,j,k,e) = tmp
              enddo
           enddo
        enddo
     enddo
!$acc end kernels

For #2, try targeting compute capability 3.5 (i.e. -ta=tesla:cc35) and use the “INTENT(IN)” attribute on your read-only arrays. In these cases, we attempt to utilize texture memory, which can help considerably for random-access memory patterns such as how you’re using the “D” array.
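To make the two suggestions so far concrete, here is a minimal sketch that wraps the collapsed loop nest in a subroutine and marks the read-only arrays INTENT(IN). The module and subroutine names (local_grad_sketch, local_grad) and the small driver are hypothetical, not taken from the original code. Whether the loads actually go through texture memory depends on the compiler version and flags. Without an OpenACC compiler the directives are just comments, so the driver can also be run serially to check the arithmetic.

```fortran
module local_grad_sketch
  implicit none
contains
  ! Hypothetical wrapper; the name and interface are illustrative only.
  subroutine local_grad(m, u, v, w, D, n, nel)
    integer, intent(in)  :: n, nel
    ! Read-only inputs are declared INTENT(IN), as suggested above.
    real,    intent(in)  :: D(n,n)
    real,    intent(in)  :: u(n,n,n,nel), v(n,n,n,nel), w(n,n,n,nel)
    real,    intent(out) :: m(n,n,n,nel)
    integer :: i, j, k, l, e
    real    :: tmp

    !$acc kernels
    !$acc loop gang
    do e = 1, nel
       !$acc loop vector(512) collapse(3)
       do k = 1, n
          do j = 1, n
             do i = 1, n
                tmp = 0.
                ! The l loop is left undecorated so it runs sequentially per thread.
                do l = 1, n
                   tmp = tmp + D(i,l)*u(l,j,k,e) + D(j,l)*v(i,l,k,e) + D(k,l)*w(i,j,l,e)
                enddo
                m(i,j,k,e) = tmp
             enddo
          enddo
       enddo
    enddo
    !$acc end kernels
  end subroutine local_grad
end module local_grad_sketch

program check_local_grad
  use local_grad_sketch
  implicit none
  integer, parameter :: n = 4, nel = 2
  real :: D(n,n), u(n,n,n,nel), v(n,n,n,nel), w(n,n,n,nel), m(n,n,n,nel)
  integer :: i

  call random_number(u)
  call random_number(v)
  call random_number(w)
  ! Use the identity for D so the expected result is simply u + v + w.
  D = 0.
  do i = 1, n
     D(i,i) = 1.
  enddo

  call local_grad(m, u, v, w, D, n, nel)
  if (maxval(abs(m - (u + v + w))) > 1.e-5) error stop "mismatch"
  print *, "ok"
end program check_local_grad
```

The explicit-shape dummy arguments also give the compiler the array bounds it needs to generate the data movement for the kernels region.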

Some other general performance ideas:

Given the schedule above, you could also help your memory access a bit by using “j” or “k” as the leading dimension of the “u” array instead of “l”.

Another thing to try is disabling RDC (-ta=tesla:nordc), assuming you don’t use the “routine” directive. In 14.4, we’ve added an “unroll” option (enabled by default with -O3), which helps a few codes but can slow down others. It's worth a try, though.

Finally, look at the PTX information for the register usage (-ta=tesla:ptxinfo) and use this information to determine the occupancy. You can then adjust the vector width higher or lower, or use “-ta=tesla:maxregcount:xx” to adjust the total register usage per gang, to see how it affects occupancy and performance.
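As a rough illustration of how register usage and vector width interact, the following sketch estimates occupancy for a few vector widths. It assumes the vector width maps directly to the CUDA thread-block size, and it uses the per-multiprocessor limits for compute capability 3.5 (65536 registers, 2048 resident threads, 16 resident blocks). It is a simplified model that ignores register allocation granularity and shared memory, and the register count of 40 is a hypothetical placeholder for the value reported by ptxinfo.

```fortran
program occupancy_sketch
  implicit none
  ! Per-multiprocessor limits assumed for compute capability 3.5 (K20X).
  integer, parameter :: regs_per_sm        = 65536
  integer, parameter :: max_threads_per_sm = 2048
  integer, parameter :: max_blocks_per_sm  = 16
  integer :: widths(4) = (/ 64, 128, 256, 512 /)
  integer :: regs_per_thread, iw

  ! Hypothetical value; substitute the register count from the ptxinfo output.
  regs_per_thread = 40

  do iw = 1, size(widths)
     print '(a,i4,a,f6.1,a)', 'vector width ', widths(iw), ': ', &
          100.0 * occupancy(widths(iw), regs_per_thread), ' % occupancy'
  enddo

contains

  ! Fraction of the maximum resident threads per SM that can be active.
  ! Simplified: ignores register allocation granularity and shared memory.
  real function occupancy(block_size, regs)
    integer, intent(in) :: block_size, regs
    integer :: blocks
    blocks = min(max_blocks_per_sm, &
                 max_threads_per_sm / block_size, &
                 regs_per_sm / (regs * block_size))
    occupancy = real(blocks * block_size) / real(max_threads_per_sm)
  end function occupancy
end program occupancy_sketch
```

Higher occupancy does not guarantee better performance, so treat the estimate only as a guide for which vector widths and maxregcount values are worth timing.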

Mat

gongjing · May 1, 2014, 9:07pm

Hi Mat,

Thanks for your valuable input.

Now the code is changed to

!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop vector(16) collapse(3)
do ...

With PGI 14.2 (13.10 was used previously) and the flag -ta=tesla:cc35, 40 GFLOPS can be obtained for nel=400 and n=16. That is indeed a big improvement.

I will try the other suggestions, but it seems “-ta=tesla:nordc” does not work for this code. It fails with the error

Accelerator Fatal Error: No CUDA device code available
File: math.f
Function: zero:1388

But the “zero” subroutine is simply

      subroutine zero(a,n)
      DIMENSION  A(n)
!$ACC DATA PRESENT(A(1:n))
!$ACC PARALLEL LOOP
      DO I = 1, N
         A(I) = 0.0
      ENDDO
!$ACC END DATA
      END

Thanks again.

/Jing
