Slow MATMUL despite optimization flags

Hi,

I don’t understand why PGI’s MATMUL implementation is so slow. The following program
compares three ways of computing the same 2x2 matrix-vector products. Results:

Explicit: 0.013
Dot_prod. 0.604
Matmul: 56.454

Compilation line:

pgf95 -fast -Mipa=fast -Mvect=nosse -O3 -tp=p6 matmul1.f90

   program Test_intr
   implicit NONE

      integer :: I
      real    :: T0, T1
      integer, parameter :: S = 2, N = 10000000
      double precision :: A(s,s), B(s,s)
      double precision :: X(s), Y(s), Z(s)

     A = reshape((/1,2,3,4/), (/s,s/))
     B = reshape((/6,7,8,9/), (/s,s/))
     X = (/ 1.1d0, 2.2d0 /)
     Y = (/ -7d0, 12d0 /)

     call cpu_time(T0)
     do i = -N, N
! Explicit scalar expressions for Z = A*X - B*Y
       Z(1)=A(1,1)*X(1)+A(1,s)*X(s) -B(1,1)*Y(1)-B(1,s)*Y(s)
       a(s,s) = i  !  Against opt.
       Z(s)=A(s,1)*X(1)+A(s,s)*X(s) -B(s,1)*Y(1)-B(s,s)*Y(s)
     end do
     call cpu_time(T1)
     print "(' Explicit:', F8.3)" , T1 - T0

     call cpu_time(T0)
     do i = -N, N
! The same computation written with DOT_PRODUCT
       Z(1) = dot_product(A(1,:), X) - dot_product(B(1,:), Y)
       a(s,s) = i  !  Against opt.
       Z(s) = dot_product(A(s,:), X) - dot_product(B(s,:), Y)
     end do
     call cpu_time(T1)
     print "(' Dot_prod.', F8.3)" , T1 - T0

     call cpu_time(T0)
     do i = -N, N
! The same computation written with MATMUL
       Z = MATMUL(A, X) - MATMUL(B, Y)
       a(s,s) = i  !  Against opt.
     end do
     call cpu_time(T1)
     print "(' Matmul:  ', F8.3)" , T1 - T0
   end program Test_intr

Regards,
Jamie

Hi Jamie,

For a small, simple 2x2 matrix, your first two loops will give you a performance boost. Since MATMUL needs to accommodate operands of any shape and size, it carries a fixed per-call overhead, and on a 2x2 operand that overhead dominates the actual arithmetic. As the matrices grow larger, the performance difference becomes much smaller. Also, MATMUL is much easier to use and more flexible: do you really want to rewrite your code every time the shape of a matrix changes?
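
If you want to check this yourself, here is a rough sketch along the same lines as your program (the matrix size n = 500, the repetition count, the random data, and the program name are arbitrary choices of mine for illustration, not taken from your code): it times a hand-coded triple loop against MATMUL on larger square matrices.

   program matmul_scaling
   implicit NONE

      integer, parameter :: n = 500, reps = 20
      integer :: i, j, k, rep
      real    :: t0, t1
      double precision, allocatable :: A(:,:), B(:,:), C(:,:)

      allocate(A(n,n), B(n,n), C(n,n))
      call random_number(A)
      call random_number(B)

! Hand-coded triple loop, ordered for Fortran's column-major storage
      call cpu_time(t0)
      do rep = 1, reps
        C = 0d0
        do j = 1, n
          do k = 1, n
            do i = 1, n
              C(i,j) = C(i,j) + A(i,k)*B(k,j)
            end do
          end do
        end do
        A(1,1) = A(1,1) + 1d-12  !  Against opt.
      end do
      call cpu_time(t1)
      print "(' Explicit:', F8.3)" , t1 - t0

! The same product with the MATMUL intrinsic
      call cpu_time(t0)
      do rep = 1, reps
        C = matmul(A, B)
        A(1,1) = A(1,1) + 1d-12  !  Against opt.
      end do
      call cpu_time(t1)
      print "(' Matmul:  ', F8.3)" , t1 - t0
   end program matmul_scaling

On sizes like this, the fixed overhead of the MATMUL call is negligible next to the O(n**3) arithmetic, so the two timings should come out much closer than in your 2x2 test.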

  • Mat

Hi Mat,

You are absolutely right. I checked MATMUL's speed for larger arrays and it is fast. Version 6.1 is a big step forward in my opinion: it is the first version that compiles my code without any modifications (Fortran 95 + OpenMP 2.5 + a few common extensions).

Thanks,
Jamie

Thanks, Jamie. A lot of effort went into making our OpenMP implementation 2.5-compliant, so I’m glad to hear that it can be put to good use. You’ll also be glad to know that in the past year The Portland Group has joined the OpenMP ARB (Architecture Review Board) and is helping shape the future of OpenMP.

  • Mat