Why library based DGEMM slower than DGEMM code itself?

Sharp · November 20, 2008, 9:04pm

Hi, I have developed the following code to calculate matrix multiplication.

      do i=1,n
        do j=1,n
          XA(j,i)=Zero
          XCa(j,i)=Zero
          XMA(j,i)=Zero
          XMB(j,i)=Zero
        end do
      end do
C
      itmp2=0
      do i=1,n
         XA(i,i)=real(((i*(i-1))/2)+i)
        if(i.eq.(itmp2*68+1))then
          itmp2=itmp2+1
          do j=i+1,n
            l=((j*(j-1))/2)+i
            XA(j,i)=real(l)
            XA(i,j)=XA(j,i)
          end do
        endif
      end do
C
      do i=1,n
        do j=1,n
          XCa(j,i)=One/(Ten**4)
        end do
      end do
C
         call FDate(DayTim)
        write(*,"5X,A") DayTim
C
      Do i=1,25
        call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
        call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
      End do
C
         call FDate(DayTim)
        write(*,"5X,A") DayTim

where the main part of XGEMM is:

        XM = M
        XK = K
        XN = N
        XLDA = LDA
        XLDB = LDB
        XLDC = LDC
        MinCoW = 3
        ....
C$OMP Parallel
C$OMP Single
C$      Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
        ColPW = Max((N+NP-1)/NP,MinCoW)
        NWork = (N+ColPW-1)/ColPW
        IncB = LDB*ColPW
        IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
            XN = Min(N-IP*ColPW,ColPW)
            Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
  100       Continue

As we can see from the first part of the code, matrix XA is a very sparse matrix (>99%, the dimension of the matrices is 30003000). Since in DGEMM code there is a sparse test of matrix XA if the multiplication form is XCaXA or XCa*XA(T), I make both matrix multiplication in such form.

Then the problem arises: when I use the PGI 7.1.6 library to compile this code, to finish 25 times matrix multiplication, the time is about 635 seconds. But the speedup for parallelization is very good: for 8 processors I can get the speedup around 7.

However, when I compile this code without any library (with DGEMM code in my source code), the time used for 25 multiplications is about 55 seconds. But the speedup for parallelization is quite poor: no matter how many processors I have used, the speedup is never over 2.

I am very confused. I thought the library based DGEMM should be much faster than the code itself no matter whether the matrix XA is sparse or not. But now why is it slower? And why the non-library based DGEMM has such a poor speedup? Some one can help me on this? Thank you so much.

MatColgrove · November 21, 2008, 9:55pm

Hi Sharp,

Which DGEMM source and BLAS library are you using? Also, what optimizations are you using to compile and link?

PGI ships two BLAS libraries, “-lblas” and “-lacml”. ACML is AMD’s optimized math library while our generic BLAS library uses the NetLIB source (See: http://www.netlib.org/blas/dgemm.f) and is compiled with “-Kieee”. “-Kieee” disables some high level optimizations in order to adhere to the IEEE 754 standard.

Mat

Sharp · November 24, 2008, 12:59pm

Hi Mat,

I have used the following script to compile:

pgf77 -i8 -mp -O2 -tp p7-64 -Kieee -time -fast -o xgemm.exe xgemm.F -lacml

Where I have added the -Kieee you mentioned. However, compiling in this way, nothing has changed. To finish 25 times matrix multiplication it still takes around 630 seconds.

I use the same DGEMM code you showed me in my source code for the non-library based version. And still this version uses 55 seconds to finish 25 times matrix multiplication. But I assume since it is non-library based, the speedup for parallelization is quite poor.

I am still confused about this. Using library should be much better than using code only, right?

Sharp

MatColgrove · November 24, 2008, 5:48pm

Hi Sharp,

I am still confused about this. Using library should be much better than using code only, right?

Since ACML is AMD’s math kernel library and tuned for AMD chips, I would expect some performance penalty when running on an Intel based system. Though 25 times slower seems a bit much. You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. Otherwise, I would contact AMD.

I would be interested to see how the performance changes with Intel’s MKL or even with GOTOBlas.

Mat

Sharp · November 25, 2008, 4:00pm

mkcolg:

Hi Sharp,

You can try downloading a later version (See: > HERE> ) to see if they have done more tuning for Intel chips.

Hi Mat,

Since the program I am dealing with is not compiled by MKL or GOTOBlas, in order to simulate the program, I have to compile my stand-alone code by using the same kind of library–PGI libraries. I haven’t tried the MKL library yet. I have downloaded the latest acml library (version 4.2.0) and compiled my own code by using pgf77 (version 7.1.6). To finish 2 times of matrix multiplication the new library based code takes 63 seconds while the non-library based code takes less than 5 seconds. So the performance doesn’t change much.

Because one of the matrices used for multiplication is quite sparse, and I found in the DGEMM code there is a zero test on one matrix, i.e.:
C
C           Form  C := alpha*A*B + beta*C.
C
            DO 90, J = 1, N
               ........
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE
To use the DGEMM code directly, this zero test seems work quite well, but when the library is applied (no matter what version is used), it seems this zero test doesn’t work at all. However, it seems this zero test breaks the speedup for the OpenMP parallelization of the DGEMM code.

I really hope this problem can be solved. Thank you very much.

Sharp

MatColgrove · November 25, 2008, 6:19pm

Since we don’t control the ACML source, you’d need to contact AMD about the poor performance.

If you want to send your example program to trs@pgroup.com and ask PGI customer support to forward it to me, I can take a look at the lack of parallel speed-up. I can also run the code on an AMD system to see if the problem with ACML is general or specific to Core2.

Mat

Sharp · December 8, 2008, 4:31pm

Hi Mat, I wonder is there any updating about the library based DGEMM? Also, I have read through my non-library based DGEMM code for many times, I still don’t know why the speed up performance is so poor. Do you have any idea about this? Thank you very much.

Sharp

MatColgrove · December 8, 2008, 9:34pm

Hi Sharp,

I’ve looked at but nothing stands out. I had to put it aside for a bit but will try and look at it again later this week.

Mat

Sharp · December 16, 2008, 5:36pm

Hi Mat,

The speedup of the non-library based code is quite poor. I thought it is because of the compiler. So I tried 3 different compilers: pgf77, pgf90, and intel compiler ifort. to compile the same non-library based DGEMM code. The results are as follows:

ifort:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.39      1.72      2.00     2.30     2.45     2.45     2.53

pgf77:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.55      1.74      2.14    2.08     2.28     2.38    2.53

pgf90:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.55      1.74     2.03      2.08     2.48     2.16      2.28

I still don’t understand what is the reason causes this poor speedup.

However, when I use library based DGEMM, although it takes 10 times more time to finish a matrix multiplication, the speedup is perfect (a linear relationship). It seems we have a trade-off here. Then I wonder is there a existing library can achieve both fast serial calculation and good speedup? Thank you very much.

Hope you have a very good Christmas and Happy New Year!

Sharp

MatColgrove · December 19, 2008, 11:21pm

Hi Sharp,

I profiled the code using oprofile based hardware counters (using the PGI utility pgcollect) and found that the “mp_barrier” routines accounted for 52% of the runtime. I then did an instrumented profile (by compiling with -Mprof=lines) and see that the threads spend a lot of time waiting for thread 0 to finish.

To help with this, instead of dividing the array up into NP chunks where NP is the total number of threads, I divided it up into smaller NP*16 chunks. This way the threads aren’t waiting and can do useful work if one thread hits a bottleneck (like memory). Profiling the modified code shows that “mp_barrier” is now less than 18% of the overall time.

On my 2 socket penryn system, the original code took 56 seconds with 1 thread and 26 seconds with. With the modified code the time dropped to 19 seconds. Not great, but better.

As for other libraries, you can try ATLAS or GOTOBlas. I’ve run out of time for today, otherwise I’d try them myself.

Mat

Sharp · December 22, 2008, 7:52pm

Hi Mat, I have already tried to reduce the size of the chunks. For example, I changed the line of setting the size of chunks from:

        ColPW = Max((N+NP-1)/NP,MinCoW)

to

          ColPW = Min((N+NP-1)/NP,(NP*MinCoW))

also, since the chunk size is smaller now, I changed the Schedule(Static,1) to Schedule(Dynamic,1).

However, the speedup for 2 processors is much better (close to 2, about 1.8 ish), but once the number of processors goes over 2, the speedup is still very poor, e.g. for 8 processors, the speedup is about 3.73. This is still not as good as the library based DGEMM. Since the reason for this seems not come from the compiler, it might be because my code is not efficient enough. Do you think I simply try out a better chunk size will work?

I also tried to parallel DGEMM, by using the same size of the chunks, directly without using the stub XGEMM. It shows similar results.

Since you said

the threads spend a lot of time waiting for thread 0 to finish

do you think if I put a Nowait at the end of the parallel region will work?

For other libraries, I will let you know the results after I have tired them. Thank you very much.

Hope you have a very good Christmas and New Year!

Sharp

MatColgrove · December 23, 2008, 12:14am

Hi Sharp,

Do you think I simply try out a better chunk size will work?

It helps but not by much. Unfortunately, after two hours of tinkering with things (like you I tried dynamic scheduling as well as guided, and nowait), reducing the chuck size was the only thing I saw that has some improvement.

One possibility is that the overhead of the thread creation and management is higher that the gain of the parallelization. Given that the total time of DGEMM is less than a second, it would make sense that the overhead would reduce the parallel speed-up. However when I increase the problem size, the speed-up didn’t change (assuming a fix overhead cost). So, I haven’t been able to prove this is the problem.

Hope you have a very good Christmas and New Year!

Thank you and I hope you also have a very good Christmas and New Year. Portland is currently under a foot of snow and ice so it definately a very white Christmas!

Mat

Sharp · January 12, 2009, 6:36pm

Hi Mat,

Happy New Year! I hope you had a very good holiday.

Now back to our problem we had before. I have tried to enlarge the sizes of the matrices too to see whether the performance changes (with NP*16 chunk and dynamic). But the result of speedup is about the same.

Also, I just found out that we don’t have ATLAS or GOTOBlas installed. So I can’t tell you the result of the performance. But I think it will be similar to ACML.

After trying many ways to solve the performance problem, it seems the best way is to use library. However, because the library based DGEMM performs slower than the non-library version (one of the matrices is sparse), I ran a search online to see whether there is a solution for this. Then I found there are some sparse BLAS routines. But it seems these sparse BLAS routines are supported by MKL libraries. I wonder is there any sparse BLAS routines that is supported by PGI libraries? Thank you very much.

Once again, Happy New Year!

Sharp

MatColgrove · January 12, 2009, 7:08pm

Hi Sharp,

I don’t believe that the NetLib BLAS or AMD ACML libraries that we ship with our product contain any Sparse routines. Though, you should be able to link PGI compiled code with the MKL libraries. There are also several free libraries available including one from the Trilinos product (http://trilinos.sandia.gov/index.html) which I’ve been told works well with the PGI compilers.

Mat

Topic		Replies	Views
Any one can tell me the reason for this performance problem? Legacy PGI Compilers	2	16734	March 27, 2009
How can I make the parallel work? Legacy PGI Compilers	3	4738	March 5, 2009
Why large scale DGEMM parallelization appears strange? Legacy PGI Compilers	2	16366	September 5, 2008
my speedy SGEMM CUDA Programming and Performance	91	276266	May 29, 2013
PGI Acc: Matrix-matrix-multiplication Legacy PGI Compilers	3	5217	September 10, 2010
Fortran with OpenMP almost no speedup Legacy PGI Compilers	15	12692	August 20, 2014
Slow matmul despite optimization flags used Legacy PGI Compilers	3	7422	January 10, 2006
Fast gemv() CUDA Programming and Performance	40	49459	March 22, 2010
Cuda SGEMM same speed as APPLE veclibs ? CUDA Programming and Performance	8	10687	May 8, 2008
Reasonable timing with Cublas dgemm and sgemm CUDA Programming and Performance	15	4376	January 14, 2010

Why library based DGEMM slower than DGEMM code itself?

Related topics