Why library based DGEMM slower than DGEMM code itself?

Hi, I have developed the following code to calculate matrix multiplication.

      do i=1,n
        do j=1,n
          XA(j,i)=Zero
          XCa(j,i)=Zero
          XMA(j,i)=Zero
          XMB(j,i)=Zero
        end do
      end do
C
      itmp2=0
      do i=1,n
         XA(i,i)=real(((i*(i-1))/2)+i)
        if(i.eq.(itmp2*68+1))then
          itmp2=itmp2+1
          do j=i+1,n
            l=((j*(j-1))/2)+i
            XA(j,i)=real(l)
            XA(i,j)=XA(j,i)
          end do
        endif
      end do
C
      do i=1,n
        do j=1,n
          XCa(j,i)=One/(Ten**4)
        end do
      end do
C
         call FDate(DayTim)
        write(*,"5X,A") DayTim
C
      Do i=1,25
        call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
        call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
      End do
C
         call FDate(DayTim)
        write(*,"5X,A") DayTim

where the main part of XGEMM is:

        XM = M
        XK = K
        XN = N
        XLDA = LDA
        XLDB = LDB
        XLDC = LDC
        MinCoW = 3
        ....
C$OMP Parallel
C$OMP Single
C$      Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
        ColPW = Max((N+NP-1)/NP,MinCoW)
        NWork = (N+ColPW-1)/ColPW
        IncB = LDB*ColPW
        IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
            XN = Min(N-IP*ColPW,ColPW)
            Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
  100       Continue

As we can see from the first part of the code, matrix XA is a very sparse matrix (>99%, the dimension of the matrices is 30003000). Since in DGEMM code there is a sparse test of matrix XA if the multiplication form is XCaXA or XCa*XA(T), I make both matrix multiplication in such form.

Then the problem arises: when I use the PGI 7.1.6 library to compile this code, to finish 25 times matrix multiplication, the time is about 635 seconds. But the speedup for parallelization is very good: for 8 processors I can get the speedup around 7.

However, when I compile this code without any library (with DGEMM code in my source code), the time used for 25 multiplications is about 55 seconds. But the speedup for parallelization is quite poor: no matter how many processors I have used, the speedup is never over 2.

I am very confused. I thought the library based DGEMM should be much faster than the code itself no matter whether the matrix XA is sparse or not. But now why is it slower? And why the non-library based DGEMM has such a poor speedup? Some one can help me on this? Thank you so much.

Hi Sharp,

Which DGEMM source and BLAS library are you using? Also, what optimizations are you using to compile and link?

PGI ships two BLAS libraries, “-lblas” and “-lacml”. ACML is AMD’s optimized math library while our generic BLAS library uses the NetLIB source (See: http://www.netlib.org/blas/dgemm.f) and is compiled with “-Kieee”. “-Kieee” disables some high level optimizations in order to adhere to the IEEE 754 standard.

  • Mat

Hi Mat,

I have used the following script to compile:

pgf77 -i8 -mp -O2 -tp p7-64 -Kieee -time -fast -o xgemm.exe xgemm.F -lacml

Where I have added the -Kieee you mentioned. However, compiling in this way, nothing has changed. To finish 25 times matrix multiplication it still takes around 630 seconds.

I use the same DGEMM code you showed me in my source code for the non-library based version. And still this version uses 55 seconds to finish 25 times matrix multiplication. But I assume since it is non-library based, the speedup for parallelization is quite poor.

I am still confused about this. Using library should be much better than using code only, right?

Sharp

Hi Sharp,

I am still confused about this. Using library should be much better than using code only, right?

Since ACML is AMD’s math kernel library and tuned for AMD chips, I would expect some performance penalty when running on an Intel based system. Though 25 times slower seems a bit much. You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. Otherwise, I would contact AMD.

I would be interested to see how the performance changes with Intel’s MKL or even with GOTOBlas.

  • Mat

Since we don’t control the ACML source, you’d need to contact AMD about the poor performance.

If you want to send your example program to trs@pgroup.com and ask PGI customer support to forward it to me, I can take a look at the lack of parallel speed-up. I can also run the code on an AMD system to see if the problem with ACML is general or specific to Core2.

  • Mat

Hi Mat, I wonder is there any updating about the library based DGEMM? Also, I have read through my non-library based DGEMM code for many times, I still don’t know why the speed up performance is so poor. Do you have any idea about this? Thank you very much.

Sharp

Hi Sharp,

I’ve looked at but nothing stands out. I had to put it aside for a bit but will try and look at it again later this week.

  • Mat

Hi Mat,

The speedup of the non-library based code is quite poor. I thought it is because of the compiler. So I tried 3 different compilers: pgf77, pgf90, and intel compiler ifort. to compile the same non-library based DGEMM code. The results are as follows:

ifort:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.39      1.72      2.00     2.30     2.45     2.45     2.53

pgf77:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.55      1.74      2.14    2.08     2.28     2.38    2.53

pgf90:

No. of CPU: 1        2        3        4        5        6        7        8
Speedup:    -     1.55      1.74     2.03      2.08     2.48     2.16      2.28

I still don’t understand what is the reason causes this poor speedup.

However, when I use library based DGEMM, although it takes 10 times more time to finish a matrix multiplication, the speedup is perfect (a linear relationship). It seems we have a trade-off here. Then I wonder is there a existing library can achieve both fast serial calculation and good speedup? Thank you very much.

Hope you have a very good Christmas and Happy New Year!

Sharp

Hi Sharp,

I profiled the code using oprofile based hardware counters (using the PGI utility pgcollect) and found that the “mp_barrier” routines accounted for 52% of the runtime. I then did an instrumented profile (by compiling with -Mprof=lines) and see that the threads spend a lot of time waiting for thread 0 to finish.

To help with this, instead of dividing the array up into NP chunks where NP is the total number of threads, I divided it up into smaller NP*16 chunks. This way the threads aren’t waiting and can do useful work if one thread hits a bottleneck (like memory). Profiling the modified code shows that “mp_barrier” is now less than 18% of the overall time.

On my 2 socket penryn system, the original code took 56 seconds with 1 thread and 26 seconds with. With the modified code the time dropped to 19 seconds. Not great, but better.

As for other libraries, you can try ATLAS or GOTOBlas. I’ve run out of time for today, otherwise I’d try them myself.

  • Mat

Hi Mat, I have already tried to reduce the size of the chunks. For example, I changed the line of setting the size of chunks from:

        ColPW = Max((N+NP-1)/NP,MinCoW)

to

          ColPW = Min((N+NP-1)/NP,(NP*MinCoW))

also, since the chunk size is smaller now, I changed the Schedule(Static,1) to Schedule(Dynamic,1).

However, the speedup for 2 processors is much better (close to 2, about 1.8 ish), but once the number of processors goes over 2, the speedup is still very poor, e.g. for 8 processors, the speedup is about 3.73. This is still not as good as the library based DGEMM. Since the reason for this seems not come from the compiler, it might be because my code is not efficient enough. Do you think I simply try out a better chunk size will work?

I also tried to parallel DGEMM, by using the same size of the chunks, directly without using the stub XGEMM. It shows similar results.

Since you said

the threads spend a lot of time waiting for thread 0 to finish

do you think if I put a Nowait at the end of the parallel region will work?

For other libraries, I will let you know the results after I have tired them. Thank you very much.

Hope you have a very good Christmas and New Year!

Sharp

Hi Sharp,

Do you think I simply try out a better chunk size will work?

It helps but not by much. Unfortunately, after two hours of tinkering with things (like you I tried dynamic scheduling as well as guided, and nowait), reducing the chuck size was the only thing I saw that has some improvement.

One possibility is that the overhead of the thread creation and management is higher that the gain of the parallelization. Given that the total time of DGEMM is less than a second, it would make sense that the overhead would reduce the parallel speed-up. However when I increase the problem size, the speed-up didn’t change (assuming a fix overhead cost). So, I haven’t been able to prove this is the problem.

Hope you have a very good Christmas and New Year!

Thank you and I hope you also have a very good Christmas and New Year. Portland is currently under a foot of snow and ice so it definately a very white Christmas!

  • Mat

Hi Mat,

Happy New Year! I hope you had a very good holiday.

Now back to our problem we had before. I have tried to enlarge the sizes of the matrices too to see whether the performance changes (with NP*16 chunk and dynamic). But the result of speedup is about the same.

Also, I just found out that we don’t have ATLAS or GOTOBlas installed. So I can’t tell you the result of the performance. But I think it will be similar to ACML.

After trying many ways to solve the performance problem, it seems the best way is to use library. However, because the library based DGEMM performs slower than the non-library version (one of the matrices is sparse), I ran a search online to see whether there is a solution for this. Then I found there are some sparse BLAS routines. But it seems these sparse BLAS routines are supported by MKL libraries. I wonder is there any sparse BLAS routines that is supported by PGI libraries? Thank you very much.

Once again, Happy New Year!

Sharp

Hi Sharp,

I don’t believe that the NetLib BLAS or AMD ACML libraries that we ship with our product contain any Sparse routines. Though, you should be able to link PGI compiled code with the MKL libraries. There are also several free libraries available including one from the Trilinos product (http://trilinos.sandia.gov/index.html) which I’ve been told works well with the PGI compilers.

  • Mat