Why does large-scale DGEMM parallelization behave strangely?

Sharp · August 28, 2008, 5:38pm

Hi, I am working on a program that uses DGEMM for matrix multiplication. The compiler I am using is pgi 707/pgf77. In this program the call to DGEMM has already been parallelized:

C$OMP Parallel
C$OMP Single
C$      NP=omp_get_num_threads()
C$      MinCoW=16
C$OMP End Single
C$OMP End Parallel
          ColPW = Max((N+NP-1)/NP,MinCoW)
C         N is the number of columns of C(M,N).
          NWork = (N+ColPW-1)/ColPW
          If(XStr2.eq.'T'.or.XStr2.eq.'C') then
            IncB = 1
          else
            IncB = LDB
          endIf
          IncB = IncB*ColPW
          IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
              XN = Min(N-IP*ColPW,ColPW)
              Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
100      Continue
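
To make the column partitioning above concrete, here is a minimal Python sketch (not part of the original program) that reproduces the integer arithmetic for ColPW, NWork and the per-call XN. The function name `column_chunks` is hypothetical, MinCoW=16 is taken from the snippet, and NP is assumed to equal the number of OpenMP threads.

```python
def column_chunks(n, num_threads, min_cols=16):
    """Mirror the Fortran arithmetic that splits the n columns of C.

    Returns (col_per_worker, n_work, widths), where widths[ip] is the
    XN value passed to the ip-th DGEMM call.
    """
    # ColPW = Max((N+NP-1)/NP, MinCoW): ceiling division with a minimum width
    col_per_worker = max((n + num_threads - 1) // num_threads, min_cols)
    # NWork = (N+ColPW-1)/ColPW: number of DGEMM calls in the parallel loop
    n_work = (n + col_per_worker - 1) // col_per_worker
    # XN = Min(N-IP*ColPW, ColPW): the last chunk may be narrower
    widths = [min(n - ip * col_per_worker, col_per_worker)
              for ip in range(n_work)]
    return col_per_worker, n_work, widths


if __name__ == "__main__":
    for n in (3432, 924):
        for num_threads in (7, 8):
            col_pw, n_work, widths = column_chunks(n, num_threads)
            print(n, num_threads, col_pw, n_work, widths[-1])
```

Under these assumptions, N=3432 with 8 threads yields eight equal chunks of 429 columns, so the split itself is balanced in that case.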

The command I use to compile this code and link it against the BLAS library is:

pgf77 -i8 '-mcmodel=medium' -mp -O2 -tp p7-64 -Mreentrant -Mrecursive -Mnosave -Minfo -Mneginfo -time -fast -Munroll -Mvect=assoc,recog,cachesize:2097152 -o xgemm.exe xgemm.o $gdvroot/bsd/libf77blas-em64t.a $gdvroot/bsd/libatlas-em64t.a -lpthread -lm -lc

Now the problem: when I run the matrix multiplication jobs in parallel (the matrices are 3432x3432), the speedup is perfect up to 7 processors, but once the jobs run on 8 processors the speedup becomes really poor (less than 3x). However, when I change the size of the matrices, e.g. to 924x924, the speedup on 8 processors becomes normal. I tried allocating more memory for the 3432x3432 multiplication on 8 processors, but even with 10GB of memory (the limit of our hardware) the speedup is still the same. Can anyone here help me? Thank you very much!
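
As a rough check on the memory question, the sketch below estimates the storage for the three matrices. It assumes that A, B and C are all square N x N arrays of 8-byte double precision values (the element type DGEMM operates on) and does not count any workspace used inside the BLAS library. The helper name `dgemm_footprint_bytes` is hypothetical.

```python
def dgemm_footprint_bytes(n, bytes_per_element=8, num_matrices=3):
    """Bytes needed to hold num_matrices square n x n arrays."""
    return num_matrices * n * n * bytes_per_element


if __name__ == "__main__":
    for n in (3432, 924):
        mib = dgemm_footprint_bytes(n) / 2**20
        print(f"N={n}: about {mib:.1f} MiB for A, B and C")
```

Under these assumptions the 3432x3432 case needs roughly 270 MiB for the three matrices, far below the 10GB limit mentioned above.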

hongyon · August 28, 2008, 5:44pm

Hi,

Did you try our latest release? Could you please try it and let us know whether the problem persists? There might be a performance bug in our OpenMP runtime that has been fixed in the latest release.

Hongyon

Sharp · September 5, 2008, 11:06am

Hi, thank you for your advice. Since our group doesn't have a license for the latest 7.2.x version, I tried the library from 7.1.6. It works fine now. Thank you.

Sharp