Why does large-scale DGEMM parallelization behave strangely?

Hi, I am working on a program that uses DGEMM for matrix multiplication. The compiler I am using is pgi 707/pgf77. In this program the call to the subroutine DGEMM has already been parallelized:

C$OMP Parallel
C$OMP Single
C$      NP=omp_get_num_threads()
C$      MinCoW=16
C$OMP End Single
C$OMP End Parallel
          ColPW = Max((N+NP-1)/NP,MinCoW)
          NWork = (N+ColPW-1)/ColPW        !...N is the number of columns of C(M,N).
          If(XStr2.eq.'T'.or.XStr2.eq.'C') then
            IncB = 1
          Else
            IncB = LDB
          End If
          IncB = IncB*ColPW
          IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
              XN = Min(N-IP*ColPW,ColPW)
              Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
100      Continue
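
To check how the loop above splits the columns for a given thread count, the same integer arithmetic can be pulled out into a small routine. This is only a sketch: the name ColSplit and the output argument LastN are made up for illustration and are not part of the original program. It assumes nothing beyond the ColPW and NWork formulas already shown.

```fortran
      Subroutine ColSplit(N, NP, MinCoW, ColPW, NWork, LastN)
C     Hypothetical helper, not part of the original program.
C     Repeats the integer arithmetic of the posted loop for one
C     combination of column count N and thread count NP.
      Implicit None
      Integer N, NP, MinCoW, ColPW, NWork, LastN
C     Columns per work item: ceiling of N/NP, but at least MinCoW.
      ColPW = Max((N+NP-1)/NP, MinCoW)
C     Number of DGEMM calls needed to cover all N columns.
      NWork = (N+ColPW-1)/ColPW
C     Width of the last block, which may be narrower than ColPW.
      LastN = Min(N-(NWork-1)*ColPW, ColPW)
      Return
      End
```

For example, Call ColSplit(3432, 8, 16, ColPW, NWork, LastN) returns ColPW = 429, NWork = 8 and LastN = 429, so with 8 threads every DGEMM call gets the same number of columns. With NP = 7 it returns ColPW = 491, NWork = 7 and LastN = 486, and for N = 924 with NP = 8 it returns ColPW = 116, NWork = 8 and LastN = 112.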

The command I use to compile this code and link it against the BLAS library is:

pgf77 -i8 '-mcmodel=medium' -mp -O2 -tp p7-64 -Mreentrant -Mrecursive -Mnosave -Minfo -Mneginfo -time -fast -Munroll -Mvect=assoc,recog,cachesize:2097152 -o xgemm.exe xgemm.o $gdvroot/bsd/libf77blas-em64t.a $gdvroot/bsd/libatlas-em64t.a -lpthread -lm -lc

Now the problem is: when I run the matrix multiplication jobs in parallel (the matrices are 3432x3432), the speedup is perfect up to 7 processors, but with 8 processors it becomes really poor (less than 3 times). However, when I change the size of the matrices, e.g. to 924x924, the speedup on 8 processors is normal again. I tried allocating more memory for the 3432x3432 multiplication on 8 processors, but even with 10 GB (the limit of our hardware) the speedup stays the same. Can anyone here help me? Thank you very much!
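
One way to narrow a problem like this down is to time the column-blocked loop on its own for each thread count. The routine below is a minimal sketch of that idea and not the original program: the name TimeMM and all of its arguments are hypothetical, a plain triple loop stands in for DGEMM so that no BLAS library is needed, and it assumes NT is at least 1. It follows the same C$ conditional-compilation convention as the code in the question.

```fortran
      Subroutine TimeMM(N, NT, A, B, C, Secs)
C     Hypothetical timing helper, not from the original program.
C     Computes C = A*B by column blocks, one block per loop
C     iteration, and returns the wall-clock time of the loop.
C     A plain triple loop is used here in place of DGEMM.
      Implicit None
      Integer N, NT
      Double Precision A(N,N), B(N,N), C(N,N), Secs
      Integer ColPW, NWork, IP, I, J, K, J0, J1
      Double Precision T0, T1
C     Lines starting with C$ are compiled only with OpenMP enabled.
C$    Double Precision omp_get_wtime
      T0 = 0.0D0
      T1 = 0.0D0
C$    Call omp_set_num_threads(NT)
C     Same partition formulas as in the question, minimum width 1.
      ColPW = Max((N+NT-1)/NT, 1)
      NWork = (N+ColPW-1)/ColPW
C$    T0 = omp_get_wtime()
C$OMP Parallel Do Default(Shared) Schedule(Static,1)
C$OMP& Private(IP,I,J,K,J0,J1)
      Do 30 IP = 0, NWork-1
C        First and last column of this block.
         J0 = IP*ColPW + 1
         J1 = Min(N, J0+ColPW-1)
         Do 20 J = J0, J1
            Do 5 I = 1, N
               C(I,J) = 0.0D0
    5       Continue
            Do 15 K = 1, N
               Do 10 I = 1, N
                  C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10          Continue
   15       Continue
   20    Continue
   30 Continue
C$OMP End Parallel Do
C$    T1 = omp_get_wtime()
C     Without OpenMP both timestamps stay 0, so Secs is 0.
      Secs = T1 - T0
      Return
      End
```

Calling it on the same matrices for NT = 1 through 8 and dividing the NT = 1 time by each result gives the speedup per thread count. It has to be compiled with OpenMP enabled (-mp in the command above) for the timing lines to be active.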


Did you try with our latest release? Can you please try it and let us know if there is still a problem? There might be a performance bug in our OpenMP runtime that has been fixed in the latest release.


Hi, thank you for your advice. Since our group doesn’t have a license for the latest 7.2.x version, I tried the library from 7.1.6. It works all right now. Thank you.