Hi, I have developed the following code to calculate matrix multiplication.

```
do i=1,n
do j=1,n
XA(j,i)=Zero
XCa(j,i)=Zero
XMA(j,i)=Zero
XMB(j,i)=Zero
end do
end do
C
itmp2=0
do i=1,n
XA(i,i)=real(((i*(i-1))/2)+i)
if(i.eq.(itmp2*68+1))then
itmp2=itmp2+1
do j=i+1,n
l=((j*(j-1))/2)+i
XA(j,i)=real(l)
XA(i,j)=XA(j,i)
end do
endif
end do
C
do i=1,n
do j=1,n
XCa(j,i)=One/(Ten**4)
end do
end do
C
call FDate(DayTim)
write(*,"5X,A") DayTim
C
Do i=1,25
call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
End do
C
call FDate(DayTim)
write(*,"5X,A") DayTim
```

where the main part of XGEMM is:

```
XM = M
XK = K
XN = N
XLDA = LDA
XLDB = LDB
XLDC = LDC
MinCoW = 3
....
C$OMP Parallel
C$OMP Single
C$ Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
ColPW = Max((N+NP-1)/NP,MinCoW)
NWork = (N+ColPW-1)/ColPW
IncB = LDB*ColPW
IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
Do 100 IP = 0, (NWork-1)
XN = Min(N-IP*ColPW,ColPW)
Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
$ XLDB,Beta,C(1+IP*IncC),XLDC)
100 Continue
```

As we can see from the first part of the code, matrix XA is a very sparse matrix (>99%, the dimension of the matrices is 3000*3000). Since in DGEMM code there is a sparse test of matrix XA if the multiplication form is XCa*XA or XCa*XA(T), I make both matrix multiplication in such form.

Then the problem arises: when I use the **PGI 7.1.6 library** to compile this code, to **finish 25 times matrix multiplication**, the time is about **635 seconds**. But the speedup for parallelization is very good: for 8 processors I can get the speedup around 7.

However, when I compile this code **without any library** (with DGEMM code in my source code), the time used for **25 multiplications** is about **55 seconds**. But the speedup for parallelization is quite poor: no matter how many processors I have used, the speedup is never over 2.

I am very confused. I thought the library based DGEMM should be much faster than the code itself no matter whether the matrix XA is sparse or not. But now why is it slower? And why the non-library based DGEMM has such a poor speedup? Some one can help me on this? Thank you so much.