I have two types of calculation:
1, Matrix(n,n)=Matrix(n,n)*Vector(n)
2, Matrix(n,n)=Matrix(n,n)*matrix(3,m)

here is the typical code: we assume A and B are two nn matrices, C is 3m matrix, X is a vector.
Type 1:

Do i=1,n
Temp=X(i)
Do j=1,n
A(j,i)=A(j.i)+Temp*B(j,i)
End Do
End Do

Type 2:

Do i=1,m
L1=int(C(1,i))
L2=int(C(2,i))
Temp=C(2,i)
Do j=1,n
A(j,L1)=A(j,L1)+Temp*B(j,L2)
A(j,L2)=A(j,L2)+Temp*B(j,L1)
End Do
End Do

If I write these two types of code by myself, the efficiency is not as good as I expected. So I wonder is there any kind of BLAS routines can carry out these two kinds of calculations? Thank you very much.

The way I tested the efficiency of my code is through number of floating point operations per second.

The code I posted above is another version that does exactly the same work which the non-lib based DGEMM code I used to send you. The only difference between these two versions is the new version (let’s call it vectorized method) does not use matrix multiplication routine because one of the matrices used for matrix multiplication was very sparse. The vectorized version uses a list that stores all the non-zero elements sequentially. And this list is used to multiply the dense matrix.

Theoretically, sine this vectorized method deals with the non-zero elements only and the core calculation (DAXPY calculation) is the same with the DGEMM method, it should be a bit faster than the non-lib based DGEMM method. However, after I added the following code to count the floating point operation for both methods, I found the vectorized method is not efficiency at all: the non-lib based DGEMM method is about 537MFlops though one of the matrices is very sparse, and the vectorized method is about 494MFlops. If rising the size of the matrices, this gap becomes larger. This is why I was looking for some BLAS routine to carry out those types of calculation. I think maybe the BLAS routines can use the computer memory system very well.

Here is the code I added to count the floating pint operation, but I think every one knows how to do this. So I just take the type 1 as an example.

Icont=0
call system_clock(t1)
Do i=1,n
Temp=X(i)
Do j=1,n
Icont=Icont+1
A(j,i)=A(j,i)+Temp*B(j,i)
End Do
End Do
call system_clock(t2,rate)
time = real(t2-t1)/rate
write(*,*)' no of floating point operations:',Icont
write(*,*)' elapsed',time
write(*,*)' MFlops:', real(Icont)/time