Many matrix-vector multiplications at one time

I want to do matrix-vector multiplication.

I have 32 matrices S (512 x 512 each) and 32 vectors A (length 512 each), all double complex. I want to do all 32 multiplications B = S*A at one time.

My first attempt was a for loop that calls cublasZgemv 32 times, once per matrix-vector pair.

It takes 8.99 ms. That was a stupid way to do it.
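For reference, the loop described above might look roughly like the sketch below. The helper name `zgemv_loop` and its signature are hypothetical, and `CUBLAS_OP_T` is an assumption: it is chosen so that cuBLAS (which is column-major) treats each S as row-major. Error checking is left out.

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// Hypothetical helper, not from the original post.
// Computes B[j] = S[j] * A[j] for j = 0..l-1 with one cublasZgemv call per pair.
// d_S holds l matrices of size n x n back to back, d_A and d_B hold l vectors of length n.
void zgemv_loop(cublasHandle_t handle,
                const cuDoubleComplex *d_S,
                const cuDoubleComplex *d_A,
                cuDoubleComplex *d_B,
                int n, int l)
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    for (int j = 0; j < l; ++j) {
        // Assumption: S is stored row-major, so ask cuBLAS for the transpose
        // of what it sees as a column-major matrix.
        cublasZgemv(handle, CUBLAS_OP_T, n, n,
                    &one,
                    d_S + (size_t)j * n * n, n,
                    d_A + (size_t)j * n, 1,
                    &zero,
                    d_B + (size_t)j * n, 1);
    }
}
```

All 32 calls go to the same stream here, so they run one after another on the GPU.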

Another way: I wrote my own kernel to do it.

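#include <cuComplex.h>

__global__ void mv(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
     int i = blockIdx.x*blockDim.x + threadIdx.x;   // row index inside one matrix
     int j = blockIdx.y*blockDim.y + threadIdx.y;   // which matrix-vector pair (0..l-1)
     if (i >= n || j >= l) return;                  // skip threads outside the problem
     cuDoubleComplex Z = make_cuDoubleComplex(0.0, 0.0);   // accumulator must start at zero
     cuDoubleComplex Temp;
     for(int k = 0;k<n;k++){
          Temp = cuCmul(S[j*n*n+i*n+k],A[j*n+k]);   // S[j] is stored row-major
          Z = cuCadd(Z,Temp);
     }
     B[j*n+i] = Z;
}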

It takes 3.8 ms. The summation is very slow, because each thread adds up all n products of its row one after another. But that's the best I could do.

Both of them give me the correct result, but I am not satisfied with that speed!!!
Even MKL + OpenMP is faster than that.

A single cublasZgemv call is very fast, but I want to run 32 cublasZgemv calls at one time (or parallelize them some other way).

Would streams work?

Streams work!!

It only takes 1.58ms!!!
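A minimal sketch of the streams approach, for anyone who finds this later. It is not the exact code behind the timing above. The number of streams (8), the all-ones test data and `CUBLAS_OP_T` (to match a row-major S) are assumptions, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 512;          // matrix dimension, from the post
    const int l = 32;           // number of matrix-vector pairs, from the post
    const int num_streams = 8;  // assumption: the post does not say how many streams

    // Host data: all ones, only so the program has something to multiply.
    std::vector<cuDoubleComplex> h_S((size_t)l * n * n, make_cuDoubleComplex(1.0, 0.0));
    std::vector<cuDoubleComplex> h_A((size_t)l * n, make_cuDoubleComplex(1.0, 0.0));
    std::vector<cuDoubleComplex> h_B((size_t)l * n);

    cuDoubleComplex *d_S, *d_A, *d_B;
    cudaMalloc(&d_S, h_S.size() * sizeof(cuDoubleComplex));
    cudaMalloc(&d_A, h_A.size() * sizeof(cuDoubleComplex));
    cudaMalloc(&d_B, h_B.size() * sizeof(cuDoubleComplex));
    cudaMemcpy(d_S, h_S.data(), h_S.size() * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; ++s)
        cudaStreamCreate(&streams[s]);

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    for (int j = 0; j < l; ++j) {
        // Route the next cuBLAS call to one of the streams (round-robin),
        // so independent gemv calls are allowed to overlap on the GPU.
        cublasSetStream(handle, streams[j % num_streams]);
        cublasZgemv(handle, CUBLAS_OP_T, n, n,
                    &one,
                    d_S + (size_t)j * n * n, n,
                    d_A + (size_t)j * n, 1,
                    &zero,
                    d_B + (size_t)j * n, 1);
    }
    cudaDeviceSynchronize();    // wait for the work on all streams

    cudaMemcpy(h_B.data(), d_B, h_B.size() * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
    printf("B[0] = (%f, %f)\n", cuCreal(h_B[0]), cuCimag(h_B[0]));

    for (int s = 0; s < num_streams; ++s)
        cudaStreamDestroy(streams[s]);
    cublasDestroy(handle);
    cudaFree(d_S);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}
```

How much the calls really overlap depends on the GPU and on how many resources each gemv uses, so the stream count is worth tuning on your own hardware.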