Many matrix-vector multiplications at one time

I want to do matrix-vector multiplication.

I have 32 matrices S (each 512x512) and 32 vectors A (each of length 512). I want to do all 32 multiplications at one time.

Right now I use a for loop:

for(k=0;k<32;k++){
     cublasZgemv(handle,CUBLAS_OP_N,512,512,&alpha,d_S+k*512*512,NRM,d_A+512*k,1,&belta,d_B+512*k,1);
}

It takes 8.99 ms. I call cublasZgemv 32 times, one after another. That was stupid.

Another way is to write my own kernel:

__global__ void mv(cuDoubleComplex *S,cuDoubleComplex *A ,cuDoubleComplex *B, int n, int l)
{
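     int i = blockIdx.x*blockDim.x + threadIdx.x;   // row index within one matrix
     int j = blockIdx.y*blockDim.y + threadIdx.y;   // index of the matrix/vector pair
     cuDoubleComplex Z = make_cuDoubleComplex(0.0, 0.0);   // accumulator must start at zero
     cuDoubleComplex Temp;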
     for(int k = 0;k<n;k++){
          Temp = cuCmul(S[j*n*n+i*n+k],A[j*n+k]);
          Z = cuCadd(Z,Temp);
     }
     B[j*n+i] = Z;
}

It takes 3.8 ms. The summation loop is very slow, but that's the best I could do.

Both of them give me the correct result, but I'm not satisfied with the speed!
Even MKL + OpenMP is faster than that.

A single cublasZgemv call is very fast, but I want to run all 32 cublasZgemv calls at one time (or parallelize them some other way).

Would streams work?

Thanks!!!

Streams work!!

It only takes 1.58 ms now!!!
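
For anyone who finds this later, here is a minimal sketch of what the stream version might look like. It is not my exact code. It assumes the matrices are stored column-major and back to back, with leading dimension n (my loop above used NRM there). The function name streamed_gemv_max_error, the test data and the CPU check are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuComplex.h>

// Hypothetical helper: runs `batch` independent n x n Zgemv calls, one per
// CUDA stream, and returns the max abs error against a CPU reference.
double streamed_gemv_max_error(int n, int batch)
{
    const size_t matElems = (size_t)n * n;
    std::vector<cuDoubleComplex> hS(matElems * batch);
    std::vector<cuDoubleComplex> hA((size_t)n * batch), hB((size_t)n * batch);

    // Deterministic test data (illustrative only).
    for (size_t idx = 0; idx < hS.size(); ++idx)
        hS[idx] = make_cuDoubleComplex((double)(idx % 7) - 3.0, (double)(idx % 5) - 2.0);
    for (size_t idx = 0; idx < hA.size(); ++idx)
        hA[idx] = make_cuDoubleComplex((double)(idx % 3) - 1.0, (double)(idx % 4) - 1.5);

    cuDoubleComplex *dS, *dA, *dB;
    cudaMalloc((void **)&dS, hS.size() * sizeof(cuDoubleComplex));
    cudaMalloc((void **)&dA, hA.size() * sizeof(cuDoubleComplex));
    cudaMalloc((void **)&dB, hB.size() * sizeof(cuDoubleComplex));
    cudaMemcpy(dS, hS.data(), hS.size() * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One stream per multiplication, so the small gemv kernels can overlap.
    std::vector<cudaStream_t> streams(batch);
    for (int k = 0; k < batch; ++k)
        cudaStreamCreate(&streams[k]);

    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    for (int k = 0; k < batch; ++k) {
        // Route the next cuBLAS call to stream k; the call returns immediately.
        cublasSetStream(handle, streams[k]);
        cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                    dS + k * matElems, n,
                    dA + (size_t)k * n, 1, &beta,
                    dB + (size_t)k * n, 1);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    cudaMemcpy(hB.data(), dB, hB.size() * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);

    // CPU reference, column-major: B[i] = sum_j S[i + j*n] * A[j].
    double maxErr = 0.0;
    for (int k = 0; k < batch; ++k)
        for (int i = 0; i < n; ++i) {
            cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
            for (int j = 0; j < n; ++j)
                acc = cuCadd(acc, cuCmul(hS[k * matElems + i + (size_t)j * n],
                                         hA[(size_t)k * n + j]));
            maxErr = fmax(maxErr, cuCabs(cuCsub(acc, hB[(size_t)k * n + i])));
        }

    for (int k = 0; k < batch; ++k)
        cudaStreamDestroy(streams[k]);
    cublasDestroy(handle);
    cudaFree(dS); cudaFree(dA); cudaFree(dB);
    return maxErr;
}

int main()
{
    double err = streamed_gemv_max_error(512, 32);
    printf("max abs error vs CPU reference: %g\n", err);
    return err < 1e-9 ? 0 : 1;
}
```

Something like nvcc streams.cu -lcublas should build it. How much the calls really overlap depends on the GPU, so the timing will probably differ from my 1.58 ms.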