I want to do matrix-vector multiplication.

I have 32 matrices S (512×512 each) and 32 vectors A (length 512 each). I want to do all 32 multiplications at once.

First I used a for loop.

```
for (k = 0; k < 32; k++) {
    cublasZgemv(handle, CUBLAS_OP_N, 512, 512, &alpha,
                d_S + k*512*512, NRM, d_A + 512*k, 1,
                &belta, d_B + 512*k, 1);
}
```

It takes 8.99 ms, because cublasZgemv is called 32 times, one call after another. That is clearly wasteful.
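One way the loop above could be overlapped, as a rough sketch only: put each cublasZgemv call on its own CUDA stream so the GPU is free to run several of them concurrently. `gemv_on_streams` and its parameters are hypothetical names, not taken from the original code. The matrices are assumed to be n×n, column-major (as cuBLAS expects) and stored back to back. Whether this beats 8.99 ms would have to be measured.

```cuda
#include <cassert>
#include <vector>
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical helper: d_B[k] = alpha * d_S[k] * d_A[k] + beta * d_B[k],
// one cublasZgemv per stream. Streams are created here only for brevity;
// real code would create them once and reuse them.
cublasStatus_t gemv_on_streams(cublasHandle_t handle,
                               const cuDoubleComplex *alpha,
                               const cuDoubleComplex *d_S,
                               const cuDoubleComplex *d_A,
                               const cuDoubleComplex *beta,
                               cuDoubleComplex *d_B,
                               int n, int batches)
{
    assert(n > 0 && batches > 0);
    std::vector<cudaStream_t> streams(batches);
    for (auto &s : streams) cudaStreamCreate(&s);

    cublasStatus_t status = CUBLAS_STATUS_SUCCESS;
    for (int k = 0; k < batches && status == CUBLAS_STATUS_SUCCESS; k++) {
        cublasSetStream(handle, streams[k]);   // next call goes to stream k
        status = cublasZgemv(handle, CUBLAS_OP_N, n, n, alpha,
                             d_S + (size_t)k * n * n, n,
                             d_A + (size_t)k * n, 1,
                             beta,
                             d_B + (size_t)k * n, 1);
    }

    for (auto &s : streams) {                  // wait for all work, then clean up
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
    cublasSetStream(handle, 0);                // restore the default stream
    return status;
}
```

For the case in this post the call would be `gemv_on_streams(handle, &alpha, d_S, d_A, &belta, d_B, 512, 32)`.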

The other approach I tried was to write my own kernel.

```
__global__ void mv(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;  // row index within one matrix
    int j = blockIdx.y*blockDim.y + threadIdx.y;  // which matrix/vector pair
    // The accumulator must start at zero.
    cuDoubleComplex Z = make_cuDoubleComplex(0.0, 0.0);
    cuDoubleComplex Temp;
    for (int k = 0; k < n; k++) {
        Temp = cuCmul(S[j*n*n + i*n + k], A[j*n + k]);
        Z = cuCadd(Z, Temp);
    }
    B[j*n + i] = Z;
}
```

It takes 3.8 ms. The summation loop is very slow, but that’s the best I could do.
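Part of the slowness may come from the memory access pattern: neighbouring threads (consecutive i) read elements of S that are n entries apart, so the loads are not coalesced. A minimal sketch of one alternative, assuming the same row-major indexing as the kernel above: let one warp compute one output element, so the 32 lanes read consecutive elements of a row and then combine their partial sums with warp shuffles. `mv_warp` and `launch_mv_warp` are hypothetical names, and the 32×8 block shape is an untuned guess.

```cuda
#include <cassert>
#include <cuComplex.h>
#include <cuda_runtime.h>

// One warp computes B[batch*n + row]. Requires blockDim.x == 32 so that
// every warp shares a single (row, batch) pair.
__global__ void mv_warp(const cuDoubleComplex *S, const cuDoubleComplex *A,
                        cuDoubleComplex *B, int n, int batches)
{
    int lane  = threadIdx.x;                            // 0..31
    int row   = blockIdx.x * blockDim.y + threadIdx.y;
    int batch = blockIdx.y;
    if (row >= n || batch >= batches) return;           // whole warp exits together

    const cuDoubleComplex *Srow = S + ((size_t)batch * n + row) * n;
    const cuDoubleComplex *x    = A + (size_t)batch * n;

    // Each lane sums a strided slice of the row: lanes read adjacent elements.
    double re = 0.0, im = 0.0;
    for (int k = lane; k < n; k += 32) {
        cuDoubleComplex p = cuCmul(Srow[k], x[k]);
        re += p.x;
        im += p.y;
    }
    // Tree reduction across the warp; lane 0 ends up with the full sum.
    for (int off = 16; off > 0; off >>= 1) {
        re += __shfl_down_sync(0xffffffffu, re, off);
        im += __shfl_down_sync(0xffffffffu, im, off);
    }
    if (lane == 0) B[(size_t)batch * n + row] = make_cuDoubleComplex(re, im);
}

// Hypothetical launcher: 8 rows per block, one grid row per matrix.
void launch_mv_warp(const cuDoubleComplex *d_S, const cuDoubleComplex *d_A,
                    cuDoubleComplex *d_B, int n, int batches)
{
    assert(n > 0 && batches > 0);
    const int rows_per_block = 8;
    dim3 block(32, rows_per_block);
    dim3 grid((n + rows_per_block - 1) / rows_per_block, batches);
    mv_warp<<<grid, block>>>(d_S, d_A, d_B, n, batches);
}
```

Here that would be `launch_mv_warp(d_S, d_A, d_B, 512, 32)`. The sum is accumulated in a different order than in the sequential loop, so results can differ in the last few bits.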

Both of them give me correct results, but I am not satisfied with the speed.

Even MKL + OpenMP is faster than that.

cublasZgemv itself is very fast, but I want to run all 32 cublasZgemv calls at once (or at least run them in parallel).
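For reference, this is roughly what "32 cublasZgemv at one time" could look like if the installed cuBLAS provides `cublasZgemvStridedBatched` (to my knowledge it appeared around CUDA 11.6, so treat availability as an assumption). `gemv_batched` and its parameters are hypothetical names. The matrices are assumed to be n×n, column-major and stored contiguously, exactly as the for loop above indexes them.

```cuda
#include <cassert>
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical helper: d_B[k] = alpha * d_S[k] * d_A[k] + beta * d_B[k]
// for k in [0, batches), issued as a single cuBLAS call.
cublasStatus_t gemv_batched(cublasHandle_t handle,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *d_S,
                            const cuDoubleComplex *d_A,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *d_B,
                            int n, int batches)
{
    assert(n > 0 && batches > 0);
    const long long strideS = (long long)n * n;  // elements between matrices
    const long long strideV = n;                 // elements between vectors
    return cublasZgemvStridedBatched(handle, CUBLAS_OP_N, n, n, alpha,
                                     d_S, n, strideS,   // lda = n
                                     d_A, 1, strideV,
                                     beta,
                                     d_B, 1, strideV,
                                     batches);
}
```

With the buffers from this post the call would be `gemv_batched(handle, &alpha, d_S, d_A, &belta, d_B, 512, 32)`. On older toolkits, `cublasZgemmStridedBatched` with each vector treated as a 512×1 matrix is a possible substitute.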