Is this a fast vector-matrix multiplication kernel?

What would you change to make it faster?

What kind of problems can it have?

```
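__global__
void vectorMatrixMultiplicationKernel(float* v1, unsigned v1_size, float* M, float* v2, unsigned v2_size)
{
    // Dynamic shared memory; the launch must supply v1_size * sizeof(float) bytes.
    extern __shared__ float sdata[];

    // Each block cooperatively copies all of v1 into shared memory.
    unsigned v1_pos = threadIdx.x;
    while (v1_pos < v1_size) {
        sdata[v1_pos] = v1[v1_pos];
        v1_pos += blockDim.x;
    }
    __syncthreads();

    // One thread per element of the output vector v2.
    unsigned v2_pos = blockIdx.x * blockDim.x + threadIdx.x;
    float result = 0;
    if (v2_pos < v2_size) {
        for (unsigned i = 0; i < v1_size; i++) {
            // Assumes M is row-major with v1_size rows and v2_size columns,
            // so element (i, v2_pos) is M[i * v2_size + v2_pos].
            result += sdata[i] * M[i * v2_size + v2_pos];
        }
        v2[v2_pos] = result;
    }
}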
```
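To check what the kernel should compute, a plain CPU reference is handy. This is only a minimal sketch, and it assumes M is stored row-major with v1_size rows and v2_size columns, so that element (i, j) sits at M[i * v2_size + j]. The function name vectorMatrixMultiplicationReference is made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for v2 = v1 * M (hypothetical helper, not part of the kernel).
// Assumption: M is row-major with v1.size() rows and v2_size columns,
// so element (i, j) is stored at M[i * v2_size + j].
std::vector<float> vectorMatrixMultiplicationReference(
    const std::vector<float>& v1,
    const std::vector<float>& M,
    std::size_t v2_size)
{
    std::vector<float> v2(v2_size, 0.0f);
    for (std::size_t i = 0; i < v1.size(); ++i) {
        // Walk row i of M contiguously and accumulate into every output element.
        for (std::size_t j = 0; j < v2_size; ++j) {
            v2[j] += v1[i] * M[i * v2_size + j];
        }
    }
    return v2;
}
```

Comparing the GPU output against this with a small tolerance (rather than exact equality) is probably safer, since the device compiler may fuse multiply-adds and round slightly differently.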

Thanks

I think you should post this question on the CUDA Programming and Development forum. You will get a better response there!

