Vector-Matrix Multiplication: is this a fast kernel?

Is this a fast Vector-Matrix Multiplication kernel?

What would you change to make it faster?

What kind of problems can it have?

__global__ void vectorMatrixMultiplicationKernel(float* v1, unsigned v1_size, float* M, float* v2, unsigned v2_size)
{
	extern __shared__ float sdata[];

	unsigned v1_pos = threadIdx.x;
	while (v1_pos < v1_size){
		sdata[v1_pos] = v1[v1_pos];
		v1_pos += blockDim.x;
	}
	__syncthreads();

	unsigned v2_pos = blockIdx.x*blockDim.x + threadIdx.x;
	float result = 0;

	if (v2_pos < v2_size){
		for (unsigned i=0; i < v1_size; i++){
			result += sdata[i] * M[v2_pos + i];
		}
		v2[v2_pos] = result;
	}
}
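
For comparison, here is a minimal sketch of how the same product might look if M is stored row-major with v1_size rows and v2_size columns. That layout is an assumption, since the post does not say how M is stored. Under it, element (i, j) sits at M[i * v2_size + j], whereas M[v2_pos + i] in the kernel above does not seem to address a two-dimensional matrix at all. The kernel name vectorMatrixMultiplicationSketch and the host helper vectorMatrixMultiply are hypothetical names made up for this illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: M is row-major with v1_size rows and v2_size columns,
// so element (i, j) lives at M[i * v2_size + j].
__global__ void vectorMatrixMultiplicationSketch(const float* v1, unsigned v1_size,
                                                 const float* M,
                                                 float* v2, unsigned v2_size)
{
    extern __shared__ float sdata[];

    // Every block cooperatively stages the whole of v1 in shared memory.
    for (unsigned i = threadIdx.x; i < v1_size; i += blockDim.x) {
        sdata[i] = v1[i];
    }
    __syncthreads();

    unsigned v2_pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (v2_pos < v2_size) {
        float result = 0.0f;
        for (unsigned i = 0; i < v1_size; i++) {
            // Neighbouring threads read neighbouring columns of row i,
            // so the loads from M can be coalesced.
            result += sdata[i] * M[i * v2_size + v2_pos];
        }
        v2[v2_pos] = result;
    }
}

// Hypothetical host helper (not part of the original post).
// Return codes of the CUDA calls are ignored here for brevity.
std::vector<float> vectorMatrixMultiply(const std::vector<float>& v1,
                                        const std::vector<float>& M,
                                        unsigned v2_size)
{
    unsigned v1_size = static_cast<unsigned>(v1.size());
    std::vector<float> v2(v2_size, 0.0f);

    float* d_v1 = nullptr;
    float* d_M = nullptr;
    float* d_v2 = nullptr;
    cudaMalloc(&d_v1, v1_size * sizeof(float));
    cudaMalloc(&d_M, M.size() * sizeof(float));
    cudaMalloc(&d_v2, v2_size * sizeof(float));

    cudaMemcpy(d_v1, v1.data(), v1_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_M, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);

    unsigned threads = 128;
    unsigned blocks = (v2_size + threads - 1) / threads;
    size_t sharedBytes = v1_size * sizeof(float);  // dynamic shared memory for sdata
    vectorMatrixMultiplicationSketch<<<blocks, threads, sharedBytes>>>(
        d_v1, v1_size, d_M, d_v2, v2_size);

    // The blocking copy back also waits for the kernel to finish.
    cudaMemcpy(v2.data(), d_v2, v2_size * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_v1);
    cudaFree(d_M);
    cudaFree(d_v2);
    return v2;
}
```

Two limits apply to both versions. All of v1 has to fit in the dynamic shared memory of one block, which is typically 48 KB by default, so roughly 12288 floats. Every block also reloads the full vector, which only pays off when each block computes enough output elements.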

Thanks

I think you should post this question on the CUDA Programming and Development forum. You will get a better response there!

I’ll do that, thank you.

An administrator can remove this entry because I copied it here
