Vector-Matrix Multiplication: Is this a fast kernel?

Is this a fast Vector-Matrix Multiplication kernel?

What would you change to make it faster?

What kind of problems can it have?

__global__
void vectorMatrixMultiplicationKernel(float* v1, unsigned v1_size, float* M, float* v2, unsigned v2_size)
{
	// Stage the whole input vector in dynamically allocated shared memory.
	extern __shared__ float sdata[];
	unsigned v1_pos = threadIdx.x;
	while (v1_pos < v1_size){
		sdata[v1_pos] = v1[v1_pos];
		v1_pos += blockDim.x;
	}
	__syncthreads();

	// Each thread computes one element of the output vector.
	unsigned v2_pos = blockIdx.x*blockDim.x + threadIdx.x;
	float result = 0;
	if (v2_pos < v2_size){
		for (unsigned i=0; i < v1_size; i++){
			result += sdata[i] * M[v2_pos + i];
		}
		v2[v2_pos] = result;
	}
}
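Since the kernel uses dynamically sized shared memory, a launch along these lines is assumed; the block size and the d_ pointer names below are only illustrative:

unsigned block = 256;
unsigned grid  = (v2_size + block - 1) / block;
// Third launch parameter: room in shared memory for the whole input vector.
vectorMatrixMultiplicationKernel<<<grid, block, v1_size * sizeof(float)>>>(d_v1, v1_size, d_M, d_v2, v2_size);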

Thanks

I very much doubt it, but it is your code - benchmark it and see for yourself. CUBLAS includes an sgemv implementation which you can use for comparison.

You might be interested in this thread

The fact that you are loading the whole rhs vector into shared memory is a pretty major restriction on total problem size. Also, the lhs matrix memory access patterns don’t look like they will coalesce, which will have a major effect on performance.
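For reference, a minimal sketch of such a comparison using cublasSgemv from the CUBLAS v2 API could look like this (the column-major layout, sizes, and names are illustrative; error checking is omitted):

#include <cublas_v2.h>

// Reference y = A*x with CUBLAS for an m x n column-major matrix already on the device.
void sgemvReference(const float* d_A, const float* d_x, float* d_y, int m, int n)
{
	cublasHandle_t handle;
	cublasCreate(&handle);

	const float alpha = 1.0f;   // y = alpha * A * x + beta * y
	const float beta  = 0.0f;

	// CUBLAS_OP_N: use A as stored; leading dimension is m for column-major storage.
	cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m, d_x, 1, &beta, d_y, 1);

	cublasDestroy(handle);
}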


Is it a vector-matrix multiplication at all? I think you dropped part of the index calculation for the matrix:

result += sdata[i] * M[v2_pos * v1_size + i];


tera

Yes, you’re right, although I use

result += sdata[i] * M[v2_pos + (v2_size * i)];

instead of

result += sdata[i] * M[v2_pos * v1_size + i];

I think this access pattern coalesces better. Using it, I need the matrix to be transposed.
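For completeness, a sketch of the kernel with that transposed-matrix indexing might look like this (Mt is assumed to hold the transpose of M, so consecutive threads read consecutive addresses):

__global__
void vectorMatrixMultiplicationTransposedKernel(float* v1, unsigned v1_size, float* Mt, float* v2, unsigned v2_size)
{
	// Stage the input vector in shared memory, as in the original kernel.
	extern __shared__ float sdata[];
	for (unsigned v1_pos = threadIdx.x; v1_pos < v1_size; v1_pos += blockDim.x)
		sdata[v1_pos] = v1[v1_pos];
	__syncthreads();

	unsigned v2_pos = blockIdx.x*blockDim.x + threadIdx.x;
	if (v2_pos < v2_size){
		float result = 0.0f;
		for (unsigned i = 0; i < v1_size; i++){
			// Mt is the transposed matrix: threads v2_pos and v2_pos+1 read
			// adjacent elements of row i, so the loads coalesce.
			result += sdata[i] * Mt[v2_pos + v2_size * i];
		}
		v2[v2_pos] = result;
	}
}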

avidday

Thank you, that thread was very interesting to me.

I’m using it for neural nets, but I calculate "r = A1x1 + A2x2 + … + Anxn - b" instead of "r = Ax - b". So the kernels I have calculate "r = r + Ax", and at the beginning I have another kernel that calculates "r = -b". The sizes of r (output) and x (input) can be different; I just thought "r = Ax" was a more general kernel to discuss.

One of my kernels to calculate "r = r + Ax" is pretty similar to yours. It needs the matrix to be transposed, but since I call the kernel many times, I can transpose it just once and keep it in memory that way, so I’m not counting the transposition time.
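As a rough illustration of that flow (negateKernel and accumulateKernel are hypothetical placeholders for the "r = -b" and "r = r + Ax" kernels described above; the launch configuration is arbitrary):

// Forward declarations of the (hypothetical) kernels described in the text.
__global__ void negateKernel(const float* b, float* r, unsigned r_size);                                        // r = -b
__global__ void accumulateKernel(const float* x, unsigned x_size, const float* At, float* r, unsigned r_size);  // r += A*x, A pre-transposed

// Host-side flow for r = A1*x1 + ... + An*xn - b; names and block size are illustrative.
void evaluate(float* d_r, unsigned r_size, const float* d_b,
              float* const* d_At, float* const* d_x, const unsigned* x_sizes, unsigned n_inputs)
{
	const unsigned block = 256;
	const unsigned grid  = (r_size + block - 1) / block;

	// r = -b, done once per evaluation.
	negateKernel<<<grid, block>>>(d_b, d_r, r_size);

	// r = r + A_k * x_k, reusing the matrices that were transposed once up front.
	for (unsigned k = 0; k < n_inputs; ++k) {
		size_t shmem = x_sizes[k] * sizeof(float);
		accumulateKernel<<<grid, block, shmem>>>(d_x[k], x_sizes[k], d_At[k], d_r, r_size);
	}
}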

In the kernel that is similar to yours I have changed

float partial_sum = 0.0f;
for(int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
	float Aval   = A[rowIdx+idx];
	float xval   = x[idx];
	partial_sum += Aval * xval;
}

for

float partial_sum = 0.0f;
for(int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
	partial_sum += A[rowIdx+idx] * x[idx];
}

and I get a small improvement, but I don’t understand why.

This new kernel is faster than the one equivalent to yours (note that I don’t count the time spent transposing the matrix), but, as you said, the size of the input (x) is limited by the size of shared memory. However, I can easily partition x and At (the transposed A) and call the kernel as many times as necessary.
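A partitioned host loop along these lines is one way to do that (accumulateKernel is again a hypothetical placeholder for the accumulating kernel; the chunk size would be limited by the shared memory available per block):

// Hypothetical sketch: split x (and the matching rows of the transposed matrix At)
// into chunks small enough for shared memory, accumulating each partial product into r.
__global__ void accumulateKernel(const float* x, unsigned x_size, const float* At, float* r, unsigned r_size);

void partitionedGemv(const float* d_x, unsigned x_size, const float* d_At,
                     float* d_r, unsigned r_size, unsigned max_chunk)
{
	const unsigned block = 256;
	const unsigned grid  = (r_size + block - 1) / block;

	for (unsigned offset = 0; offset < x_size; offset += max_chunk) {
		unsigned chunk = (x_size - offset < max_chunk) ? (x_size - offset) : max_chunk;
		// Row 'offset' of the transposed matrix starts at element offset * r_size.
		accumulateKernel<<<grid, block, chunk * sizeof(float)>>>(
			d_x + offset, chunk, d_At + (size_t)offset * r_size, d_r, r_size);
	}
}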

Thank you very much. I thought I was wasting my time and that my kernels were bad, since I was getting better times with hand-written assembly (SSE2, XMM registers) than with CUDA. Now I know that my card (a 9500 GT) is the one to blame for that.

Unfortunately I’m not using the NVIDIA tools to measure performance; I just use the clock, since I’m used to it.
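In case it is useful, timing with CUDA events instead of the CPU clock only takes a few lines; a minimal sketch (the commented-out launch is a placeholder for whichever kernel is being measured):

#include <cstdio>
#include <cuda_runtime.h>

// Minimal event-based timing sketch; the commented-out launch is a placeholder.
float timeKernelMs()
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	// myKernel<<<grid, block, shmem>>>(...);   // placeholder for the kernel under test
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);                 // wait until the stop event has completed

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("kernel time: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;
}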