Matrix - Vector Multiplication Can't get any faster with shared memory

I tried to write a kernel doing matrix-vector multiplication. It was a very short vector, only about 8 or 10 elements in it. But the matrix has huge number of rows, larger than 100k. What I have done was that I copied the whole vector into shared memory, each thread calculated dot product of a row and the vector.
Here is the kernel:

global void testKernel(Matrix m_d, Vector v_d, Vector mvProd_d, const int width)
int tid = blockIdx.x * blockDim.x + threadIdx.x;
Real sum =0;
shared Real vec_s[WIDTH];
vec_s[threadIdx.x] = v_d.elements[threadIdx.x];
#pragma unroll 8
for(int i=0; i<width; i++)
sum += m_d.elements[tid*width+i] * vec_s[i];
mvProd_d.elements[tid] = sum;

I have ran this code on a Tesla C2070 card, but the maximum speedup I got is 20 times. I couldn’t figure out how to improve my kernel any more. Even by increasing the length of the vector, my kernel just won’t go any faster. Could anyone help me please? Thank you.

Can you try by putting Matrix Elements also in shared memory, I think that should increase performance.

Thanks for the reply.

I’ve tried that. Actually it didn’t improve anything. Because during the calculation, matrix elements were only read once, but vector elements had to be read once for every row.

You might want to investigate having each thread compute more than one dot product. That might help amortize the overhead associated with the load of the vector from global memory. Also, if there is only a small range of width values your kernel will ever handle, try using C++ templates with width passed as a template parameter, rather than an argument. The compiler might be able to do a better job of optimization when the width is known at compile time. A more extreme “hand optimization” might be to eliminate shared memory and have each thread hold the vector in registers (given that there are only 8 or 10 values). On Fermi, there is about 8x higher register bandwidth than shared memory bandwidth.

EDIT: Also noticed that the global memory reads from the matrix are not coalesced. There is a lot of potential performance benefit if you restructure the code so that the reads are coalesced.

Thank you very much avidday! It’s a very valuable reply. I will start to relook at my code now.