I tried to write a kernel that does matrix-vector multiplication. The vector is very short, only 8 to 10 elements, but the matrix has a huge number of rows, more than 100k. What I did was copy the whole vector into shared memory, and then have each thread compute the dot product of one matrix row with the vector.
Here is the kernel:
__global__ void testKernel(Matrix m_d, Vector v_d, Vector mvProd_d, const int width)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    Real sum = 0;

    // Stage the vector in shared memory; guard against blocks wider than WIDTH,
    // and synchronize so every thread sees the fully loaded vector.
    __shared__ Real vec_s[WIDTH];
    if (threadIdx.x < WIDTH)
        vec_s[threadIdx.x] = v_d.elements[threadIdx.x];
    __syncthreads();

    // One thread per row: dot product of row tid with the shared vector.
    #pragma unroll 8
    for (int i = 0; i < width; i++)
        sum += m_d.elements[tid * width + i] * vec_s[i];

    mvProd_d.elements[tid] = sum;
}
I have run this code on a Tesla C2070 card, but the maximum speedup I got was about 20x over the CPU. I can't figure out how to improve the kernel any further; even when I increase the length of the vector, it doesn't go any faster. Could anyone help me please? Thank you.