Matrix - Vector Multiplication Can't get any faster with shared memory

Zhou_En · September 6, 2011, 7:23am

Hi
I tried to write a kernel for matrix-vector multiplication. The vector is very short, only about 8 or 10 elements, but the matrix has a huge number of rows, more than 100k. I copy the whole vector into shared memory, and each thread computes the dot product of one row with the vector.
Here is the kernel:

    __global__ void testKernel(Matrix m_d, Vector v_d, Vector mvProd_d, const int width)
    {
        // One thread per matrix row.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        Real sum = 0;

        // Stage the vector in shared memory (WIDTH is a compile-time constant defined elsewhere).
        __shared__ Real vec_s[WIDTH];
        if (threadIdx.x < WIDTH)
            vec_s[threadIdx.x] = v_d.elements[threadIdx.x];
        __syncthreads();

        // Dot product of row tid with the vector.
        #pragma unroll 8
        for (int i = 0; i < width; i++)
        {
            sum += m_d.elements[tid * width + i] * vec_s[i];
        }
        mvProd_d.elements[tid] = sum;
    }

I ran this code on a Tesla C2070 card, but the maximum speedup I got was 20 times. I can’t figure out how to improve my kernel any further. Even when I increase the length of the vector, the kernel doesn’t get any faster. Could anyone help me, please? Thank you.

pkgind · September 6, 2011, 8:30am

Can you try putting the matrix elements in shared memory as well? I think that should increase performance.

Zhou_En · September 6, 2011, 9:04am

Thanks for the reply.

I’ve tried that, but it didn’t improve anything: during the calculation each matrix element is read only once, whereas the vector elements have to be read once for every row.

avidday · September 6, 2011, 9:10am

You might want to investigate having each thread compute more than one dot product. That might help amortize the overhead associated with the load of the vector from global memory. Also, if there is only a small range of width values your kernel will ever handle, try using C++ templates with width passed as a template parameter, rather than an argument. The compiler might be able to do a better job of optimization when the width is known at compile time. A more extreme “hand optimization” might be to eliminate shared memory and have each thread hold the vector in registers (given that there are only 8 or 10 values). On Fermi, there is about 8x higher register bandwidth than shared memory bandwidth.
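
To make those three suggestions concrete, here is a minimal sketch rather than a tuned implementation. It assumes `Real` is `float`, uses raw row-major pointers instead of the `Matrix`/`Vector` structs from the original post, and the names `matVecRegs` and `numRows` are made up for illustration. The width is a template parameter, the vector lives in a per-thread array that the compiler can keep in registers once the loops are unrolled, and a grid-stride loop lets each thread compute several dot products. The matrix layout is left unchanged here.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

typedef float Real;  // assumption: the thread does not show how Real is defined

// WIDTH is known at compile time, so both inner loops can be fully unrolled
// and vec[] can be held in registers instead of shared memory.
template <int WIDTH>
__global__ void matVecRegs(const Real* __restrict__ m,   // numRows x WIDTH, row-major
                           const Real* __restrict__ v,   // WIDTH elements
                           Real* __restrict__ out,       // numRows elements
                           int numRows)
{
    Real vec[WIDTH];
    #pragma unroll
    for (int i = 0; i < WIDTH; ++i)
        vec[i] = v[i];

    // Grid-stride loop: each thread handles several rows, so the vector load
    // above is amortized over more than one dot product.
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < numRows;
         row += gridDim.x * blockDim.x) {
        Real sum = 0;
        #pragma unroll
        for (int i = 0; i < WIDTH; ++i)
            sum += m[row * WIDTH + i] * vec[i];
        out[row] = sum;
    }
}

int main()
{
    const int W = 8;             // vector length, fixed at compile time
    const int numRows = 100000;  // hypothetical problem size
    std::vector<Real> m(numRows * W), v(W), out(numRows);
    for (int i = 0; i < numRows * W; ++i) m[i] = Real(i % 7);
    for (int i = 0; i < W; ++i) v[i] = Real(i + 1);

    Real *m_d, *v_d, *out_d;
    cudaMalloc(&m_d, m.size() * sizeof(Real));
    cudaMalloc(&v_d, v.size() * sizeof(Real));
    cudaMalloc(&out_d, out.size() * sizeof(Real));
    cudaMemcpy(m_d, m.data(), m.size() * sizeof(Real), cudaMemcpyHostToDevice);
    cudaMemcpy(v_d, v.data(), v.size() * sizeof(Real), cudaMemcpyHostToDevice);

    // 64 blocks of 256 threads: far fewer threads than rows, on purpose.
    matVecRegs<W><<<64, 256>>>(m_d, v_d, out_d, numRows);
    cudaMemcpy(out.data(), out_d, out.size() * sizeof(Real), cudaMemcpyDeviceToHost);

    // Compare against a plain CPU reference.
    int mismatches = 0;
    for (int r = 0; r < numRows; ++r) {
        Real ref = 0;
        for (int i = 0; i < W; ++i) ref += m[r * W + i] * v[i];
        if (std::fabs(ref - out[r]) > Real(1e-3)) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(m_d); cudaFree(v_d); cudaFree(out_d);
    return mismatches != 0;
}
```

If only a few widths ever occur, a small `switch` on the runtime width that dispatches to `matVecRegs<8>`, `matVecRegs<10>`, and so on would keep the compile-time benefit. Whether this beats the shared-memory version is something to measure rather than assume.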

EDIT: Also noticed that the global memory reads from the matrix are not coalesced. There is a lot of potential performance benefit if you restructure the code so that the reads are coalesced.
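
In the original kernel, thread `tid` reads `m_d.elements[tid*width+i]`, so neighbouring threads touch addresses `width` elements apart. One possible restructuring, sketched below under the assumption that the storage layout can be changed, is to keep the matrix transposed (column-major), so that at each step `i` consecutive threads read consecutive addresses. As before, `Real` is assumed to be `float`, and `matVecCoalesced`, `mT` and `numRows` are illustrative names, not anything from the original code.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

typedef float Real;  // assumption: the thread does not show how Real is defined

// mT holds the matrix transposed: element i of row r is at mT[i * numRows + r].
// Requires blockDim.x >= WIDTH so the shared-memory copy covers the whole vector.
template <int WIDTH>
__global__ void matVecCoalesced(const Real* __restrict__ mT,
                                const Real* __restrict__ v,
                                Real* __restrict__ out,
                                int numRows)
{
    __shared__ Real vec_s[WIDTH];
    if (threadIdx.x < WIDTH)
        vec_s[threadIdx.x] = v[threadIdx.x];
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;  // safe: no barrier follows

    Real sum = 0;
    #pragma unroll
    for (int i = 0; i < WIDTH; ++i)
        sum += mT[i * numRows + row] * vec_s[i];  // threads in a warp read adjacent elements
    out[row] = sum;
}

int main()
{
    const int W = 8;             // vector length, fixed at compile time
    const int numRows = 100000;  // hypothetical problem size
    std::vector<Real> mT(W * numRows), v(W), out(numRows);
    for (int i = 0; i < W; ++i) {
        v[i] = Real(i + 1);
        for (int r = 0; r < numRows; ++r)
            mT[i * numRows + r] = Real((r + i) % 5);
    }

    Real *mT_d, *v_d, *out_d;
    cudaMalloc(&mT_d, mT.size() * sizeof(Real));
    cudaMalloc(&v_d, v.size() * sizeof(Real));
    cudaMalloc(&out_d, out.size() * sizeof(Real));
    cudaMemcpy(mT_d, mT.data(), mT.size() * sizeof(Real), cudaMemcpyHostToDevice);
    cudaMemcpy(v_d, v.data(), v.size() * sizeof(Real), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (numRows + threads - 1) / threads;  // one thread per row
    matVecCoalesced<W><<<blocks, threads>>>(mT_d, v_d, out_d, numRows);
    cudaMemcpy(out.data(), out_d, out.size() * sizeof(Real), cudaMemcpyDeviceToHost);

    // Compare against a plain CPU reference using the same transposed layout.
    int mismatches = 0;
    for (int r = 0; r < numRows; ++r) {
        Real ref = 0;
        for (int i = 0; i < W; ++i) ref += mT[i * numRows + r] * v[i];
        if (std::fabs(ref - out[r]) > Real(1e-3)) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(mT_d); cudaFree(v_d); cudaFree(out_d);
    return mismatches != 0;
}
```

If the matrix has to stay row-major, an alternative would be to have each block load a tile of rows into shared memory with coalesced reads and compute from there; that variant is not shown here.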

Zhou_En · September 6, 2011, 1:16pm

Thank you very much, avidday! That’s a very valuable reply. I’ll take another look at my code now.
