SpMV implementation problem

I’m trying to implement the sparse matrix-vector multiplication routine described in “Efficient Sparse Matrix-Vector Multiplication on CUDA” by Bell and Garland (the spmv_csr_vector_kernel function). My code is pretty close to the one given in the paper and it seems to produce correct results, but it is painfully slow (more than 3 times slower than a simple scalar implementation). I wonder what the reason could be. I use a GeForce 8800 GT for testing.

the code:

#define WARP_SZ 32

	__shared__ float vals[THREAD_BLOCK_SIZE * THREAD_BLOCK_SIZE];

	int thread_id = compute_thread_index();   // global thread index
	int warp_id   = thread_id / WARP_SZ;      // global warp index
	int lane      = thread_id & (WARP_SZ - 1); // thread index within the warp

	// one warp per row
	int row = warp_id;

	const int thread_block_id = threadIdx.x + THREAD_BLOCK_SIZE * threadIdx.y;

	if (row < size_vec)
	{
		uint2 rowptr_bounds = rowptr[row];
		const int row_start = rowptr_bounds.x;
		const int row_end   = rowptr_bounds.y;

		// compute running sum per thread
		vals[thread_block_id] = 0;
		for (int jj = row_start + lane; jj < row_end; jj += WARP_SZ) {
			const float xv = tex1Dfetch(texX, colind[jj]);
			vals[thread_block_id] += matrix[jj] * xv;
		}

		// parallel reduction in shared memory
		if (lane < 16) vals[thread_block_id] += vals[thread_block_id + 16];
		if (lane <  8) vals[thread_block_id] += vals[thread_block_id +  8];
		if (lane <  4) vals[thread_block_id] += vals[thread_block_id +  4];
		if (lane <  2) vals[thread_block_id] += vals[thread_block_id +  2];
		if (lane <  1) vals[thread_block_id] += vals[thread_block_id +  1];

		// first thread writes the result
		if (lane == 0) {
			b[row] = vals[thread_block_id];
		}
	}
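
For completeness, a launch configuration consistent with the indexing above would look roughly like the sketch below. The 2D block shape, the placeholder kernel arguments and the assumed behaviour of compute_thread_index are reconstructions to make the fragment self-contained, not necessarily my actual host code.

	// Sketch of a matching launch: one warp per row.
	// Assumes compute_thread_index() linearizes threads as
	//   global_block_id * block_size + threadIdx.y * THREAD_BLOCK_SIZE + threadIdx.x
	const int block_size      = THREAD_BLOCK_SIZE * THREAD_BLOCK_SIZE; // threads per block
	const int warps_per_block = block_size / WARP_SZ;
	const int num_blocks      = (size_vec + warps_per_block - 1) / warps_per_block;

	dim3 block(THREAD_BLOCK_SIZE, THREAD_BLOCK_SIZE);
	spmv_csr_vector_kernel<<<num_blocks, block>>>(/* matrix, colind, rowptr, b, size_vec */);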

I tried it on a GTX 260. It’s better, but still slower than the scalar implementation (which uses one thread per row rather than one warp per row).
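
For reference, by the scalar implementation I mean the usual one-thread-per-row CSR kernel, roughly as sketched below. It reuses the same buffers, texture and uint2 rowptr layout as the vector kernel above; the exact signature is an assumption.

	// One thread per row (CSR "scalar" kernel) -- sketch for comparison only.
	__global__ void spmv_csr_scalar_kernel(const float* matrix, const int* colind,
	                                       const uint2* rowptr, float* b, int size_vec)
	{
		const int row = blockIdx.x * blockDim.x + threadIdx.x;
		if (row < size_vec)
		{
			const uint2 rowptr_bounds = rowptr[row];
			const int row_start = rowptr_bounds.x;
			const int row_end   = rowptr_bounds.y;

			float sum = 0.0f;
			for (int jj = row_start; jj < row_end; ++jj)
				sum += matrix[jj] * tex1Dfetch(texX, colind[jj]);

			b[row] = sum;
		}
	}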

I think the semi-official NVIDIA source for sparse matrix multiplication was posted in the General Computing forum a few months back, if you want to take a look at it.

I’m aware of that. In fact, I’m trying to mix the implementation by Buatois, which uses the blocked CRS format, with some optimizations from NVIDIA’s paper, and it does not work very well. But after searching the General Computing forum for the topic you pointed out, I found an interesting statement.

Maybe that is my case; I need to examine the test matrices.
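
Something like the host-side pass below should show the nonzeros-per-row distribution of a test matrix (rowptr_host is a hypothetical host copy of the uint2 row bounds used by the kernel; the rest follows the same layout).

	// Hypothetical host-side check of the row-length distribution.
	// rowptr_host: host copy of the uint2 (row_start, row_end) pairs, size_vec entries.
	int min_nnz = 0x7fffffff, max_nnz = 0;
	long long total_nnz = 0;
	for (int row = 0; row < size_vec; ++row)
	{
		const int nnz = (int)(rowptr_host[row].y - rowptr_host[row].x);
		if (nnz < min_nnz) min_nnz = nnz;
		if (nnz > max_nnz) max_nnz = nnz;
		total_nnz += nnz;
	}
	printf("nnz per row: min %d  max %d  avg %.1f\n",
	       min_nnz, max_nnz, (double)total_nnz / size_vec);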