Help with SpMV: the warped version of the CSR SpMV from the Programming Guide

Hello there,

I have implemented the two algorithms for sparse matrix × vector multiplication as they appear in the CUDA Programming Guide, the second version being the warped one:

[codebox]__global__ void
spmv_csr_vector_kernel(const float *data,
                       const int   *indices,
                       const int   *ptr,
                       const float *x,
                       float       *y,
                       const int    num_rows)
{
    // volatile so the warp-synchronous reduction below is re-read
    // from shared memory rather than cached in registers
    __shared__ volatile float vals[512];

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
    int warp_id   = thread_id / 32;                        // global warp index
    int lane      = thread_id & (32 - 1);                  // thread index within the warp

    // one warp per row
    int row = warp_id;

    if (row < num_rows) {
        int row_start = ptr[row];
        int row_end   = ptr[row + 1];

        // compute running sum per thread
        vals[threadIdx.x] = 0;
        for (int jj = row_start + lane; jj < row_end; jj += 32)
            vals[threadIdx.x] += data[jj] * x[indices[jj]];

        // parallel reduction in shared memory
        if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];
        if (lane <  8) vals[threadIdx.x] += vals[threadIdx.x +  8];
        if (lane <  4) vals[threadIdx.x] += vals[threadIdx.x +  4];
        if (lane <  2) vals[threadIdx.x] += vals[threadIdx.x +  2];
        if (lane <  1) vals[threadIdx.x] += vals[threadIdx.x +  1];

        // first thread writes the result
        if (lane == 0)
            y[row] += vals[threadIdx.x];
    }
}[/codebox]

When I print out the result vector “y”, it contains only a 32nd of the expected values. I think that is what the kernel should return, given that the last two lines are:

[codebox]if (lane == 0)
    y[row] += vals[threadIdx.x];[/codebox]

I do not understand how to use this algorithm correctly, although I have read about warps. I launch it from the host with a block size of 512 threads and a number of blocks equal to the number of rows in the matrix divided by the block size (512). The result is a vector that contains only the first 1/32 of the values that the non-warped version returns.

Any discussion is highly appreciated.

Cheers.

The code looks OK, so I think you are just launching too few blocks. Since we assign one 32-thread warp per matrix row, you need to launch one 512-thread block for every 16 rows (512/32 = 16). Here’s a version of that code using 128-thread blocks:

http://code.google.com/p/cusp-library/sour…sr_vector.h#118
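For reference, a minimal host-side sketch of that launch math, assuming device pointers named d_data, d_indices, d_ptr, d_x, d_y (names introduced here for illustration, not from the original host code):

[codebox]// One 32-thread warp per row: a 512-thread block covers 512/32 = 16 rows.
const int block_size     = 512;
const int rows_per_block = block_size / 32; // 16
// round up so row counts that are not a multiple of 16 are still covered
const int number_of_blocks = (csr.num_rows + rows_per_block - 1) / rows_per_block;

spmv_csr_vector_kernel<<<number_of_blocks, block_size>>>(
    d_data, d_indices, d_ptr, d_x, d_y, csr.num_rows);[/codebox]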

Thanks a lot. So I have changed the launch configuration to

[codebox]block_size = 512;                     // threads per block
number_of_blocks = csr.num_rows / 16; // one block per 16 rows[/codebox]

instead of

[codebox]block_size = 512;                     // threads per block
number_of_blocks = csr.num_rows / block_size;[/codebox]

As you said, one 32-thread warp is assigned per row, so if the block size is 512, each block covers 16 rows. Although this works for a relatively small (3060 x 3060) matrix, when the input is 2M x 2M it just returns a “y” vector full of zeros. I can’t see why it is not working.

It seems that it does not work because it exceeds the maximum number of blocks allowed, which is 65535, while this launch needs num_rows/16 ≈ 190,000 blocks. I don’t know yet how to get around this, but even then there is another problem: when I run on that matrix with 65535 blocks, not the whole vector y is returned, and the kernel takes around 0.3 seconds. The non-warped version takes about 0.07 seconds for the whole SpMV. This is confusing to me.

EDIT:

I have reduced the number of threads cooperating on each row from 32 to 4, since: 1. this lets me stay under the 65535-block limit; 2. each row in my matrix has exactly 5 non-zeros. However, it still takes 0.15 s to compute the SpMV kernel. Could that be because of the distribution of the data in the matrix? I assumed that with 5 non-zeros per row, assigning 4 threads per row would give optimal performance (but it doesn’t…).
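For clarity, here is a minimal sketch of what that change looks like, assuming the same kernel structure as above. The THREADS_PER_ROW constant and the kernel name are mine, introduced for illustration; only the indexing and the reduction depth differ from the 32-thread version:

[codebox]// Hypothetical variant: 4 threads cooperate on each row instead of a full warp.
#define THREADS_PER_ROW 4

__global__ void
spmv_csr_vector4_kernel(const float *data, const int *indices, const int *ptr,
                        const float *x, float *y, const int num_rows)
{
    __shared__ volatile float vals[512];

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    int row  = thread_id / THREADS_PER_ROW;        // one 4-thread group per row
    int lane = thread_id & (THREADS_PER_ROW - 1);  // index within the group

    if (row < num_rows) {
        int row_start = ptr[row];
        int row_end   = ptr[row + 1];

        vals[threadIdx.x] = 0;
        for (int jj = row_start + lane; jj < row_end; jj += THREADS_PER_ROW)
            vals[threadIdx.x] += data[jj] * x[indices[jj]];

        // reduction over 4 lanes instead of 32
        if (lane < 2) vals[threadIdx.x] += vals[threadIdx.x + 2];
        if (lane < 1) vals[threadIdx.x] += vals[threadIdx.x + 1];

        if (lane == 0)
            y[row] += vals[threadIdx.x];
    }
}[/codebox]

With 4 threads per row, a 512-thread block covers 128 rows, so number_of_blocks becomes csr.num_rows / 128.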

There are two ways you can get around the 65535-block limit: either use a 2D grid of blocks, or have threads process multiple rows. IMO the second approach is better, and it’s what we do in our version of the CSR kernel.
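Roughly, that second approach keeps the per-row work the same but lets each warp stride through the rows, so a fixed-size grid covers a matrix of any size. A minimal sketch under that assumption (the kernel and variable names here are mine; see the cusp-library link above for the real implementation):

[codebox]// Sketch: each warp processes rows warp_id, warp_id + num_warps, ...
// so any grid size within the 65535-block limit covers the whole matrix.
__global__ void
spmv_csr_vector_strided_kernel(const float *data, const int *indices,
                               const int *ptr, const float *x, float *y,
                               const int num_rows)
{
    __shared__ volatile float vals[512];

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    int warp_id   = thread_id / 32;
    int lane      = thread_id & (32 - 1);
    int num_warps = gridDim.x * blockDim.x / 32; // total warps in the grid

    for (int row = warp_id; row < num_rows; row += num_warps) {
        int row_start = ptr[row];
        int row_end   = ptr[row + 1];

        vals[threadIdx.x] = 0;
        for (int jj = row_start + lane; jj < row_end; jj += 32)
            vals[threadIdx.x] += data[jj] * x[indices[jj]];

        if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];
        if (lane <  8) vals[threadIdx.x] += vals[threadIdx.x +  8];
        if (lane <  4) vals[threadIdx.x] += vals[threadIdx.x +  4];
        if (lane <  2) vals[threadIdx.x] += vals[threadIdx.x +  2];
        if (lane <  1) vals[threadIdx.x] += vals[threadIdx.x +  1];

        if (lane == 0)
            y[row] += vals[threadIdx.x];
    }
}[/codebox]

You can then launch any fixed number of blocks (up to 65535) and every row still gets processed.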

Thank you very much for your reply. I should also acknowledge that the topic’s title is wrong: the algorithm is not from the Programming Guide, it is from the paper by Nathan Bell and Michael Garland. I shall edit it appropriately.

I am struggling to figure out the differences between your implementation and what I have listed above. I still need to put a lot of detailed work into it, though…

I am also afraid that this approach will consume more time than the straightforward scalar kernel. At least the version I have listed performs 6 times slower :(