Hello there,

I have implemented the two algorithms for sparse matrix–vector multiplication (SpMV) as they are given in the CUDA Programming Guide. The second version is the warp-based one, as follows:

[codebox]__global__ void
spmv_csr_vector_kernel ( const float * data,
                         const int * indices,
                         const int * ptr,
                         const float * x,
                         float * y,
                         const int num_rows )
{
    __shared__ float vals[512];
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
    int warp_id = thread_id / 32;                          // global warp index
    int lane = thread_id & (32 - 1);                       // thread index within the warp

    // one warp per row
    int row = warp_id;
    if ( row < num_rows ) {
        int row_start = ptr[row];
        int row_end   = ptr[row + 1];

        // compute running sum per thread
        vals[threadIdx.x] = 0;
        for ( int jj = row_start + lane; jj < row_end; jj += 32 )
            vals[threadIdx.x] += data[jj] * x[indices[jj]];

        // parallel reduction in shared memory
        if ( lane < 16 ) vals[threadIdx.x] += vals[threadIdx.x + 16];
        if ( lane <  8 ) vals[threadIdx.x] += vals[threadIdx.x +  8];
        if ( lane <  4 ) vals[threadIdx.x] += vals[threadIdx.x +  4];
        if ( lane <  2 ) vals[threadIdx.x] += vals[threadIdx.x +  2];
        if ( lane <  1 ) vals[threadIdx.x] += vals[threadIdx.x +  1];

        // first thread writes the result
        if ( lane == 0 )
            y[row] += vals[threadIdx.x];
    }
}[/codebox]

When I print out the result vector “y”, it contains only the first 1/32 of the expected values. I suspect this is related to the final write, given that the last two lines are

[codebox]if ( lane == 0 )
    y[row] += vals[threadIdx.x];[/codebox]

I do not understand how to use this algorithm, although I have read about warps. I call it from the host with a block size of 512, and the number of blocks is the number of rows in the matrix divided by the block size (512). The result is a vector that contains only the first 1/32 of the values that the non-warped version returns.

Any discussion is highly appreciated.

Cheers.