# Is there an optimization to accelerate this irregular reduction, e.g. by benefiting from a warp reduce?

I have to solve a sparse matrix-vector multiplication problem on-the-fly. That is, the basic problem is sparse matrix-vector multiplication, but each entry of the matrix is computed on-the-fly, so I cannot use the cuSPARSE library.

After several weeks of optimization, my code already runs more than 10× faster than my original naive solution, but I think there is still room for further speedup. One place I think can be optimized is the reduction.

Here is the pattern of the sparse matrix. The size of the matrix is O(1000) × O(10000), but each column has at most 5 successive non-zeros, and their positions shift down circularly by 0 or 1 from one column to the next. If I draw a figure, it looks something like this:

```
0 0 0 0 0 0 0
0 0 0 0 0 0 0
. . . . . . .
. . . . . . .
. . . . . x 0
x 0 0 0 0 x x
x x 0 0 0 … x x
x x x 0 0 … 0 x
0 x x x 0 … 0 0
0 x x x x … 0 0
0 0 0 x x 0 0
0 0 0 0 x 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
. . . . . . .
. . . . . . .
. . . . . . .
```

Because this problem cannot be solved within a single block, I split the vector into many small pieces (e.g. of length 1024) so that a smaller matrix-vector multiplication can be computed per block.

For a matrix-vector multiplication y = A * x in this case, because I know each column is very sparse, I keep a shared-memory buffer of the same size as y, and each thread in the block runs code like the following:

```
for (int i = firstNonZerosPos; i < lastNonZerosPos; i++)
{
    // some code to compute the matrix entry on-the-fly
    float a = matrixEntry();
    float b = a * input;
    if (b != 0.0f)
        atomicAdd(&y[i], b);
}
```
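By "warp reduce" I mean combining the partial products within a warp before any `atomicAdd`, e.g. with the standard shuffle reduction. This is only a sketch of the idea, not my actual kernel, and it only pays off if the partials that target the same row of `y` sit in the same warp:

```
// Sketch: sum a value across the 32 lanes of a warp, assuming every
// lane holds a partial product destined for the same row of y.
__inline__ __device__ float warpReduceSum(float v)
{
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up with the full sum
}

// Then only lane 0 issues the atomic, turning 32 atomics into 1:
// float sum = warpReduceSum(b);
// if ((threadIdx.x & 31) == 0) atomicAdd(&y[row], sum);
```

My difficulty is that in my access pattern the lanes of a warp target several different (but overlapping) rows, so a plain full-warp reduction does not apply directly; is there a segmented variant that fits this banded pattern?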