Is there an optimization to accelerate an irregular reduction, e.g. by using a warp reduce?

I have to solve an on-the-fly sparse matrix-vector multiplication problem. The basic operation is sparse matrix-vector multiplication, but each entry of the matrix is computed on the fly, so I cannot use the cuSPARSE library.

After several weeks of optimization, my code is already more than 10x faster than my original naive solution, but I think there is still room for further speedup. One place I think can be optimized is the reduction.

Here is the pattern of the sparse matrix. Its size is O(1000) x O(10000), but each column has at most 5 successive non-zeros, and these positions are circularly shifted by 0 or 1 between neighboring columns. If I draw a figure, it looks something like this:

0 0 0 0 0 0 0
0 0 0 0 0 0 0
. . . . . . .
. . . . . . .
. . . . . x 0
x 0 0 0 0 x x
x x 0 0 0 … x x
x x x 0 0 … 0 x
0 x x x 0 … 0 0
0 x x x x … 0 0
0 0 0 x x 0 0
0 0 0 0 x 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
. . . . . . .
. . . . . . .
. . . . . . .

Because this problem cannot be solved within just one block, I split the vector into many small pieces (e.g. 1024 elements each) so that a smaller matrix-vector multiplication can be computed within one block.

For one of these smaller matrix-vector multiplications, y = A * x, since I know each column is very sparse, I allocate a shared-memory buffer of the same size as y, and each thread in the block runs code like the following:

for (int i = firstNonZeroPos; i < lastNonZeroPos; i++)
{
    // some code to compute the matrix entry on the fly

    float a = matrixEntry();
    float b = a * input;
    if (b != 0.0f)
        atomicAdd(&y[i], b);   // y lives in shared memory
}
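To make the setup more concrete, here is a stripped-down sketch of how the whole kernel is structured. This is not my real code: computeEntry and firstNonZeroRow are stand-ins for my on-the-fly entry computation and for the band start position, and PIECE_SIZE / MAX_NNZ_PER_COL are illustrative names and values.

#define PIECE_SIZE      1024   // columns handled per block (illustrative value)
#define MAX_NNZ_PER_COL 5      // at most 5 successive non-zeros per column

// Stand-in for the real on-the-fly entry computation (problem specific).
__device__ float computeEntry(int row, int col) { return 1.0f; }

// Stand-in: first non-zero row of a column, following the band pattern.
__device__ int firstNonZeroRow(int col) { return col / 10; }

__global__ void spmvOnTheFly(const float* __restrict__ x,
                             float* __restrict__ y,
                             int numRows, int numCols)
{
    // Shared accumulator covering the rows touched by this block's columns.
    __shared__ float ySh[PIECE_SIZE + MAX_NNZ_PER_COL];

    int colBase = blockIdx.x * PIECE_SIZE;
    int rowBase = firstNonZeroRow(colBase);     // first row this piece can touch

    // Zero the shared accumulator.
    for (int i = threadIdx.x; i < PIECE_SIZE + MAX_NNZ_PER_COL; i += blockDim.x)
        ySh[i] = 0.0f;
    __syncthreads();

    // One thread per column of this piece.
    for (int col = colBase + threadIdx.x;
         col < min(colBase + PIECE_SIZE, numCols);
         col += blockDim.x)
    {
        float xv    = x[col];
        int   first = firstNonZeroRow(col);
        int   last  = min(first + MAX_NNZ_PER_COL, numRows);

        for (int row = first; row < last; ++row)
        {
            float b = computeEntry(row, col) * xv;
            if (b != 0.0f)
                atomicAdd(&ySh[row - rowBase], b);   // the hot spot in question
        }
    }
    __syncthreads();

    // Flush the shared accumulator to the global result.
    for (int i = threadIdx.x; i < PIECE_SIZE + MAX_NNZ_PER_COL; i += blockDim.x)
    {
        int row = rowBase + i;
        if (row < numRows)
            atomicAdd(&y[row], ySh[i]);   // neighboring pieces may overlap by a few rows
    }
}

With this layout each block handles PIECE_SIZE consecutive columns, so the launch would be something like spmvOnTheFly<<<(numCols + PIECE_SIZE - 1) / PIECE_SIZE, 256>>>(d_x, d_y, numRows, numCols), and the remaining serialization is in the two groups of atomicAdds.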

I measured that this atomicAdd takes 11.5% of the compute time, and since each row of the matrix is not necessarily sparse, many threads can hit the same y[i], so the atomicAdd costs a lot of time. I'm wondering whether there is a good way to optimize it. I thought about using a warp reduction to avoid some of the shared-memory loads and stores, but the problem is that this is not a regular reduction, because

atomicAdd(&y[i], b);

is not performed on a position that has a fixed relation to the thread ID.
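For reference, the kind of regular warp reduction I have in mind is the usual shuffle pattern, where all 32 lanes of a warp fold their values into lane 0 so that only one shared-memory or atomic write is needed per warp (a minimal sketch, not taken from my code):

// Regular warp reduction: every lane contributes to the same sum,
// and after the loop only lane 0 holds the complete result.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

Because of the shifted bands, the lanes of one warp generally target different y[i], so I don't see how to map this pattern onto my loop directly.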

Does anyone have ideas about how to optimize this?