# Is there an optimization to accelerate an irregular reduction, e.g. by benefiting from warp reduce?

I have to solve a problem about sparse matrix-vector multiplication on-the-fly. The basic problem is sparse matrix-vector multiplication, but each entry of the matrix is computed on the fly, so I cannot use the cuSPARSE library.

After several weeks of optimization, my code is already more than 10x faster than my original naive solution, but I think there is still room for further speedup. One place I think can be optimized is the reduction part.

Here is the pattern of this sparse matrix. The size of the matrix is O(1000) x O(10000), but each column has at most 5 successive non-zeros, and these positions circularly shift by 1 or 0 from one column to the next. If I draw a figure, it looks something like this:

``````
0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 0 0
. . . . . . . . 0 0
. . . . . . . . x 0
. . . . . . . . x x
x 0 0 0 0 . . . x x
x x 0 0 0 . . . 0 x
x x x 0 0 . . . 0 0
0 x x x 0 . . . 0 0
0 x x x x . . . 0 0
0 0 0 x x . . . 0 0
0 0 0 0 x . . . 0 0
0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 0 0
. . . . . . . . . .
. . . . . . . . . .
``````

Because this problem cannot be solved in just one block, I split the vector into many small pieces (e.g. of size 1024) so that each smaller matrix-vector multiplication can be computed within one block.

For a matrix-vector multiplication y = A * x in this case, because I know each column is very sparse, I keep a shared-memory buffer of the same size as y, and each thread in the block runs code like the following:

``````
for (int i = firstNonZerosPos; i <= lastNonZerosPos; i++)
{
    // some code to compute the matrix entry on the fly
    float a = matrixEntry();
    float b = a * input;
    if (b != 0)
        atomicAdd(&y[i], b);  // y lives in shared memory
}
``````

I measured that this atomicAdd takes 11.5% of the compute time, and since each row of the matrix is not necessarily sparse, the atomicAdd can take a lot of time. I'm wondering whether there is a good way to optimize it. I thought about using a warp reduction to avoid some of the shared-memory loads and stores. But the problem is that this is not a regular reduction, because

``````atomicAdd(&y[i], b);
``````

is not performed on a position that has a fixed relation to the thread ID.
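(For reference, the "regular" warp reduction being contrasted here assumes every lane of the warp holds a partial sum destined for the *same* output element, so the lanes can fold pairwise with shuffles. A minimal sketch, using the CUDA 9+ `__shfl_down_sync` intrinsic:)

``````
// Regular warp reduction: all 32 lanes contribute to one output value,
// so lane positions map directly onto the reduction tree.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's total
}
``````

The difficulty described above is precisely that `y[i]` does not satisfy this "one output per warp" assumption.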

Without going into the details of your particular problem, since you know it better, there are a few things we can recall about atomics.
If I am not mistaken, atomicAdd is slower for floats than for integers, which could matter in your case. I don't know what card you are using, but Maxwell and Pascal have native shared-memory atomics (for integers), and an NVIDIA blog post demonstrating histograms with global and shared memory shows the difference.
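(The histogram pattern from that blog post looks roughly like the following sketch, not tuned and with an assumed 256-bin layout: each block accumulates into fast shared-memory bins, then merges its partial histogram into global memory once at the end.)

``````
#define NUM_BINS 256

__global__ void histogramShared(const unsigned char *in, int n, unsigned int *out)
{
    __shared__ unsigned int bins[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    // grid-stride loop over the input; shared-memory atomics are native on Maxwell+
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i]], 1u);
    __syncthreads();

    // one global atomic per bin per block instead of one per input element
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);
}
``````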

There is always the friendly reminder to profile a release build, not a debug build, and it is also wise to search for threads specifically about atomics. Maybe you are not doing anything wrong, but it is worth comparing against other posts on the subject.

My card is a Pascal Titan X. In my case I have to use floating-point numbers, because this is scientific computing.

For my profiling, I do compile a release build, with all the optimization flags I can add.

It’s not obvious to me why you can’t use some form of warp-level aggregation or warp reduction, or indeed a general parallel reduction, to improve this, by reducing or mostly obviating the need for atomics.

I assume the banding pattern in your matrix means that although each column has at most 5 non-zeros, each row can still have more than 5 non-zero values. (You appear to be indicating ~1000 rows but ~10000 columns.)

The idea would be to have a set of threads handle each row of the matrix, computing the individual products, and then using a warp-level or higher reduction to create the final product (for that matrix row).

You would need to compute the indexing, based on the banding pattern, so that the entire row is covered. If the number of active threads per row is small, then you could use a 2D grid to increase the thread count. If the number of active threads per row is large, you could simply have a single “row” of threads that iterated through the matrix rows, computing new banding positions at each iteration. Based on the description of the banding pattern, this per-iteration-computation seems like it would be fairly simple.
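(Sketched in CUDA, the per-row idea might look like the following. All names are hypothetical: `rowBandStart()`/`rowBandCount()` stand in for whatever closed-form indexing the banding pattern allows, and `matrixEntry(row, col)` for the on-the-fly entry computation; none of these come from the original post.)

``````
// Hypothetical sketch: one warp per matrix row. Each lane computes the
// products for a strided subset of that row's non-zeros, then a shuffle
// reduction folds the partials so only lane 0 writes y[row] -- no atomics.
__global__ void spmvRowWarp(const float *x, float *y, int numRows)
{
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x & 31;
    if (row >= numRows) return;

    int first = rowBandStart(row);   // first non-zero column of this row
    int count = rowBandCount(row);   // number of non-zeros in this row

    float sum = 0.0f;
    for (int k = lane; k < count; k += 32) {
        int col = first + k;
        sum += matrixEntry(row, col) * x[col];  // entry computed on the fly
    }

    // warp-level reduction of the partial sums
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[row] = sum;  // exclusive writer: the atomicAdd disappears
}
``````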

It seems that if your banding pattern does not repeat within a 1000x1000 subset of the matrix (which seems to be the case if you only have 5 non-zeros per column), then you would have at most ~50 non-zero elements per row to “reduce”. This suggests the 2D grid rather than the 1D striding grid, so as to employ 50x1000 threads rather than 50 threads.
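(A 2D launch along these lines would give roughly the suggested 50x1000 threads. The block dimensions and the kernel name `rowReduceKernel` are illustrative placeholders, not tuned or taken from the thread.)

``````
// Illustrative 2D launch: x covers one row's ~50 non-zeros (padded up to
// a multiple of the warp size), y tiles the ~1000 rows.
dim3 block(64, 4);   // 64 threads span a row; 4 rows per block
dim3 grid(1, (numRows + block.y - 1) / block.y);
rowReduceKernel<<<grid, block>>>(y, x, numRows);  // hypothetical kernel
``````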