Is there an optimization to accelerate an irregular reduction, such as benefiting from a warp-level reduction?

I have to solve a sparse matrix-vector multiplication problem on the fly. That is, the basic problem is sparse matrix-vector multiplication, but each entry of the matrix is computed on the fly, so I cannot use the cuSPARSE library.

After several weeks of optimization, my code already runs more than 10 times faster than my original naive solution, but I think there must still be room for further speedup. One place I think can be optimized is the reduction part.

Here is the pattern of this sparse matrix. The size of the matrix is O(1000) x O(10000), but each column has at most 5 successive non-zeros, and these positions circularly shift by 1 or 0 between neighboring columns. If I draw a figure, it looks something like this:

0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 0 0
… … . . . 0 0
… … . . . x 0
… … . . . x x
x 0 0 0 0 … x x
x x 0 0 0 … 0 x
x x x 0 0 … 0 0
0 x x x 0 … 0 0
0 x x x x … 0 0
0 0 0 x x … 0 0
0 0 0 0 x … 0 0
0 0 0 0 0 … 0 0
0 0 0 0 0 … 0 0
. . . . . … . .
. . . . . … . .
. . . . . … . .

Because this problem cannot be solved in just one block, I split the vector into many small pieces (e.g. of length 1024) so that the smaller matrix-vector multiplications can each be computed within a block.

For a matrix-vector multiplication in this case, y = A * x, since I know each column is very sparse, I keep a shared-memory buffer of the same size as y, and each thread in the block runs code like the following:

for (int i = firstNonZerosPos; i <= lastNonZerosPos; i++) {
    // some code to compute the matrix entry on the fly
    float a = matrixEntry();
    float b = a * input;
    if (b != 0.0f)
        atomicAdd(&y[i], b);
}

I measured that this atomicAdd takes 11.5% of the compute time, and since each row of the matrix is not necessarily sparse, the atomicAdd can take a lot of time. I'm wondering if there is a good way to optimize it. I thought about using a warp reduction to avoid some of the shared-memory loads and stores. But the problem is that this is not a regular reduction, because

atomicAdd(&y[i], b);

is not performed on a position that has a fixed relation to the thread ID.

Does anyone have ideas about this optimization?

Without going into the details of your particular problem, since you know it better, there are a few things we can recall about atomics.
If I am not mistaken, atomicAdd is slower for floats than for integers, which could matter in your case. I don't know what card you are using, but Maxwell and Pascal have native shared-memory atomics (for integers), and an NVIDIA blog demonstrating histograms with global and shared-memory atomics shows the difference.
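For reference, the shared-memory histogram pattern that the blog demonstrates looks roughly like this (a generic sketch, not your exact problem; the kernel name and bin count are illustrative):

```cuda
// Per-block histogram in shared memory (integer shared-memory atomics are
// native on Maxwell/Pascal), merged into the global histogram at the end.
__global__ void histogram256(const unsigned char *in, int n,
                             unsigned int *gHist)
{
    __shared__ unsigned int sHist[256];

    // Zero the shared histogram cooperatively.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        sHist[i] = 0;
    __syncthreads();

    // Grid-stride loop: the many contended atomics hit fast shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[in[i]], 1u);
    __syncthreads();

    // Only 256 global atomics per block instead of one per input element.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&gHist[i], sHist[i]);
}
```

The point is that the contention moves from global to shared memory, and the number of global atomics drops drastically.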

There is always the friendly reminder to "profile a release build, not a debug build", and it is also wise to search for threads specifically about atomics. Maybe you are not doing anything wrong, but you should compare against other posts around.

Thanks for your reply.

My card is a Pascal Titan X. In my case I have to use floating-point numbers, because this is scientific computing.

For my profiling, I do compile a release build, and I also include all the optimization flags I can add.

It’s not obvious to me why you can’t use some form of warp-level aggregation or warp reduction, or indeed a general parallel reduction, to improve this, by reducing or mostly obviating the need for atomics.
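Warp-level aggregation, for instance, can be sketched as follows (based on the warp-aggregated atomics / "voting and shuffling" technique from the NVIDIA developer blog; this uses CUDA 9+ `_sync` intrinsics, which work on Pascal, and all identifiers are illustrative, not from your code). Lanes that happen to target the same y[i] elect a leader, reduce among themselves with shuffles, and issue a single atomicAdd:

```cuda
// Each active lane holds a target row index `key` and a value `val`.
// Lanes sharing a key are combined so only one atomicAdd is issued per
// distinct key in the warp.
__device__ void warpAggregatedAdd(float *y, int key, float val)
{
    unsigned int unclaimed = __activemask();
    unsigned int peers;

    // Repeatedly let the lowest unclaimed lane broadcast its key; all
    // lanes with the same key form one peer group and exit together.
    for (;;) {
        int leader = __ffs(unclaimed) - 1;
        int leaderKey = __shfl_sync(unclaimed, key, leader);
        peers = __ballot_sync(unclaimed, key == leaderKey);
        if (key == leaderKey)
            break;
        unclaimed &= ~peers;
    }

    // The lowest lane of the peer group accumulates the others' values.
    int lane  = threadIdx.x & 31;
    int first = __ffs(peers) - 1;
    float sum = val;
    unsigned int rest = peers & ~(1u << first);
    while (rest) {
        int src = __ffs(rest) - 1;
        float t = __shfl_sync(peers, val, src);
        if (lane == first)
            sum += t;
        rest &= rest - 1;  // clear lowest set bit
    }

    // One atomic per distinct key instead of one per lane.
    if (lane == first)
        atomicAdd(&y[key], sum);
}
```

The linear accumulation loop is O(group size); the blog's version uses a tree reduction, but for at most 32 lanes the simple form already shows the idea.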

I assume the banding pattern in your matrix means that although each column has at most 5 non-zeros, each row can still have more than 5 non-zero values. (You appear to be indicating ~1000 rows but ~10000 columns.)

The idea would be to have a set of threads handle each row of the matrix, computing the individual products, and then using a warp-level or higher reduction to create the final product (for that matrix row).

You would need to compute the indexing, based on the banding pattern, so that the entire row is covered. If the number of active threads per row is small, you could use a 2D grid to increase the thread count. If the number of active threads per row is large, you could simply have a single “row” of threads that iterates through the matrix rows, computing new banding positions at each iteration. Based on your description of the banding pattern, this per-iteration computation seems fairly simple.

It seems that if your banding pattern does not repeat within a 1000x1000 subset of the matrix (which seems to be the case if you have only 5 non-zeros per column), then you would have at most ~50 non-zero elements per row to “reduce”. This would suggest the 2D grid, not the 1D striding grid, so as to employ 50x1000 threads rather than 50 threads.
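As a concrete sketch of this per-row idea, assuming (hypothetically) that each row's non-zeros form one contiguous band and that helpers like bandStart, bandCount and entry can compute the band position and the matrix entry on the fly (those names are mine, not from the original code):

```cuda
// Classic warp shuffle reduction: sums a value across the 32 lanes of a
// warp without shared memory; lane 0 ends up with the warp total.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// One warp (or a few) per matrix row: threadIdx.x strides across that
// row's ~50 non-zeros, threadIdx.y / blockIdx.x pick the row.
__global__ void bandedSpmv(const float *x, float *y, int rows)
{
    int row = blockIdx.x * blockDim.y + threadIdx.y;
    if (row >= rows) return;

    int first = bandStart(row);   // hypothetical: first non-zero column
    int count = bandCount(row);   // hypothetical: non-zeros in this row

    float partial = 0.0f;
    for (int k = threadIdx.x; k < count; k += blockDim.x) {
        int col = first + k;
        partial += entry(row, col) * x[col];  // on-the-fly matrix entry
    }

    partial = warpReduceSum(partial);
    // One atomic per warp instead of one per non-zero element.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(&y[row], partial);
}
```

With blockDim.x a multiple of 32, every lane reaches the shuffle (idle lanes simply contribute 0.0f), so the full-warp mask is safe.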

Thank you for the reply.

I did a rough count of the non-zeros per row: there are at most 700 non-zeros in a row when the matrix is 1000 x 16000, and the number of non-zeros per row also increases with the size of the matrix, because the band does repeat, and the band is not continuous from the top of the matrix to the bottom.

Because I have no a-priori knowledge of the indices of these non-zeros, the positions also need to be computed on the fly. The banding pattern is not regular: sometimes it shifts by 1, sometimes it doesn’t shift, and I don’t know when it repeats. I used MATLAB’s spy to draw the pattern of the matrix, and it looks like this:
The image shown here is just an example; when the parameters of my whole problem change, the pattern also changes (e.g. the band becomes longer, or the slope of the band changes), but the basic pattern is like this.

Therefore, I have no idea how many threads I need to launch; maybe I can only make a very rough estimate. I also don’t know whether this will waste a lot of threads, because every thread would need to check whether its position is zero.

One last thing: due to the nature of my problem, determining the positions of the non-zeros in each column is easy, but, given a row, determining which indices are non-zero is not easy. By ‘easy’ I mean the computational cost.

But maybe I’m wrong; I will try your suggestion and run an experiment first. Again, I really appreciate your help.