Hi there
I need a segmented reduction to sum fixed-length segments. For example, I have an array of N=200 float2’s, representing complex values, and I wish to sum each sub-segment of 10 values to produce N/10 == 20 sums. In fact, my actual problem is 5D, and I wish to sum over the last dimension to produce a 4D result.
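To make this concrete, here’s a serial reference of the result I’m after (the function and variable names are just for illustration):

```cpp
#include <cuda_runtime.h>   // for float2 / make_float2

// Serial reference: sum each fixed-length segment of S values.
// 'in' has N elements, 'out' has N/S elements. Names are illustrative only.
void segment_sum_reference(const float2* in, float2* out, int N, int S)
{
    for (int seg = 0; seg < N / S; ++seg) {
        float2 acc = make_float2(0.0f, 0.0f);
        for (int i = 0; i < S; ++i) {
            acc.x += in[seg * S + i].x;
            acc.y += in[seg * S + i].y;
        }
        out[seg] = acc;
    }
}
```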
I’m aware of reduce-by-key implementations, which, as I understand it, generally require N integer keys for N data values. By contrast, libraries such as moderngpu implement segmented reduction where variable-length segments can be specified.
Ideally, I’d like to avoid using reduce-by-key in a bid to reduce memory usage. moderngpu avoids this to some extent, but I’d imagine it sacrifices some performance due to its support for variable-length segments, and it still requires memory to describe the segment lengths. Is this accurate?
I’m thinking of working forward from Sengupta’s “Scan Primitives for GPU Computing”, but discarding the use of flags, since I’d be able to determine the boundaries of my segments. Is this a reasonable approach?
thanks!
I may be misunderstanding your problem, but if you have fixed-length segments then there are many opportunities to optimize these small array operations.
One approach is to be entirely warp-centric and load 320 float2’s – 10 “rows” of 32 float2’s. Then transpose the 10x32 in-register array (via shared mem) so that each 10-float2 subarray ends up in a single lane. Then just add them up and write the result out. Avoid bank conflicts, etc. (I’m not sure I understand your ‘5D’ requirement.)
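A rough sketch of what I mean (untuned, bank conflicts ignored; it assumes the number of segments is a multiple of 32, and the names are just placeholders):

```cpp
// One warp of 32 threads handles 32 segments of 10 float2's (320 values).
// Loads are coalesced; the "transpose" happens through shared memory so
// that each lane ends up owning one whole segment.
__global__ void warp_segment_sum(const float2* __restrict__ in,
                                 float2* __restrict__ out)
{
    __shared__ float2 tile[10 * 32];          // 10 "rows" of 32 float2's

    int lane = threadIdx.x;                   // blockDim.x == 32 assumed
    int warp_base = blockIdx.x * 32;          // first segment owned by this warp
    const float2* src = in + warp_base * 10;  // this warp's 320 float2's

    // Coalesced load of 10 rows of 32 consecutive float2's.
    for (int row = 0; row < 10; ++row)
        tile[row * 32 + lane] = src[row * 32 + lane];
    __syncwarp();

    // After the "transpose", lane L's segment is tile[L*10 .. L*10+9].
    float2 acc = make_float2(0.0f, 0.0f);
    for (int i = 0; i < 10; ++i) {
        acc.x += tile[lane * 10 + i].x;
        acc.y += tile[lane * 10 + i].y;
    }
    out[warp_base + lane] = acc;              // one sum per lane
}
```

This would be launched as warp_segment_sum<<<num_segments / 32, 32>>>(d_in, d_out).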
Another, more exotic and maybe slower, approach would be to load an 800-float2 subarray (4x200) into 25 float2’s per lane (50 registers per lane) and leverage SHFL to avoid using shared memory.
This would require writing a monolithic and unrolled reduction routine that handles the irregular layout. The performance idea is that if you’re OK with the sum total only being in the last (10th) register you can get clever with pipelining inter-lane additions by using SHFL, SELP and lane masks.
row0: 10:10:10:2
row1: 8:10:10:4
row2: 6:10:10:6
...
row24: ...
I’m dubious this would be faster than transposing but it’s an interesting thinking problem.
One last warp-centric idea: why not just perform a traditional prefix sum across your entire problem (or 4x200 float2’s) and then subtract the prefix value at the previous sub-segment’s 10th float2 from the prefix value at the current segment’s 10th float2? You can do all of this in one warp - 25 warp prefix sums with the last lane as input to the next prefix sum. This should be very fast on a device that supports SHFL… 300-400 ops. But transposing is probably going to be even faster.
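For reference, the warp-wide inclusive scan part would look roughly like this (CUDA 9+ *_sync shuffles assumed, older toolkits would use plain __shfl_up; names are placeholders). The per-segment totals then come from subtracting the running total at the previous segment boundary:

```cpp
// Warp-wide inclusive prefix sum of one float2 per lane using shuffles.
// Sketch only: extracting the per-segment sums from the scan result is
// left to the caller, as described above.
__device__ float2 warp_inclusive_scan(float2 v)
{
    const unsigned full_mask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        float x = __shfl_up_sync(full_mask, v.x, offset);
        float y = __shfl_up_sync(full_mask, v.y, offset);
        if (lane >= offset) {   // lanes below 'offset' keep their value
            v.x += x;
            v.y += y;
        }
    }
    return v;
}
```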
Doing many fixed-length reductions is equivalent to summing the rows (or columns) of a matrix.
If you can arrange it so you are summing columns (i.e. arrange your data in column-major order), you can have a very efficient sum-of-columns using a trivial kernel with a for-loop.
(I think allanmac said this ^^ already.)
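Something along these lines (a rough sketch, one thread per output sum; it assumes the data is laid out so that element i of every segment is contiguous in memory, which is what makes the loads coalesced):

```cpp
// Trivial column-sum kernel sketch: one thread per segment sum.
// Layout assumption: element i of segment 'seg' lives at
// in[i * num_segments + seg], i.e. the desired sums are column sums of a
// row-major seg_len x num_segments matrix. At each loop iteration adjacent
// threads then read adjacent addresses (coalesced).
__global__ void sum_segments(const float2* __restrict__ in,
                             float2* __restrict__ out,
                             int seg_len, int num_segments)
{
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg >= num_segments) return;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int i = 0; i < seg_len; ++i) {
        acc.x += in[i * num_segments + seg].x;
        acc.y += in[i * num_segments + seg].y;
    }
    out[seg] = acc;
}
```

Launched with one thread per segment, e.g. sum_segments<<<(num_segments + 255) / 256, 256>>>(d_in, d_out, S, num_segments).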
Summing either rows or columns of a matrix can also be done using CUBLAS. These generally involve some extra “wasted” arithmetic, but since your reduction problem is hugely memory bound, it may not matter much.
http://stackoverflow.com/questions/17862078/cuda-code-for-sum-of-rows-of-a-matrix-too-slow/17864583#17864583
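For example, the row sums of an m x n column-major matrix A are just the matrix-vector product A * ones(n). A rough sketch with cublasSgemv (error checking omitted, names are mine; the complex/float2 case would use cublasCgemv with cuComplex data):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Row sums of an m x n column-major matrix via GEMV with a vector of ones.
// (Pass CUBLAS_OP_T instead to get column sums.)
void row_sums_cublas(cublasHandle_t handle,
                     const float* d_A,   // device pointer, m x n, column-major
                     float* d_sums,      // device pointer, m row sums
                     int m, int n)
{
    // Vector of ones on the device.
    std::vector<float> h_ones(n, 1.0f);
    float* d_ones = nullptr;
    cudaMalloc(&d_ones, n * sizeof(float));
    cudaMemcpy(d_ones, h_ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // sums = 1.0 * A * ones + 0.0 * sums
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m,
                d_ones, 1, &beta, d_sums, 1);

    cudaFree(d_ones);
}
```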
Thank you for the replies; I wasn’t aware of the SHFL approach.
Regarding the 5D shape of the problem, it’s something along the lines of [4, B, C, T, S], where B, C, T and S are configurable dimension sizes. I want to go from a 5D [4, B, C, T, S] to a 4D [4, B, C, T] by summing over the S dimension. As txbob points out, it’s essentially summing up matrix rows.
allanmac and txbob, as I understand it you’re advocating a straightforward summation over each fixed-length segment. This seems sensible for reasonably small segments (say 10 to 100 values). Would you advocate this approach if S were larger (say 10,000 to 100,000 elements)?
For larger segments, I would expect allanmac’s suggestion of a prefix sum followed by a subtraction at each segment boundary to win out, because it’s a parallel reduction rather than a serial summation. Is this correct? I also wonder if there could be floating-point precision issues from performing a prefix sum over a large amount of data?