Hi there
I need a segmented reduction to sum fixed-length segments. For example, I have an array of N=200 float2’s, representing complex values, and I wish to sum each sub-segment of 10 values to produce N/10 == 20 sums. In fact, my actual problem is 5D, and I wish to sum over the last dimension to produce a 4D result.
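To make this concrete, here’s a serial reference of the result I’m after (the function and variable names are just for illustration):

```cpp
#include <cuda_runtime.h>   // for float2 / make_float2

// Serial reference: sum each fixed-length segment of S values.
// 'in' has N elements, 'out' has N/S elements. Names are illustrative only.
void segment_sum_reference(const float2* in, float2* out, int N, int S)
{
    for (int seg = 0; seg < N / S; ++seg) {
        float2 acc = make_float2(0.0f, 0.0f);
        for (int i = 0; i < S; ++i) {
            acc.x += in[seg * S + i].x;
            acc.y += in[seg * S + i].y;
        }
        out[seg] = acc;
    }
}
```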
I’m aware of reduce-by-key implementations, which, as I understand it, generally require N integer keys for N data values. By contrast, libraries such as moderngpu implement segmented reduction where variable-length segments can be specified.
Ideally, I’d like to avoid using reduce-by-key in a bid to reduce memory usage. moderngpu avoids this to some extent, but I’d imagine it sacrifices some performance due to its support for variable-length segments, and it still requires memory to describe the segment lengths. Is this accurate?
I’m thinking of working forward from Sengupta’s “Scan Primitives for GPU Computing”, but discarding the use of flags, since I’d be able to determine the boundaries of my segments. Is this a reasonable approach?
thanks!
I may be misunderstanding your problem, but if you have fixed-length segments then there are many opportunities to optimize these small array operations.
One approach is to be entirely warp-centric and load 320 float2’s – 10 “rows” of 32 float2’s. Then transpose the 10x32 in-register array (via shared mem) so that each 10-float2 subarray ends up in a single lane. Then just add them up and write the result out. Avoid bank conflicts, etc. (I’m not sure I understand your ‘5D’ requirement.)
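A rough sketch of what I mean (untuned, bank conflicts ignored; it assumes the number of segments is a multiple of 32, and the names are just placeholders):

```cpp
// One warp of 32 threads handles 32 segments of 10 float2's (320 values).
// Loads are coalesced; the "transpose" happens through shared memory so
// that each lane ends up owning one whole segment.
__global__ void warp_segment_sum(const float2* __restrict__ in,
                                 float2* __restrict__ out)
{
    __shared__ float2 tile[10 * 32];          // 10 "rows" of 32 float2's

    int lane = threadIdx.x;                   // blockDim.x == 32 assumed
    int warp_base = blockIdx.x * 32;          // first segment owned by this warp
    const float2* src = in + warp_base * 10;  // this warp's 320 float2's

    // Coalesced load of 10 rows of 32 consecutive float2's.
    for (int row = 0; row < 10; ++row)
        tile[row * 32 + lane] = src[row * 32 + lane];
    __syncwarp();

    // After the "transpose", lane L's segment is tile[L*10 .. L*10+9].
    float2 acc = make_float2(0.0f, 0.0f);
    for (int i = 0; i < 10; ++i) {
        acc.x += tile[lane * 10 + i].x;
        acc.y += tile[lane * 10 + i].y;
    }
    out[warp_base + lane] = acc;              // one sum per lane
}
```

This would be launched as warp_segment_sum<<<num_segments / 32, 32>>>(d_in, d_out).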
Another, more exotic and maybe slower, approach would be to load an 800-float2 subarray (4x200) into 25 float2’s per lane (50 registers per lane) and leverage SHFL to avoid using shared memory.
This would require writing a monolithic and unrolled reduction routine that handles the irregular layout. The performance idea is that if you’re OK with the sum total only being in the last (10th) register you can get clever with pipelining inter-lane additions by using SHFL, SELP and lane masks.
row0: 10:10:10:2
row1: 8:10:10:4
row2: 6:10:10:6
...
row24: ...
I’m dubious this would be faster than transposing but it’s an interesting thinking problem.
One last warp-centric idea: why not just perform a traditional prefix sum across your entire problem (or 4x200 float2’s) and then subtract the prefix value at the previous sub-segment’s 10th float2 from the prefix value at the current segment’s 10th float2? You can do all of this in one warp - 25 warp prefix sums with the last lane as input to the next prefix sum. This should be very fast on a device that supports SHFL… 300-400 ops. But transposing is probably going to be even faster.
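For reference, the warp-wide inclusive scan part would look roughly like this (CUDA 9+ *_sync shuffles assumed, older toolkits would use plain __shfl_up; names are placeholders). The per-segment totals then come from subtracting the running total at the previous segment boundary:

```cpp
// Warp-wide inclusive prefix sum of one float2 per lane using shuffles.
// Sketch only: extracting the per-segment sums from the scan result is
// left to the caller, as described above.
__device__ float2 warp_inclusive_scan(float2 v)
{
    const unsigned full_mask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        float x = __shfl_up_sync(full_mask, v.x, offset);
        float y = __shfl_up_sync(full_mask, v.y, offset);
        if (lane >= offset) {   // lanes below 'offset' keep their value
            v.x += x;
            v.y += y;
        }
    }
    return v;
}
```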
Doing many fixed-length reductions is equivalent to summing the rows (or columns) of a matrix.
If you can arrange it so you are summing columns (i.e. arrange your data in column-major order), you can have a very efficient sum-of-columns using a trivial kernel with a for-loop.
(I think allanmac said this ^^ already.)
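Something along these lines (a rough sketch, one thread per output sum; it assumes the data is laid out so that element i of every segment is contiguous in memory, which is what makes the loads coalesced):

```cpp
// Trivial column-sum kernel sketch: one thread per segment sum.
// Layout assumption: element i of segment 'seg' lives at
// in[i * num_segments + seg], i.e. the desired sums are column sums of a
// row-major seg_len x num_segments matrix. At each loop iteration adjacent
// threads then read adjacent addresses (coalesced).
__global__ void sum_segments(const float2* __restrict__ in,
                             float2* __restrict__ out,
                             int seg_len, int num_segments)
{
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg >= num_segments) return;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int i = 0; i < seg_len; ++i) {
        acc.x += in[i * num_segments + seg].x;
        acc.y += in[i * num_segments + seg].y;
    }
    out[seg] = acc;
}
```

Launched with one thread per segment, e.g. sum_segments<<<(num_segments + 255) / 256, 256>>>(d_in, d_out, S, num_segments).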
Summing either rows or columns of a matrix can also be done using CUBLAS. These generally involve some extra “wasted” arithmetic, but since your reduction problem is hugely memory bound, it may not matter much.
http://stackoverflow.com/questions/17862078/cuda-code-for-sum-of-rows-of-a-matrix-too-slow/17864583#17864583
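For example, the row sums of an m x n column-major matrix A are just the matrix-vector product A * ones(n). A rough sketch with cublasSgemv (error checking omitted, names are mine; the complex/float2 case would use cublasCgemv with cuComplex data):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Row sums of an m x n column-major matrix via GEMV with a vector of ones.
// (Pass CUBLAS_OP_T instead to get column sums.)
void row_sums_cublas(cublasHandle_t handle,
                     const float* d_A,   // device pointer, m x n, column-major
                     float* d_sums,      // device pointer, m row sums
                     int m, int n)
{
    // Vector of ones on the device.
    std::vector<float> h_ones(n, 1.0f);
    float* d_ones = nullptr;
    cudaMalloc(&d_ones, n * sizeof(float));
    cudaMemcpy(d_ones, h_ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // sums = 1.0 * A * ones + 0.0 * sums
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m,
                d_ones, 1, &beta, d_sums, 1);

    cudaFree(d_ones);
}
```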
Thank you for the replies; I wasn’t aware of the SHFL approach.
Regarding the 5D shape of the problem, it’s something along the lines of [4, B, C, T, S], where B, C, T and S are configurable dimension sizes. I want to go from a 5D [4, B, C, T, S] to a 4D [4, B, C, T] by summing over the S dimension. As txbob points out, it’s essentially summing up matrix rows.
allanmac and txbob, as I understand it you’re advocating a straightforward summation over each fixed-length segment. This seems sensible for reasonably small segments (say 10 to 100 values). Would you advocate this approach if S were larger (say 10,000 to 100,000 elements)?
For larger segments, I would expect allanmac’s suggestion of a prefix sum followed by a subtraction at each segment boundary to win out, because it’s a parallel reduction rather than a serial summation. Is this correct? I also wonder if there could be floating-point precision issues from performing a prefix sum over a large amount of data?