I need a segmented reduction to sum fixed-length segments. For example, I have an array of N = 200 float2 values, representing complex numbers, and I wish to sum each sub-segment of 10 values to produce N/10 == 20 sums. In fact, my data is 5D and I wish to reduce over the last dimension to produce a 4D result.
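To make the desired result concrete, here is a CPU reference of the operation I'm after (names like `segmented_sum` are mine, not from any library; `std::complex<float>` stands in for float2):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// CPU reference: sum each fixed-length segment of seg_len complex values.
// The segment an element belongs to is just i / seg_len, so no key or
// flag array is needed to describe the segments.
std::vector<std::complex<float>>
segmented_sum(const std::vector<std::complex<float>>& in, std::size_t seg_len) {
    assert(seg_len > 0 && in.size() % seg_len == 0);
    std::vector<std::complex<float>> out(in.size() / seg_len,
                                         std::complex<float>(0.0f, 0.0f));
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i / seg_len] += in[i];  // accumulate into this element's segment
    return out;
}
```

For the case above, `segmented_sum(data, 10)` on 200 values yields 20 sums; the 5D case is the same with `seg_len` equal to the extent of the last dimension.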
I’m aware of reduce-by-key implementations which, as I understand, generally require N integer keys for N data values. By contrast, libraries such as moderngpu implement segmented reduction where variable-length segments can be specified.
Ideally, I’d like to avoid reduce-by-key in order to cut memory usage. moderngpu reduces that overhead to some extent, but I’d imagine it sacrifices some performance due to its support for variable-length segments, and it still requires memory to describe segment lengths. Is this accurate?
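One observation on the memory point: with fixed-length segments the key is a pure function of the index (key(i) = i / seg_len), so a reduce-by-key need not materialize a key array at all. Thrust, for instance, accepts fancy iterators (a `transform_iterator` over a `counting_iterator`) in the key position for exactly this reason. A serial sketch of the idea, with names of my own choosing:

```cpp
#include <cstddef>
#include <vector>

// Reduce-by-key where the key is computed from the index instead of
// read from a stored key array; adjacent equal keys are merged. This
// mirrors what a fancy-iterator key sequence gives you on the GPU:
// the N keys never occupy global memory.
template <typename T, typename KeyFn>
std::vector<T> reduce_by_computed_key(const std::vector<T>& in, KeyFn key) {
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i == 0 || key(i) != key(i - 1))
            out.push_back(in[i]);       // first element of a new segment
        else
            out.back() += in[i];        // same segment: accumulate
    }
    return out;
}
```

Usage for the fixed-length case: `reduce_by_computed_key(data, [](std::size_t i) { return i / 10; })`.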
I’m thinking of working forward from Sengupta’s “Scan Primitives for GPU Computing”, but discarding the head flags, since my segment boundaries can be computed directly from the index. Is this a reasonable approach?
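In other words, the flag test “does combining cross a segment boundary?” becomes index arithmetic. A CPU sketch of what I have in mind (this is my own illustration, not code from the paper; on the GPU each inner loop iteration would be one parallel step of a shared-memory tree reduction):

```cpp
#include <cstddef>
#include <vector>

// Flag-free segmented reduction with fixed-length segments. Each
// segment is reduced with the classic halving tree used in GPU
// shared-memory reductions; segment boundaries come from index
// arithmetic (seg * seg_len) rather than a stored flag array.
std::vector<float> segmented_reduce(std::vector<float> a, std::size_t seg_len) {
    std::size_t num_segs = a.size() / seg_len;
    std::vector<float> out(num_segs);
    for (std::size_t seg = 0; seg < num_segs; ++seg) {
        std::size_t base = seg * seg_len;
        std::size_t len = seg_len;
        while (len > 1) {
            std::size_t half = (len + 1) / 2;  // rounding up handles odd lengths (10 -> 5 -> 3 -> 2 -> 1)
            for (std::size_t i = 0; i < len - half; ++i)
                a[base + i] += a[base + i + half];  // one parallel step per stride on a GPU
            len = half;
        }
        out[seg] = a[base];  // segment sum ends up at the segment head
    }
    return out;
}
```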