Is there a block equivalent to cub::DeviceSegmentedReduce?

One possible approach:

Given a value array:

{0, 2, 1, -2, 0, 3, 4}

and a flag array which marks the end of each segment (inclusive):

{0, 1, 1, 0, 1, 0, 1}

we could perform an ordinary prefix sum on the value array:

{0, 2, 3, 1, 1, 4, 8}

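and then compute the segment sums by taking the prefix-sum value at each flagged position and subtracting the prefix-sum value at the previous flagged position (or 0 for the first segment):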

prefix sum:                {0, 2, 3, 1, 1, 4, 8}
flags:                     {0, 1, 1, 0, 1, 0, 1}
segment sums:              {   2, 1,   -2,    7}
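As a sanity check on the arithmetic above, here is a minimal sequential sketch in host C++. It is only a reference for the scan-and-subtract idea, not a block-level implementation; the function name `segment_sums_from_prefix` is made up for illustration, and it assumes `flags` has the same length as `values`.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of cub): sequential reference for the
// scan-and-subtract idea. flags[i] != 0 marks the last element (inclusive)
// of a segment. Assumes flags.size() == values.size().
std::vector<int> segment_sums_from_prefix(const std::vector<int>& values,
                                          const std::vector<int>& flags) {
    // Ordinary inclusive prefix sum over the values.
    std::vector<int> prefix(values.size());
    std::partial_sum(values.begin(), values.end(), prefix.begin());

    std::vector<int> sums;
    int previous_end = 0;  // prefix value at the previous segment end (0 before the first)
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (flags[i]) {
            // Segment sum = prefix at this segment end minus prefix at the previous one.
            sums.push_back(prefix[i] - previous_end);
            previous_end = prefix[i];
        }
    }
    return sums;
}
```

Called with the value and flag arrays from the example, this returns {2, 1, -2, 7}.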

It might not be obvious how to get to the segment sums in parallel. You could do a typical parallel stream-compaction operation on the prefix sums, selecting the values indicated by the flag array, then prepend 0 to the stream-compacted array and take the adjacent difference.

prefix sum:                {0, 2, 3, 1, 1, 4, 8}
flags:                     {0, 1, 1, 0, 1, 0, 1}
stream-compacted:          {2, 3, 1, 8}
prepend 0:                 {0, 2, 3, 1, 8}
adjacent diff:             {   2, 1,-2, 7}

Therefore, given a parallel prefix sum implementation at the block level (which cub provides), used for both the prefix sum over the values and the stream-compaction op, that could be a roadmap for building a block-level segmented sum out of other “primitives”: prefix sum, indexed copy, and adjacent difference.
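To make the roadmap concrete, here is a sequential host C++ sketch of the three primitives chained together. It is a stand-in for the data flow only: in a real kernel the two scans would presumably be done with cub::BlockScan and the arrays would live in registers or shared memory. The function name `segmented_sum_via_primitives` and the `slot` array are made up for illustration, and `flags` is assumed to have the same length as `values`.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of cub): sequential stand-in for the
// prefix sum -> indexed copy -> adjacent difference pipeline.
// Assumes flags.size() == values.size().
std::vector<int> segmented_sum_via_primitives(const std::vector<int>& values,
                                              const std::vector<int>& flags) {
    const std::size_t n = values.size();

    // 1) Inclusive prefix sum of the values.
    std::vector<int> prefix(n);
    std::partial_sum(values.begin(), values.end(), prefix.begin());

    // 2) Stream compaction. An exclusive prefix sum of the flags gives each
    //    flagged element its slot in the compacted output.
    std::vector<std::size_t> slot(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        slot[i] = running;
        running += flags[i] ? 1 : 0;
    }
    std::vector<int> compacted(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            compacted[slot[i]] = prefix[i];  // indexed copy
        }
    }

    // 3) Adjacent difference. std::adjacent_difference copies the first
    //    element unchanged, which is the same as prepending 0.
    std::vector<int> sums(compacted.size());
    std::adjacent_difference(compacted.begin(), compacted.end(), sums.begin());
    return sums;
}
```

For the example arrays, the intermediate compacted array is {2, 3, 1, 8} and the result is {2, 1, -2, 7}. Every step is either a scan or an independent per-element operation, which is why the pipeline should map onto a thread block.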

striker159 may know of a more elegant approach.