One possible approach:
Given a value array:
{0, 2, 1, -2, 0, 3, 4}
and a flag array which marks the end of each segment (inclusive):
{0, 1, 1, 0, 1, 0, 1}
we could perform an ordinary prefix sum on the value array:
{0, 2, 3, 1, 1, 4, 8}
and then compute the segment sums by taking the prefix-sum values at the flagged positions and subtracting each one from the next:
prefix sum: {0, 2, 3, 1, 1, 4, 8}
flags: {0, 1, 1, 0, 1, 0, 1}
segment sums: {2, 1, -2, 7}
It might not be obvious how to get from the prefix sums to the segment sums. One way is to do a typical parallel stream-compaction operation on the prefix sums, selecting the values indicated by the flag array, then prepend 0 to the stream-compacted array and take the adjacent difference:
prefix sum: {0, 2, 3, 1, 1, 4, 8}
flags: {0, 1, 1, 0, 1, 0, 1}
stream-compacted: {2, 3, 1, 8}
prepend 0: {0, 2, 3, 1, 8}
adjacent diff: {2, 1, -2, 7}
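As a sanity check of the three steps above, here is a minimal sequential sketch in host-side C++. It is only a reference for the arithmetic, not a block-level implementation, and the function name segmented_sums is a hypothetical one chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential reference for: prefix sum -> stream compaction -> adjacent difference.
// "segmented_sums" is a hypothetical helper name, not part of any library.
std::vector<int> segmented_sums(const std::vector<int>& values,
                                const std::vector<int>& flags) {
    assert(values.size() == flags.size());

    // Step 1: ordinary inclusive prefix sum over the values.
    std::vector<int> prefix(values.size());
    std::partial_sum(values.begin(), values.end(), prefix.begin());

    // Step 2: stream compaction, keeping the prefix sums at flagged positions.
    std::vector<int> compacted;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (flags[i]) compacted.push_back(prefix[i]);
    }

    // Step 3: adjacent difference. std::adjacent_difference copies the first
    // element unchanged, which is the same as subtracting a prepended 0.
    std::vector<int> sums(compacted.size());
    std::adjacent_difference(compacted.begin(), compacted.end(), sums.begin());
    return sums;
}
```

Calling segmented_sums({0, 2, 1, -2, 0, 3, 4}, {0, 1, 1, 0, 1, 0, 1}) on the arrays from this post should return {2, 1, -2, 7}.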
Therefore, given a block-level parallel prefix sum implementation (which CUB provides), used for both the prefix sum over the values and the stream compaction, this could be a roadmap for building a block-level segmented sum out of other “primitives”: prefix sum, indexed copy, and adjacent difference.
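To make the "prefix sum used twice" idea concrete, here is a rough host-side C++ sketch of the data-parallel formulation. The scans are written as a plain loop, but in a kernel they would be the part handed to a block-level scan (for example CUB's BlockScan); the two later loops have independent iterations, so each iteration could map to one thread. The name segmented_sums_scatter and the variable names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Data-parallel formulation, emulated sequentially on the host.
// "segmented_sums_scatter" is a hypothetical name used for illustration.
std::vector<int> segmented_sums_scatter(const std::vector<int>& values,
                                        const std::vector<int>& flags) {
    assert(values.size() == flags.size());
    const std::size_t n = values.size();

    // Scan 1: inclusive prefix sum of the values.
    // Scan 2: exclusive prefix sum of the flags, which gives each flagged
    //         element its output slot in the compacted array.
    std::vector<int> prefix(n), slot(n);
    int value_acc = 0, flag_acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        value_acc += values[i];
        prefix[i] = value_acc;
        slot[i] = flag_acc;
        flag_acc += flags[i] ? 1 : 0;
    }

    // Indexed copy (scatter): every flagged element writes to its own slot,
    // so the iterations do not depend on each other.
    std::vector<int> compacted(flag_acc);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) compacted[slot[i]] = prefix[i];
    }

    // Adjacent difference, treating the element before index 0 as 0.
    std::vector<int> sums(compacted.size());
    for (std::size_t k = 0; k < compacted.size(); ++k) {
        sums[k] = compacted[k] - (k > 0 ? compacted[k - 1] : 0);
    }
    return sums;
}
```

For the example arrays, the flag scan gives slots {0, 0, 1, 2, 2, 3, 3}, so the flagged prefix sums 2, 3, 1, 8 land in slots 0 through 3 and the differences come out as {2, 1, -2, 7}.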
striker159 may know of a more elegant approach.