Block level segmented reduction (reduce by key)

What is the suggested approach for block based segmented reduction (reduce by key) of integer key/value pairs? Where the pairs are already sorted by key, and the value reduction operation is addition or bitwise OR.

I’ve looked at cub flag head/tail and implemented the reduce by key mention here: http://on-demand.gputechconf.com/gtc/2015/presentation/S5617-Duane-Merrill.pdf. However the values are offset to carry along to the next thread which is not what I wanted (unless I hack around an launch an extra thread).

I’ve tried various attempts at identifying head/tail flag inputs in the pairwise reduction called from exclusive/inclusive scans in cub. Partly maybe as I don’t totally understand the cub scan network order, and it seems a more complicated problem perhaps. So I have failed to come up with a solution using the cub primitives so far.

Producing the correct output offsets from a scan of either head or tail flags is simple enough etc, but the main issue seems to be deciding to reduce pairs correctly with respect to segment boundaries.

If I treat the cub scan network as a black box it maybe seems impossible to do with it, as partial reductions in the scan network that reduced across adjacent segments would lose track of the boundary knowledge needed to correctly reduce pairs. And you have no guarantee what order reduction pairs will happen with respect to the head/tails flags positioning.

In the mentioned GTC paper the reduction operation does: { b.offset ? b.value : a.value + b.value, a.offset + b.offset}. Which I guess is using some explicit knowledge of how cubs scan network orders things?

So at first inspired by that I tried to do something equivalent to:

int keys[] = { 2, 2, 3, 3, 5, 5, 5 };
int values[] = { 1, 1, 0, 1, 8, 1, 0 };
int tails[] = { 0, 1, 0, 1, 0, 0, 1 };
int reductions[7];
int scans[7];

scans[0] = tails[0] == 1 ? -1 : 0;
reductions[0] = values[0];

for (int i = 1; i < 7; ++i)
{
	int a = scans[i - 1];
	int b = tails[i];
	int reduction = values[i];

	if (b == 1)
	{
		scans[i] = -(a + b);
		reduction += reductions[i - 1];
	}
	else
	if (a >= 0)
	{
		scans[i] = a + b;
		reduction += reductions[i - 1];
	}
	else
		scans[i] = (-a) + b;

	reductions[i] = reduction;
}

Which I believe won’t work in cub form due to the algorithms dependency on sequential order and the way the parallelised scan network works in cub (I tested a cub version). My googling has not been able to come up with block based solutions specifically - even though I found mention of such functionality intended to be added to cub at some point.