Opposite of Stream Compaction?


I’m looking for a particular algorithm, which I’ve not had much luck searching for.

The principal idea is that I have data arranged such that there are a number of objects stored in bins:

0 1 2 3 4 5 6 7 8 9 10

    • 2 - 5 - 2 - 1 2 3

This data can only be accessed from one big contiguous block, which I have used a parallel radix sort on the input data to organise it like so:

2 2 4 4 4 4 4 6 6 8 9 9 10 10 10

Now in order to process this data in parallel (this part is distributed), I need to generate the start indices of this data, which should look something like this (apologies if there are any typos in this data):

0 1 2 3 4 5 6 7 8 9 10
0 0 0 2 2 7 7 9 10 11 13

This kind of feels like an inverse stream compaction and I’m fairly sure the key to this is a scan of some sort, but I trying to figure out the exact algorithm that might be involved.

I’ll possibly figure this out moments after I’ve posted this, but any suggestions would be most welcome!


Yes this is a standard trick. I’ve long since switched to another technique (more on that in a moment), so I forget all the intricacies. You should be able to figure them out from this thrust example which does exactly what you are asking about: http://code.google.com/p/thrust/source/browse/examples/bucket_sort2d.cu

In my application (MD) fast atomics on Fermi (even faster on Kepler) I find that an atomic op version of this is significantly faster. I just run one thread per element, determine the bin, atomicAdd to that bin’s size counter, and then write the data to the appropriate location in the bin’s data (location returned by the atomicAdd). Due to indeterministic thread execution, items end up in a different order each time the kernel is run. This is not a huge problem for me.

Brilliant, just what I was looking for. Thanks.

Interesting that you get faster results with atomics, that’s really useful to know. I guess it probably depends on the data set to a certain extent? In the case of a lot of bin collisions I’m guessing that this technique may not prove to be as optimal, however in my case the data is fairly spread out so I would imagine that I’ll see a similar performance boost. The indeterminism might be an issue, but maybe it’s possible to run a local sort on the results to fix this and still get an overall speed up.

I’ll give it a try.