The authors of http://www.cse.chalmers.se/~billeter/pub/pp/ claim to have outperformed CUDPP 1.0 by a noticeable margin on the scan(), compact() and sort() operations.
The CUDPP sort implementation is commented out in CUDPP 2.0. Instead, the library forwards to thrust::sort. I restored the CUDPP sort and benchmarked it against the B40C sort: http://www.moderngpu.com/sort/mgpusort.html
I will have comparative benchmarks for segmented scan in a couple of days, which should be interesting.
@sean do you implement 4 bits per pass, compared to the 2 bits per pass in chag:pp (IIRC) (http://www.cse.chalmers.se/~billeter/pub/pp/)?
I implement up to six bits per pass. I benchmark the timing of each pass width, then find the optimal sequence of passes for a full sort of keys anywhere from 1 to 32 bits wide. The six-bit pass is quite a bit faster than the others, though. My algorithm description gets into the math of all that.
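For anyone who wants to play with the "optimal path" idea, here is a minimal sketch of one way it could be done, assuming the search is a simple dynamic program over measured per-pass costs. This is not the actual MGPU code: the function name `plan_passes` and the numbers in `example_cost` are hypothetical placeholders, not real benchmark results.

```python
def plan_passes(key_bits, pass_cost):
    """Pick pass widths that sum to key_bits with the lowest total cost.

    pass_cost maps a pass width in bits to its measured cost (any unit).
    Returns (total_cost, list_of_widths). Assumes a 1-bit pass is available
    so that every key width is reachable.
    """
    inf = float("inf")
    best = [0] + [inf] * key_bits      # best[b] = cheapest way to sort b bits
    choice = [0] * (key_bits + 1)      # width of the last pass in that plan
    for bits in range(1, key_bits + 1):
        for width, cost in pass_cost.items():
            if width <= bits and best[bits - width] + cost < best[bits]:
                best[bits] = best[bits - width] + cost
                choice[bits] = width
    # Walk the recorded choices back to recover the sequence of passes.
    plan = []
    bits = key_bits
    while bits > 0:
        plan.append(choice[bits])
        bits -= choice[bits]
    return best[key_bits], plan


# Hypothetical costs per pass width (made-up numbers, for illustration only).
example_cost = {1: 10, 2: 11, 3: 13, 4: 15, 5: 18, 6: 20}

if __name__ == "__main__":
    for key_bits in (7, 16, 32):
        total, plan = plan_passes(key_bits, example_cost)
        print(key_bits, total, plan)
```

With real timings in place of `example_cost`, the same table can be built once for every key width from 1 to 32 and looked up at sort time.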