In-place Sorting Algorithms on CUDA?


I’ve been investigating sorting algorithms and I’m going to be hitting the GPU memory limit, so I’d very much like to use an in-place algorithm if possible, but I’m concerned I’ll take a big hit in speed over using an out-of-place radix sort for example.

I’m primarily concerned about sorting (int, int) key-value pairs, and the data I will be sorting will already be reasonably ordered.

It seems the best in-place sorting currently uses the bitonic sort:

Are there any other sorting algorithms (or combinations of algorithms) that I should consider that has lower complexity than the bitonic sort, but also lower memory usage than the radix sort?


I mention this algorithm without any regard for its suitability to a CUDA implementation (I haven’t really thought too much about it). The smoothsort:

It is worst case O(n log n) and in the best case of already ordered data it is O(n) with a smooth transition between the two as the input becomes more ordered. Its additional memory requirements are very low although it isn’t strictly in-place.

It is might be possible to implement efficiently with CUDA, but it is not likely to be easy.

You can also try GPU Quicksort.

A good paper on gpu sorting is the following: