sorting on the GPU

I was curious about what algorithms people use here to sort data on the GPU. The bitonic sort example NVIDIA proposes in the template projects only works for n elems = n threads and as such has some serious limitations (max 512 elems to sort, and then only 16 registeres available per thread).

Has anybody tackled the problem yet? I have a big array of 1024-2048 elems I want to sort optimally (some variations of bin sort-bubble sort-selection sort I tried we’re far from satisfactory with performance /6 or /10 compared to bitonic)

I read about the adaptive bitonic sort but cannot find some meta code or implementation?

To sort larger arrays you would have each block sort a subset of the array in shared memory, and then merge the sorted sub-arrays. Unfortunately doing a merge efficiently in parallel is not easy.

Here’s a reference on adaptive bitonic sort:
[url=“Institute of Computer Science II”]Institute of Computer Science II

GPUsort uses a similar algorithm and you can download their implementation here:
[url=“GPUSORT”]http://gamma.cs.unc.edu/GPUSORT/index.html[/url]

We have an efficient CUDA implementation of radix sort which should be in a future release of the SDK.

I did implement a “load-balanced” radix sort using CUDA. However it’s pretty slow since I am new to CUDA. But can u suggest what kind of radix sort will be used in the next release? Just want to make sure I am digging into the right way. Thanks!