I was curious about what algorithms people use here to sort data on the GPU. The bitonic sort example NVIDIA proposes in the template projects only works for n elems = n threads and as such has some serious limitations (max 512 elems to sort, and then only 16 registeres available per thread).
Has anybody tackled the problem yet? I have a big array of 1024-2048 elems I want to sort optimally (some variations of bin sort-bubble sort-selection sort I tried we’re far from satisfactory with performance /6 or /10 compared to bitonic)
I read about the adaptive bitonic sort but cannot find some meta code or implementation?
To sort larger arrays you would have each block sort a subset of the array in shared memory, and then merge the sorted sub-arrays. Unfortunately doing a merge efficiently in parallel is not easy.
I did implement a “load-balanced” radix sort using CUDA. However it’s pretty slow since I am new to CUDA. But can u suggest what kind of radix sort will be used in the next release? Just want to make sure I am digging into the right way. Thanks!