I was curious about what algorithms people use here to sort data on the GPU. The bitonic sort example NVIDIA proposes in the template projects only works for n elems = n threads and as such has some serious limitations (max 512 elems to sort, and then only 16 registeres available per thread).
Has anybody tackled the problem yet? I have a big array of 1024-2048 elems I want to sort optimally (some variations of bin sort-bubble sort-selection sort I tried we’re far from satisfactory with performance /6 or /10 compared to bitonic)
I read about the adaptive bitonic sort but cannot find some meta code or implementation?