I need an efficient way to sort a uint array (in parallel, if possible) at the thread-block level, i.e. one array per thread block. The arrays aren’t big: let's say 100 to 1000 elements each.
I think CUDPP has a sorting function for small arrays that uses radix sort, which should be faster than bitonic sort. With so few elements, though, the difference may be small, so you will have to try it and measure.
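For comparison, here is a minimal sketch of the bitonic-sort baseline done entirely in shared memory, one array per block. It does not use CUDPP. The names `blockBitonicSort`, `sortArraysOnDevice`, `CAPACITY` and `THREADS` are made up for this example, and it assumes each array has at most 1024 elements and that the arrays are stored back to back.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Assumed limits for this sketch (not from any library).
constexpr int CAPACITY = 1024;  // shared-memory slots per block, power of two
constexpr int THREADS  = 256;   // threads per block

// Each block sorts its own array of n keys (n <= CAPACITY), ascending, in place.
// Array b starts at data + b * stride.
__global__ void blockBitonicSort(unsigned int *data, int n, int stride)
{
    __shared__ unsigned int s[CAPACITY];
    unsigned int *arr = data + blockIdx.x * stride;

    // Load the keys and pad with UINT_MAX so the padding sorts to the end.
    for (int i = threadIdx.x; i < CAPACITY; i += blockDim.x)
        s[i] = (i < n) ? arr[i] : UINT_MAX;
    __syncthreads();

    // Standard bitonic network: k is the size of the sequences being merged,
    // j is the distance between the two elements of a compare-exchange.
    for (int k = 2; k <= CAPACITY; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            for (int i = threadIdx.x; i < CAPACITY; i += blockDim.x) {
                int p = i ^ j;                  // partner index
                if (p > i) {                    // each pair handled once
                    bool ascending = (i & k) == 0;
                    unsigned int a = s[i], b = s[p];
                    if ((a > b) == ascending) { s[i] = b; s[p] = a; }
                }
            }
            __syncthreads();                    // finish this step before the next
        }
    }

    // Write back only the real keys; the padding is dropped.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        arr[i] = s[i];
}

// Host helper (hypothetical): sorts numArrays arrays of n keys each, in place.
// Returns false on a size mismatch or a CUDA error.
bool sortArraysOnDevice(std::vector<unsigned int> &host, int numArrays, int n)
{
    if (n > CAPACITY || host.size() != size_t(numArrays) * n) return false;
    unsigned int *dev = nullptr;
    size_t bytes = host.size() * sizeof(unsigned int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    blockBitonicSort<<<numArrays, THREADS>>>(dev, n, n);
    // This blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Calling `sortArraysOnDevice(keys, numArrays, n)` launches one block per array. Note that the network always runs over all 1024 slots, so a 100-element array costs as much as a 1000-element one. If your sizes cluster near the low end, a smaller `CAPACITY` (128 or 256) may be worth timing against the radix sort.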