Sort very small array in shared with 1 warps

I recently encountered the necessity to sort, as fast I can, very small arrays (i.e. with less than 5000 elements) stored in shared memory. As restrictions, I have only 1 warp available and I can’t use extra memory. Sequential sorts are ok but I am wondering if I can obtain better results with 32 threads. From what I can find online, bitonic sort and radix sort are the most used on GPU, usually to sort big arrays stored in global. What do you suggest?

We seem to have different notions of “very small” :-) When I saw that in question title, I was going to suggest a sorting network. But that isn’t the way to go for more than 20 elements or so.

Is there any structure in the data that could help us? For example: Are the elements integers? Are element values confined to a narrow range? Is the array content completely random or mostly sorted?

Your questions make sense. The elements are unsigned integers, a reasonable upper bound for the values is 100000 and they are random.

I don’t know of any parallel library implementations subject to the restrictions of 1 warp and no extra memory use. Thrust can do a per-thread sequential sort, but I think it will still require significant (O(N)) temporary memory allocation.

My best guess would be a set of sorts of tractable size followed by a set of merges.

a warp-wide sort is here but it only sorts 32 values. However it could be the precursor to a sequence of merges. I’m not sure the merges can be done with zero additional memory, however.

I think an odd-even sort can do sorting with zero extra memory. It should be straightforward to implement at the warp level. There is a sample odd-even sort in the CUDA sample codes. That might be the best fit to use 32 threads with zero extra memory. The most optimal flavor of this may be the batcher’s odd-even mergesort

A bit of googling suggests there are both in-place radix and in-place merges possible. I’m not aware of any GPU implementations or parallel implementations.

Later: I think you could adapt the oddEvenMergeSortShared kernel in the sample code I mentioned: /usr/local/cuda/samples/6_Advanced/sortingNetworks/ to do this. Roughly speaking, you would convert the sync statements to __syncwarp(), and insert a warp-stride loop after each sync statement, before the remaining body of the “odd” and “even” phases. This would also be a bit easier if you could pad your shared data to be a multiple of 32.

I think in-place radix sort and odd-even sort/mergesort are the most reasonable approaches.

Edit: Thank you for the reference, that is definitely a good starting point.