Sort very small array in shared with 1 warps

95A31 · November 8, 2020, 12:32am

Hello,
I recently encountered the necessity to sort, as fast I can, very small arrays (i.e. with less than 5000 elements) stored in shared memory. As restrictions, I have only 1 warp available and I can’t use extra memory. Sequential sorts are ok but I am wondering if I can obtain better results with 32 threads. From what I can find online, bitonic sort and radix sort are the most used on GPU, usually to sort big arrays stored in global. What do you suggest?

njuffa · November 8, 2020, 12:50am

We seem to have different notions of “very small” :-) When I saw that in question title, I was going to suggest a sorting network. But that isn’t the way to go for more than 20 elements or so.

Is there any structure in the data that could help us? For example: Are the elements integers? Are element values confined to a narrow range? Is the array content completely random or mostly sorted?

95A31 · November 8, 2020, 1:10am

Your questions make sense. The elements are unsigned integers, a reasonable upper bound for the values is 100000 and they are random.

Robert_Crovella · November 8, 2020, 1:42am

I don’t know of any parallel library implementations subject to the restrictions of 1 warp and no extra memory use. Thrust can do a per-thread sequential sort, but I think it will still require significant (O(N)) temporary memory allocation.

My best guess would be a set of sorts of tractable size followed by a set of merges.

a warp-wide sort is here but it only sorts 32 values. However it could be the precursor to a sequence of merges. I’m not sure the merges can be done with zero additional memory, however.

I think an odd-even sort can do sorting with zero extra memory. It should be straightforward to implement at the warp level. There is a sample odd-even sort in the CUDA sample codes. That might be the best fit to use 32 threads with zero extra memory. The most optimal flavor of this may be the batcher’s odd-even mergesort

A bit of googling suggests there are both in-place radix and in-place merges possible. I’m not aware of any GPU implementations or parallel implementations.

Later: I think you could adapt the oddEvenMergeSortShared kernel in the sample code I mentioned: /usr/local/cuda/samples/6_Advanced/sortingNetworks/oddEvenMergeSort.cu to do this. Roughly speaking, you would convert the sync statements to __syncwarp(), and insert a warp-stride loop after each sync statement, before the remaining body of the “odd” and “even” phases. This would also be a bit easier if you could pad your shared data to be a multiple of 32.

95A31 · November 8, 2020, 2:08am

I think in-place radix sort and odd-even sort/mergesort are the most reasonable approaches.

Edit: Thank you for the reference, that is definitely a good starting point.

Topic		Replies	Views
sorting on the GPU CUDA Programming and Performance	2	21452	May 20, 2007
Sort 32 element in a warp CUDA Programming and Performance	5	40	May 6, 2025
Branchless Bitonic Sort with float keys CUDA Programming and Performance	3	882	December 11, 2021
How can i sort an array with CUDA? Who can tell me? CUDA Programming and Performance	5	7265	June 26, 2008
Thread Block level Sort Sorting small arrays CUDA Programming and Performance	3	1098	January 24, 2011
fast sorting of smal arrays WinWidth*WinHeight arrays of size == 32 CUDA Programming and Performance	6	2383	February 5, 2009
Random memory access and += Advice needed CUDA Programming and Performance	4	2413	August 17, 2008
Sorting small arrays Sorting huge amounts of small arrays. CUDA Programming and Performance	0	9092	January 24, 2011
Faulty sort algorithm. Please help! Odd even sort CUDA Programming and Performance	9	11536	July 26, 2008
Which algorithm for sorting many instances of vectors of length 1024? GPU Sorting CUDA Programming and Performance	3	880	December 8, 2011

Sort very small array in shared with 1 warps

Related topics