Which algorithm for sorting many instances of vectors of length 1024? GPU Sorting

jm1 · October 13, 2011, 9:07pm

I have a large number of data vectors (1M or even 10M vectors)
float data[1024];

I need to sort the elements in each vector. (I.e., i am NOT ordering the vectors relative to each other – I am sorting the elements of each vector independently of the other vectors).

What is the recommended algorithm for sort in this case for GPU? Is there sample code that can be downloaded?

thanks

bunnyFair · October 14, 2011, 12:54am

I saw an interesting concurrent sort the other day called Bitonic. Since we’re all about parallel here, worth a try huh?

SeanB · October 14, 2011, 6:34am

This would take considerable effort, but 1024 values per thread block is a perfect size for the radix sort I wrote. It’s open source and has very detailed documents:
http://www.moderngpu.com/sort/mgpusort.html

What you would need to do is ignore the first two phases (the count/upsweep and the histogram/reduce parts) and basically copy my sort pass and hack on it to solve your particular problem. The sort pass docs are here:
http://www.moderngpu.com/sort/sort.html

If your values are 32 bits each, this is what i would do:

copy my 2- and 3-bit sort code. You would use 10 3-bit sorts and a 2-bit sort to fully sort the values. All you would do is launch a grid of threadblocks that are 128 threads/block. Each thread loads 8 values from memory. Follow the code I already have to transpose the elements from strided order to thread order. Then call SortAndScatter from sortgeneric.cu 11 times, each time changing the bits you are sorting.

You’d really need to loop the 3-bit sorts to keep the instruction count down. Doing this should not be very hard… it is certainly easier than writing your own bitonic sort. Is it faster than bitonic? Probably a lot faster. Bitonic is O(n log^2 n)… So 1024 elements have like 1024 * 10 * 10 operations. Radix sort is O(k n)… you have 11 passes here so it’s 1024 * 11. It looks much more efficient, but the devil is in the details. But either way… this would sort at an incredible rate. Guessing you could sort around 4 billion values per second (grouped into vectors of size of 1024) this way.

I might add this functionality to my library (probably for sizes 512, 1024, 2048), but I won’t have the time this month.

sean

eelsen · December 8, 2011, 1:53am

This would take considerable effort, but 1024 values per thread block is a perfect size for the radix sort I wrote. It’s open source and has very detailed documents:

http://www.moderngpu.com/sort/mgpusort.html

What you would need to do is ignore the first two phases (the count/upsweep and the histogram/reduce parts) and basically copy my sort pass and hack on it to solve your particular problem. The sort pass docs are here:

http://www.moderngpu.com/sort/sort.html

If your values are 32 bits each, this is what i would do:

copy my 2- and 3-bit sort code. You would use 10 3-bit sorts and a 2-bit sort to fully sort the values. All you would do is launch a grid of threadblocks that are 128 threads/block. Each thread loads 8 values from memory. Follow the code I already have to transpose the elements from strided order to thread order. Then call SortAndScatter from sortgeneric.cu 11 times, each time changing the bits you are sorting.

You’d really need to loop the 3-bit sorts to keep the instruction count down. Doing this should not be very hard… it is certainly easier than writing your own bitonic sort. Is it faster than bitonic? Probably a lot faster. Bitonic is O(n log^2 n)… So 1024 elements have like 1024 * 10 * 10 operations. Radix sort is O(k n)… you have 11 passes here so it’s 1024 * 11. It looks much more efficient, but the devil is in the details. But either way… this would sort at an incredible rate. Guessing you could sort around 4 billion values per second (grouped into vectors of size of 1024) this way.

I might add this functionality to my library (probably for sizes 512, 1024, 2048), but I won’t have the time this month.

sean

Just as an fyi to anyone that is interested; I implemented this and get about 2.4 Billion values per second for a full 32 bit sort on a GTX570. (1024 values per block).

For 2048 values per block (NUM_THREADS = 256) I get 2.1 Billion /sec and for 512 values per block (NUM_THREADS = 64) I get 2.0 Billion / sec.

Regarding the decrease in performance for 512 values per block my guess is that generally 64 threads is too small for optimal performance, perhaps there is a better way of supporting 512 values per block than changing the number of threads, but changing the VALUES_PER_THREAD to 4 from 8 caused a launch failure, the cause of which I haven’t yet tracked down.

Topic		Replies	Views
sorting on the GPU CUDA Programming and Performance	2	21422	May 20, 2007
How can i sort an array with CUDA? Who can tell me? CUDA Programming and Performance	5	7248	June 26, 2008
Sorting float2 and float arrays CUDA Programming and Performance	6	6952	September 6, 2009
Sort on GPU Need some help to use sorts... CUDA Programming and Performance	15	41897	June 19, 2008
Btionic Sort in sdk Demo what's the bottleneck? CUDA Programming and Performance	3	2600	June 26, 2008
Radix Sort CUDA Programming and Performance	2	898	April 7, 2014
GPU sort CUDA Programming and Performance	6	2938	September 5, 2008
Random memory access and += Advice needed CUDA Programming and Performance	4	2404	August 17, 2008
K largest elements for vector of length N CUDA Programming and Performance	16	3961	September 30, 2009
Branchless Bitonic Sort with float keys CUDA Programming and Performance	3	850	December 11, 2021

Which algorithm for sorting many instances of vectors of length 1024? GPU Sorting

Related topics