I have a large number of data vectors (1M or even 10M vectors)
float data[1024];
I need to sort the elements in each vector. (I.e., i am NOT ordering the vectors relative to each other – I am sorting the elements of each vector independently of the other vectors).
What is the recommended algorithm for sort in this case for GPU? Is there sample code that can be downloaded?
This would take considerable effort, but 1024 values per thread block is a perfect size for the radix sort I wrote. It’s open source and has very detailed documents: http://www.moderngpu.com/sort/mgpusort.html
What you would need to do is ignore the first two phases (the count/upsweep and the histogram/reduce parts) and basically copy my sort pass and hack on it to solve your particular problem. The sort pass docs are here: http://www.moderngpu.com/sort/sort.html
If your values are 32 bits each, this is what i would do:
copy my 2- and 3-bit sort code. You would use 10 3-bit sorts and a 2-bit sort to fully sort the values. All you would do is launch a grid of threadblocks that are 128 threads/block. Each thread loads 8 values from memory. Follow the code I already have to transpose the elements from strided order to thread order. Then call SortAndScatter from sortgeneric.cu 11 times, each time changing the bits you are sorting.
You’d really need to loop the 3-bit sorts to keep the instruction count down. Doing this should not be very hard… it is certainly easier than writing your own bitonic sort. Is it faster than bitonic? Probably a lot faster. Bitonic is O(n log^2 n)… So 1024 elements have like 1024 * 10 * 10 operations. Radix sort is O(k n)… you have 11 passes here so it’s 1024 * 11. It looks much more efficient, but the devil is in the details. But either way… this would sort at an incredible rate. Guessing you could sort around 4 billion values per second (grouped into vectors of size of 1024) this way.
I might add this functionality to my library (probably for sizes 512, 1024, 2048), but I won’t have the time this month.
Just as an fyi to anyone that is interested; I implemented this and get about 2.4 Billion values per second for a full 32 bit sort on a GTX570. (1024 values per block).
For 2048 values per block (NUM_THREADS = 256) I get 2.1 Billion /sec and for 512 values per block (NUM_THREADS = 64) I get 2.0 Billion / sec.
Regarding the decrease in performance for 512 values per block my guess is that generally 64 threads is too small for optimal performance, perhaps there is a better way of supporting 512 values per block than changing the number of threads, but changing the VALUES_PER_THREAD to 4 from 8 caused a launch failure, the cause of which I haven’t yet tracked down.