Sorting in CUDA

Is sorting in CUDA worth the trouble?

The last few days I’ve been looking at sort routines in CUDA and am afraid they may not be worth the trouble. Hopefully you can talk me down on this, since there have been some heroic efforts in this area that certainly deserve consideration.

  1. I haven’t seen an example of how to sort a structure of several elements. Even if there were one, I’m thinking it could eat up the 16 KB shared memory limit pretty fast.

  2. I haven’t seen a way to recover the original index from the sorted index. I thought about operating on the prefix scan to back out the shifts performed at each level, but that would take a much better programmer than I am. Appending the initial grid-block-thread index to the keyword could make this much easier, but then I’d need to solve issue 1).

  3. Moving the keywords around after scanning each chunk is pretty wasteful of resources and may drive efficiency way down relative to the CPU. Making the keyword bigger would not help the issue.

  4. I took a look at the radix sort in CUDPP and really appreciate the heroic efforts made in improving its performance, but there still remain limitations on the number of elements and the size of the keywords that just do not come into play in sorts that use recursive linked lists, for example. Also, several hundred lines of code to do what an iterative CPU sort can do in maybe 50 lines makes GPU sorting much more difficult to debug.

Personally, I don’t think this is a limitation of the CUDA architecture, since the CPU is always available to handle things that don’t make sense in CUDA.

Maybe as the technology advances, things like the atomicInc() function could turn this around, but in the meantime I’m having difficulty recommending the CUDA sort routines. Again, I encourage you to talk me down on this.

You might want to take a look at Thrust ( http://code.google.com/p/thrust/ ). It has a pretty nice implementation of sorting that uses radix sort on plain old data and merge sort on more complex structures.

This paper by Satish et al. describes not only the radix sort that was recently included in the new CUDPP release, but also a very clever merge sort that might be better suited to what you want to achieve:

http://gpgpu.org/2009/03/01/designing-effi…r-manycore-gpus

Thrust can sort structs directly (up to some shared memory size limits), but if performance is important I would decompose the structs into separate arrays and perform several key-value sorts. In fact, this is how we currently sort 64-bit types in Thrust. We first sort the lower 32 bits using a key-value (radix) sort to record the necessary permutation info, and then sort the upper 32 bits, again using a key-value sort to track the permutation.

Here’s the (somewhat convoluted) code that handles 64-bit integers and doubles:

http://code.google.com/p/thrust/source/bro…ort_key.inl#148

Also, here’s a small example that shows how one would sort three separate arrays in lexicographical order with Thrust:

http://code.google.com/p/thrust/source/bro…aphical_sort.cu

Here’s some example output:

[codebox]Unsorted Keys

(3,1,9)

(6,8,3)

(7,7,1)

(5,9,9)

(3,2,4)

(5,0,7)

(6,2,8)

(2,3,4)

(9,7,5)

(1,5,0)

(2,9,3)

(7,2,6)

(0,2,1)

(9,8,0)

(3,9,6)

(6,7,3)

(0,3,2)

(6,6,0)

(2,1,6)

(6,2,1)

Sorted Keys

(0,2,1)

(0,3,2)

(1,5,0)

(2,1,6)

(2,3,4)

(2,9,3)

(3,1,9)

(3,2,4)

(3,9,6)

(5,0,7)

(5,9,9)

(6,2,1)

(6,2,8)

(6,6,0)

(6,7,3)

(6,8,3)

(7,2,6)

(7,7,1)

(9,7,5)

(9,8,0)

[/codebox]

If you wrote a kernel that decomposed the relevant fields of your struct into separate arrays with standard C data types (int, float, char, etc.), the approach above would be reasonably efficient. A comparison-based sort might be more efficient if the number of fields is large, but for many cases I bet this approach is a net win.

If you want to try the comparison-based method, just use thrust::sort() with a user-defined comparison function (i.e. one that defines a < operation between structs). If you also want the permutation vector, then use the same trick illustrated in the example with thrust::sort_by_key() and sequence(). We haven’t tested the sorting codes with large structs, so don’t be surprised if it blows up :)

Isn’t it always going to be faster to sort just key/value pairs (or perhaps just the indices), and then re-order the full structures into sorted order in a subsequent pass?

Thanks for the reference. I plan to look at it after I get a bit more comfortable with the Standard Template Library.

That sounds like what I am trying to do. But I may not fully understand what is meant by a key/value pair.

Is the key the thing I am sorting, and the value the index into a vector of structures that contains other items of interest?

In other words: if the “key” is a 32-bit integer representing some property of a vector in the 2-D array, is the “value” the index into the array of vectors whose key is greater than the previous key in the sorted list of key/value pairs?

Yep, the “key” is the only part that affects the sort and the “value” is just something that gets carried along with the key. In the example I use some integer values to keep track of where the sorted keys went. The values can be any data type you want.

If the key array was [6,4,3,5] and the value array was [0,1,2,3] then after calling sort_by_key() the arrays would be [3,4,5,6] and [2,1,3,0]. Another way to look at this is that each key/value at the same location in the array forms a pair (k,v) where k is the key and v is the value. We sort the pairs by key which has the effect of transporting the values to their new locations. In short, thrust::sort_by_key sorts the key array and applies the very same reordering to the value array.

If you can post some code that illustrates your needs, I can show you how to implement it with Thrust.

“Sorting with GPU is worthless”, that is the main idea of this thread’s original post.

Not for me. I see cases where my CPU may be occupied at 100% on different tasks, including sorting large arrays of information.
I would like to be able to put a middle-class GPU in the computer just to relieve the CPU of these tasks, and gain 20% to 50% overall performance by sharing work between the CPU and GPU.

I can’t disclose other information about this project, but gaining 20% or more performance on a server may be really worthwhile!

nbell,

Let me know if I’m following okay:

I’m looking at your [6,4,3,5] example and the code in http://code.google.com/p/thrust/…aphical_sort.cu. If I follow, in line 19 thrust::gather appends the index to create “temp”, which looks like [60,41,32,53]. Then in line 22 thrust::stable_sort_by_key turns “temp” into [32,41,53,60] and returns [2,1,3,0] in “permutation”.

Then in lines 56, 57, and 58 of the code, each set of keys is used with the previous permutation, and line 68 outputs the upper, middle, and lower arrays using the final permutation as an index.

Thanks for your time, I’ll try it!

Aaron

Hi nbell,

It took a while but I finally got to trying out the sort_by_key on my data set.

The only change to “lexicographical_sort.cu” was to use 2 rather than 3 “update_permutations” calls.

The code produces the correct results on my 544k element arrays but it takes 20,105.8 ms. This is way too much time.

The radixSort routine in the CUDA SDK takes only 16.7 ms to sort 1,048,576 values. However, I need the “sort_by_key” feature for my problem.

Is there any way to tune-up thrust to work at comparable speed to the radixSort example?

Thanks,

Aaron

The radixsort example provides four sort routines: two for sorting by (float/integer) keys only, and two for sorting by (float/integer) key/data pairs.

There is no way it should take anywhere near 20 seconds to perform the sort you need. Sorting key-value instead of just key shouldn’t be very much slower.

If you have this:

[codebox]struct pt
{
    float x, y, z;
};

vector<pt> points; // somehow filled with points

// each point is associated with a bin; say a function determineBin()
// exists which figures this out, and we wish to sort the points
// using the bin as the key

// in the STL way, you would do this:
bool myComparator(const pt & left, const pt & right) {
    return determineBin(left) < determineBin(right);
}

sort(points.begin(), points.end(), myComparator);[/codebox]

it would be better to do it like this for the GPU:

[codebox]vector<float> x;
vector<float> y;
vector<float> z;
// somehow fill up the point data

// generate the bin data
// (could be nicer using std::transform and boost/thrust zip_iterators,
// but I'm doing it this way since the other way might be confusing
// at first if you've never seen those used before)
vector<int> bins;
for (int i = 0; i < numPoints; ++i)
    bins.push_back( determineBin(x[i], y[i], z[i]) );

// and also a value array which is just the index
// (again, could be nicer using boost/thrust counting_iterators,
// but doing it this way for clarity in the case of unfamiliarity)
vector<int> value;
for (int i = 0; i < numPoints; ++i)
    value.push_back(i);

// then sort the (bin, value) pairs by bin
sort_by_key(bins, value);

// and reshuffle the x, y, z data according to the sorted value array
for (int i = 0; i < numPoints; ++i) {
    xnew.push_back( x[value[i]] );
    ynew.push_back( y[value[i]] );
    znew.push_back( z[value[i]] );
}[/codebox]

You shouldn’t have to modify any of the sorting code directly. You could do this the “hard” way using CUDPP or using thrust as previously mentioned. How big is the structure you need to sort?

Right, the next step is to implement radixsort as was done in the particles SDK example.

I have no idea why the <thrust/sort.h> version took over 20 seconds on 544k elements.

Would you mind posting the code that reproduces this slow sort? If performance is really this slow, we need to fix this bug.

I have since implemented (and am happy with) the radixSort from the SDK and may have deleted the thrust version. I’ll look around for a backup.