Sorting in CUDA

Is sorting in CUDA worth the trouble?

The last few days I’ve been looking at sort routines in CUDA and am afraid they may not be worth the trouble. Hopefully you can talk me down on this, since there have been some heroic efforts in this area that certainly deserve consideration.

  1. I haven’t seen an example of how to sort a structure of several elements. Even if there were one, I’m thinking it could eat up the 16 KB shared memory limit pretty fast.

  2. I haven’t seen a way to recover the original index from the sorted index. I thought about operating on the prefix scan to back out the shifts performed at each level, but that would take a much better programmer than I am. Appending the initial grid-block-thread index to the keyword could make this much easier, but then I’d need to solve issue 1).

  3. Moving the keywords around after scanning each chunk is pretty wasteful of resources and may drive efficiency way down relative to the CPU. Making the keyword bigger would not help the issue.

  4. I took a look at the radix sort in CUDPP and really appreciate the heroic efforts made in improving its performance, but there still remain limitations on the number of elements and the size of the keywords that just do not come into play in sorts that use recursive linked lists, for example. Also, several hundred lines of code to do what an iterative CPU sort can do in maybe 50 lines makes GPU sorting much more difficult to debug.

Personally, I don’t think this is a limitation of the CUDA architecture, since the CPU is always available to handle things that don’t make sense in CUDA.

Maybe as the technology advances, things like the atomicInc() function could turn this around, but in the meantime I’m having difficulty recommending the CUDA sort routines. Again, I encourage you to talk me down on this.

You might want to take a look at Thrust ( http://code.google.com/p/thrust/ ). It has a pretty nice implementation of sorting that uses radix sort on plain old data and merge sort on more complex structures.

This paper by Satish et al. describes not only the radix sort that was recently included in the new CUDPP release, but also a very clever merge sort that might be better suited to what you want to achieve:

http://gpgpu.org/2009/03/01/designing-effi…r-manycore-gpus

Thrust can sort structs directly (up to some shared memory size limits), but if performance is important I would decompose the structs into separate arrays and perform several key-value sorts. In fact, this is how we currently sort 64-bit types in Thrust. We first sort the lower 32 bits using a key-value (radix) sort to record the necessary permutation info, and then sort the upper 32 bits, again using a key-value sort to track the permutation.

Here’s the (somewhat convoluted) code that handles 64-bit integers and doubles:

http://code.google.com/p/thrust/source/bro…ort_key.inl#148

Also, here’s a small example that shows how one would sort three separate arrays in lexicographical order with Thrust:

http://code.google.com/p/thrust/source/bro…aphical_sort.cu

Here’s some example output:

[codebox]Unsorted Keys

(3,1,9)

(6,8,3)

(7,7,1)

(5,9,9)

(3,2,4)

(5,0,7)

(6,2,8)

(2,3,4)

(9,7,5)

(1,5,0)

(2,9,3)

(7,2,6)

(0,2,1)

(9,8,0)

(3,9,6)

(6,7,3)

(0,3,2)

(6,6,0)

(2,1,6)

(6,2,1)

Sorted Keys

(0,2,1)

(0,3,2)

(1,5,0)

(2,1,6)

(2,3,4)

(2,9,3)

(3,1,9)

(3,2,4)

(3,9,6)

(5,0,7)

(5,9,9)

(6,2,1)

(6,2,8)

(6,6,0)

(6,7,3)

(6,8,3)

(7,2,6)

(7,7,1)

(9,7,5)

(9,8,0)

[/codebox]

If you wrote a kernel that decomposed the relevant fields of your struct into separate arrays with standard C data types (int, float, char, etc.), the approach above would be reasonably efficient. A comparison-based sort might be more efficient if the number of fields is large, but for many cases I bet this approach is a net win.

If you want to try the comparison-based method, just use thrust::sort() with a user-defined comparison function (i.e. one that defines a < operation between structs). If you also want the permutation vector, then use the same trick illustrated in the example with thrust::sort_by_key() and sequence(). We haven’t tested the sorting codes with large structs, so don’t be surprised if it blows up :)

Isn’t it always going to be faster to sort just key/value pairs (or perhaps just the indices), and then re-order the full structures into sorted order in a subsequent pass?

Thanks for the reference. I plan to look at it after I get a bit more comfortable with the Standard Template Library.

That sounds like what I am trying to do. But I may not fully understand what is meant by a key/value pair.

Is the key the thing I am sorting, and the value the index into a vector of structures that contains other items of interest?

In other words: if the “key” is a 32-bit integer representing some property of a vector in the 2-D array, is the “value” the index into the array of vectors whose key is greater than the previous key in the sorted list of key/value pairs?

Yep, the “key” is the only part that affects the sort and the “value” is just something that gets carried along with the key. In the example I use some integer values to keep track of where the sorted keys went. The values can be any data type you want.

If the key array was [6,4,3,5] and the value array was [0,1,2,3] then after calling sort_by_key() the arrays would be [3,4,5,6] and [2,1,3,0]. Another way to look at this is that each key/value at the same location in the array forms a pair (k,v) where k is the key and v is the value. We sort the pairs by key which has the effect of transporting the values to their new locations. In short, thrust::sort_by_key sorts the key array and applies the very same reordering to the value array.

If you can post some code that illustrates your needs, I can show you how to implement it with Thrust.

“Sorting with GPU is worthless”, that is the main idea of this thread’s original post.

Not for me. I see cases where my CPU may be occupied at 100% on different tasks, including sorting large arrays of information.
I would like to be able to put a middle-class GPU in the computer just to relieve the CPU of these tasks, and gain 20% to 50% overall performance by sharing work between the CPU and GPU.

I can’t disclose other information about this project, but gaining 20% or more performance on a server may be really worthwhile!

nbell,

Let me know if I’m following okay:

I’m looking at your [6,4,3,5] example and the code in http://code.google.com/p/thrust/…aphical_sort.cu. If I follow, in line 19 thrust::gather appends the index to create “temp”, which looks like [60,41,32,53]. Then in line 22 thrust::stable_sort_by_key turns “temp” into [32,41,53,60] and returns [2,1,3,0] in “permutation”.

Then in lines 56, 57, and 58 of the code, each set of keys is used with the previous permutation, and line 68 outputs the upper, middle, and lower arrays using the final permutation as an index.

Thanks for your time, I’ll try it!

Aaron

Hi nbell,

It took a while but I finally got to trying out the sort_by_key on my data set.

The only change to “lexicographical_sort.cu” was to use 2 rather than 3 “update_permutations” calls.

The code produces the correct results on my 544k element arrays but it takes 20,105.8 ms. This is way too much time.

The radixSort routine in the CUDA SDK takes only 16.7 ms to sort 1,048,576 values. However, I need the “sort_by_key” feature for my problem.

Is there any way to tune-up thrust to work at comparable speed to the radixSort example?

Thanks,

Aaron

The radixsort example provides four sort routines: two for sorting by (float/integer) keys only, and two for sorting by (float/integer) key/data pairs.

There is no way it should take anywhere near 20 seconds to perform the sort you need. Sorting key-value instead of just key shouldn’t be very much slower.

If you have this:

[codebox]struct pt
{
    float x, y, z;
};

vector<pt> points; // somehow filled with points

// each point is associated with a bin; say a function determineBin()
// exists which figures this out, and we wish to sort the points
// using the bin as the key

// in the STL way, you would do this:
bool myComparator(const pt & left, const pt & right) {
    return determineBin(left) < determineBin(right);
}

sort(points.begin(), points.end(), myComparator);[/codebox]

it would be better to do it like this for the GPU:

[codebox]vector<float> x;
vector<float> y;
vector<float> z;
// somehow fill up the point data

// generate the bin data
// (could be nicer using std::transform and boost/thrust zip_iterators,
// but I'm doing it this way since the other way might be confusing
// at first if you've never seen those used before)
vector<int> bins;
for (int i = 0; i < numPoints; ++i)
    bins.push_back( determineBin(x[i], y[i], z[i]) );

// and also a value array which is just the index
// (again, could be nicer using boost/thrust counting_iterators,
// but doing it this way for clarity in the case of unfamiliarity)
vector<int> value;
for (int i = 0; i < numPoints; ++i)
    value.push_back(i);

// then sort the (bin, value) pairs by bin
sort_by_key(bins, value);

// and reshuffle the x, y, z data according to the sorted value array
for (int i = 0; i < numPoints; ++i) {
    xnew.push_back( x[value[i]] );
    ynew.push_back( y[value[i]] );
    znew.push_back( z[value[i]] );
}[/codebox]

You shouldn’t have to modify any of the sorting code directly. You could do this the “hard” way using CUDPP or using thrust as previously mentioned. How big is the structure you need to sort?

Right, the next step is to implement radixsort as was done in the particles SDK example.

I have no idea why the <thrust/sort.h> version took over 20 seconds on 544k elements.

Would you mind posting the code that reproduces this slow sort? If performance is really this slow, we need to fix this bug.

I have since implemented (and am happy with) the radixSort from the SDK and may have deleted the thrust version. I’ll look around for a backup.