The last few days I’ve been looking at sort routines in CUDA and am afraid it may not be worth the trouble. Hopefully you can talk me down on this, since there have been some heroic efforts in this area that certainly deserve consideration.
1) I haven’t seen an example of how to sort a structure of several elements. Even if there were one, I’m thinking it could eat up the 16 KB shared-memory limit per multiprocessor pretty fast.
2) I haven’t seen a way to recover the original index from the sorted index. I thought about operating on the prefix scan to back out the shifts performed at each level, but that would take a much better programmer than me. Appending the initial grid-block-thread index to the key would make this much easier, but then I’d need to solve issue 1).
3) Moving the keys around after scanning each chunk is pretty wasteful of resources and may drive the efficiency way down relative to the CPU. Making the key bigger would not help the issue.
I took a look at the radix sort in CUDPP and really appreciate the heroic efforts made in improving its performance, but there still remain limitations on the number of elements and the size of the keys that just do not come into play in sorts that use, for example, recursive linked lists. Also, several hundred lines of code to do what an iterative CPU sort can do in maybe 50 lines makes GPU sorting much more difficult to debug.
Personally, I don’t think this is a limitation of the CUDA architecture, since the CPU is always available to handle things that don’t make sense on the GPU.
Maybe as the technology advances, things like the atomicInc() function could turn this around, but in the meantime I’m having difficulty recommending the CUDA sort routines. Again, I encourage you to talk me down on this.
This paper by Satish et al. describes not only the radix sort that was recently included in the new CUDPP release, but also a very clever merge sort that might be better suited to what you want to achieve:
Thrust can sort structs directly (up to some shared memory size limits), but if performance is important I would decompose the structs into separate arrays and perform several key-value sorts. In fact, this is how we currently sort 64-bit types in Thrust: we first sort the lower 32 bits using a key-value (radix) sort to record the necessary permutation info, and then sort the upper 32 bits, again using a key-value sort to track the permutation.
Here’s the (somewhat convoluted) code that handles 64-bit integers and doubles:
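In spirit, the two-pass idea looks something like this (a heavily simplified sketch, not the actual implementation; sort_u64_two_pass and the host-side split loop are just for illustration):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/gather.h>

// Sketch: sort 64-bit keys with two stable 32-bit key-value passes.
// The second (upper-bits) pass is stable, so ties keep the order
// established by the lower-bits pass.
void sort_u64_two_pass(thrust::device_vector<unsigned long long>& keys)
{
    const size_t n = keys.size();

    thrust::device_vector<unsigned int> lo(n), hi(n);
    thrust::device_vector<int> perm(n);
    thrust::sequence(perm.begin(), perm.end());   // perm = 0,1,2,...,n-1

    // split each 64-bit key into lower and upper 32-bit words
    // (host loop purely for clarity; in practice this would be a kernel)
    for (size_t i = 0; i < n; ++i)
    {
        unsigned long long k = keys[i];
        lo[i] = (unsigned int)(k & 0xffffffffULL);
        hi[i] = (unsigned int)(k >> 32);
    }

    // pass 1: sort by the lower 32 bits, carrying the permutation along
    thrust::stable_sort_by_key(lo.begin(), lo.end(), perm.begin());

    // reorder the upper words by that permutation, then sort by them
    thrust::device_vector<unsigned int> hi_tmp(n);
    thrust::gather(perm.begin(), perm.end(), hi.begin(), hi_tmp.begin());
    thrust::stable_sort_by_key(hi_tmp.begin(), hi_tmp.end(), perm.begin());

    // perm now maps sorted position -> original index; gather the full keys
    thrust::device_vector<unsigned long long> sorted(n);
    thrust::gather(perm.begin(), perm.end(), keys.begin(), sorted.begin());
    keys = sorted;
}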
If you wrote a kernel that decomposed the relevant fields of your struct into separate arrays with standard C data types (int, float, char, etc.), the approach above would be reasonably efficient. A comparison-based sort might be more efficient if the number of fields is large, but for many cases I bet this approach is a net win.
If you want to try the comparison-based method, just use thrust::sort() with a user-defined comparison function (i.e. one that defines a < operation between structs). If you also want the permutation vector, then use the same trick illustrated in the example with thrust::sort_by_key() and sequence(). We haven’t tested the sorting code with large structs, so don’t be surprised if it blows up :)
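For example, something along these lines (a rough sketch; MyStruct and its fields are made-up names, not from your code):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

// illustrative struct -- replace with your own
struct MyStruct
{
    int   key;
    float payload;
};

// user-defined "less than" between structs for the comparison sort
struct MyStructLess
{
    __host__ __device__
    bool operator()(const MyStruct& a, const MyStruct& b) const
    {
        return a.key < b.key;
    }
};

void sort_structs(thrust::device_vector<MyStruct>& d_structs)
{
    // comparison-based sort of the structs themselves
    thrust::sort(d_structs.begin(), d_structs.end(), MyStructLess());
}

void sort_structs_with_permutation(thrust::device_vector<MyStruct>& d_structs,
                                   thrust::device_vector<int>&      permutation)
{
    // carry an index sequence along as the values to record where things went
    permutation.resize(d_structs.size());
    thrust::sequence(permutation.begin(), permutation.end());   // 0,1,2,...
    thrust::sort_by_key(d_structs.begin(), d_structs.end(),
                        permutation.begin(), MyStructLess());
}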
Isn’t it always going to be faster to sort just key/value pairs (or perhaps just the indices), and then re-order the full structures into sorted order in a subsequent pass?
That sounds like what I am trying to do. But I may not fully understand what is meant by a key/value pair.
Is the key the thing I am sorting, and the value the index into a vector of structures that contains other items of interest?
In other words: if the “key” is a 32-bit integer representing some property of a vector in the 2-D array, is the “value” the index into the array of vectors whose key is greater than the previous key in the sorted list of key/value pairs?
Yep, the “key” is the only part that affects the sort and the “value” is just something that gets carried along with the key. In the example I use some integer values to keep track of where the sorted keys came from. The values can be any data type you want.
If the key array was [6,4,3,5] and the value array was [0,1,2,3] then after calling sort_by_key() the arrays would be [3,4,5,6] and [2,1,3,0]. Another way to look at this is that each key/value at the same location in the array forms a pair (k,v) where k is the key and v is the value. We sort the pairs by key which has the effect of transporting the values to their new locations. In short, thrust::sort_by_key sorts the key array and applies the very same reordering to the value array.
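For concreteness, that example spelled out with Thrust would look something like this (a minimal sketch):

#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    // the [6,4,3,5] example above
    int keys_h[4]   = {6, 4, 3, 5};
    int values_h[4] = {0, 1, 2, 3};

    thrust::device_vector<int> keys(keys_h, keys_h + 4);
    thrust::device_vector<int> values(values_h, values_h + 4);

    // keys become [3,4,5,6]; the same reordering is applied to values,
    // which become [2,1,3,0], i.e. the original positions of the sorted keys
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

    return 0;
}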
If you can post some code that illustrates your needs, I can show you how to implement it with Thrust.
“Sorting with the GPU is worthless”: that is the main idea of this thread’s original post.
Not for me. I see cases where my CPU may be occupied at 100% on different tasks, including sorting large arrays of information.
I would like to be able to put a mid-range GPU in the computer just to relieve the CPU of these tasks, and gain 20% to 50% overall performance by sharing work between the CPU and GPU.
I can’t disclose other information about this project, but gaining 20% or more performance on a server may be really worthwhile!
I’m looking at your [6,4,3,5] example and the code in http://code.google.com/p/thrust/….…aphical_sort.cu. If I follow, in line 19 thrust::gather appends the index to create “temp” that looks like [60,41,32,53]. Then line 22 thrust::stable_sort_by_key turns “temp” into [32, 41, 53, 60] and returns in “permutation” [2,1,3,0].
Then in lines 56, 57, and 58 of the code, each set of keys is used with the previous permutation, and line 68 outputs the upper, middle, and lower parts using the final permutation as an index.
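In other words, I think each pass boils down to something like this (my own sketch of the pattern, not the linked file verbatim):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

// reorder this field's keys by the permutation accumulated so far,
// then refine the permutation with a stable key-value sort
void update_permutation(thrust::device_vector<int>& keys,
                        thrust::device_vector<int>& permutation)
{
    // temp[i] = keys[permutation[i]]  (keys viewed in the current sorted order)
    thrust::device_vector<int> temp(keys.size());
    thrust::gather(permutation.begin(), permutation.end(),
                   keys.begin(), temp.begin());

    // stable sort preserves the ordering established by earlier passes
    thrust::stable_sort_by_key(temp.begin(), temp.end(), permutation.begin());
}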
The radixsort example provides four sort routines: two for sorting by (float/integer) keys only, and two for sorting (float/integer) key/data pairs.
There is no way it should take anywhere near 20 seconds to perform the sort you need. Sorting key-value instead of just key shouldn’t be very much slower.
If you have this:
#include <vector>
#include <algorithm>

struct pt
{
    float x, y, z;
};

std::vector<pt> points; // somehow fill up with points

// each point is somehow associated with a bin;
// say that a function determineBin exists which figures this out,
// and you then wish to sort the points using the bin as the key
int determineBin(const pt& p); // assumed to exist

// in the STL way, you would do this:
bool myComparator(const pt& left, const pt& right)
{
    return determineBin(left) < determineBin(right);
}

std::sort(points.begin(), points.end(), myComparator);
For the GPU, it would be better to do something like this:
#include <vector>
#include <thrust/sort.h>

std::vector<float> x;
std::vector<float> y;
std::vector<float> z;
// somehow fill up point data

// generate the bin data
// (could be nicer using std::transform and boost/thrust zip_iterators,
// but I'm doing it this way since the other way might be confusing at first
// if you've never seen those used before)
std::vector<int> bins;
for (int i = 0; i < numPoints; ++i)
    bins.push_back( determineBin(x[i], y[i], z[i]) );

// and also a value array which is just the index
// (again, could be nicer using boost/thrust counting_iterators,
// but doing it this way for clarity in case of unfamiliarity)
std::vector<int> value;
for (int i = 0; i < numPoints; ++i)
    value.push_back(i);

// then sort the (bin, value) pairs with a key-value sort;
// with thrust::device_vector instead of std::vector this runs on the GPU
thrust::sort_by_key(bins.begin(), bins.end(), value.begin());

// and reshuffle the x, y, z data according to the sorted indices
std::vector<float> xnew, ynew, znew;
for (int i = 0; i < numPoints; ++i)
{
    xnew.push_back( x[ value[i] ] );
    ynew.push_back( y[ value[i] ] );
    znew.push_back( z[ value[i] ] );
}
You shouldn’t have to modify any of the sorting code directly. You could do this the “hard” way using CUDPP or using thrust as previously mentioned. How big is the structure you need to sort?