Sort on GPU Need some help to use sorts...


I’m a student from University of Liège. I’m trying to use CUDA to speed up the “decision tree” algorithm (the current version is only sequential). This algorithm is mostly used in a lot of bio-informatic applications.

At the moment, I’m trying to find a sort algorithm on GPU that would be faster than a “Quicksort on CPU”.

In memory, I have to sort a very large float vector.

I have seen some interessant algorithm, like :

  • bitonic sort

  • abi sort

  • radix sort

  • tera sort

I read in the forum that the “radix sort” given with the SDK is very efficient (in the project named “particles”). But I don’t understand how to use it…


//! Perform a radix sort

//! Sorting performed in place on passed arrays.


//! @param pData0       input and output array - data will be sorted

//! @param pData1       additional array to allow ping pong computation

//! @param elements     number of elements to sort


void RadixSort(KeyValuePair *pData0, KeyValuePair *pData1, uint elements, uint bits);

If I have a float Vector “float* vector”, how can I send to RadixSort() a “KeyValuePair *pData0” ?

Would you know any better algorithms to sort large float vectors ?

The bitonic sort seems to be working fine with a little vector (and with size%2 == 0) but if I really want to use it, I need to find

an example of CUDA implementation of the ABisort… Would you know where I could find it ?

Thank you for your help, and keep in mind that i’m a newbie :P!

What I understood is that the radixsort is sorting a data (key) based on a value (value)
So for you you have no key, just a value. I am not sure if it is easy to adjust the radix-sort algorithm to get rid of the data.

Thank you for your useful informations DenisR :)!

I don’t understand how the keys are used by the radix sort… I could set a unique key to each float by given the position of each float in the vector. But I’ve not been surprised to see this solution not working… How could I set these keys to try the efficiency of this algorithm in my application?

If radix sort is not efficient (or not usable), I have to reimplement the Bitonic sort to make it compatible for vector size != 2^n and next implement the (complex) ABisort to use bitonic sort with more than 512 elements. Ok, let’s work D1mmu! :fear:

My last question is : would you know any “easy to use” CUDA sorts function compatible with a large number of element?

CUDPP is a library with scan, reduction and I believe sorting primitives.

Thank you! So I’m going to learn how to use this library! I will post here my solution if it works fine.

The CUDPP lastest 1.0 alpha realse does not support key-value pair sorting, unless you read the kernel and hack it in your way. But an interesting point is that:

When I tried to compare the performance of the original radix sort in particle example with the new CUDPP sort primitive, I thought CUDPP should perform better than radix sort in SDK example because it’s a newer implementation and it only reads the key while the radix sort example reads both key and value…,but finally I found that CUDPP is actually slower than the old version. The performance of sort primitive in CUDPP seems to be slower than previous version…Well, maybe the reason is…it’s the “alpha version”… <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

Any one can sheet a light on why it is slower. CUDPP is highly optimized with CUDA prefix sum version, so why it is slower, this does not make sense to me

I tryied CUDPP and its sort function. It seems to be working fine for positive int and floats. The program totally crashes when I put negative values…

In the documentation, I read :

But does someone now if there is a way to use it with unsigned values?

option 1 : take the code and make it work with negative values
option 2 : search for the minimum element of the array (reduction),substract that from the entire array, perform the sort, add the minimum element to the entire array.

At the moment, I’m testing my implementation with “the option 2”, but my DB is large… I have to do some test…

For information, I did a little benchmark beetwen the Qsort in CPU and RadixSort (from CUDPP) in GPU, I get the following results ( element[i] = (float) rand()%nbElements; ):

(CPU : E6750 and GeForce 8800 gts 640)

For 10000 elements -> CPU : 3 ms and GPU : 0 ms

For 100000 elements -> CPU : 41 ms and GPU : 3 m

For 1000000 elements -> CPU : 391 ms and GPU : 36 ms

For 10000000 elements -> CPU : 4166 ms and GPU : 454 ms

The best CUDA parallel sort algorithm I know is the one implemented by Alan Kaatz.
It performs sort in a set of 16000000 elements (16 million elements) in just 311.35 (ms).

This is the link for his code:…

Maybe you should ask for his permission if you want to use it.

For 1 million elements, it took about 30 ms. (tested by me)

Good luck! :)

Thank you ! I tried this sort and it is very fast, but in ma case, I have to sort my matrix by attributes…

I have the following float matrix (in fact, it’s juste a large malloc) :

Record 1 : Att11 Att12 Att13 … Y1 (Y is the conclusion about the other attributes of the record)

Record 2 : Att21 Att22 Att23 … Y2

Record 3 : Att31 Att32 Att33 … Y3


In my algorithm, I have to sort each records by a defined Attribute. So I first tryed to transpose my matrix, to have one attribute by line. And then I would use the efficient function you showed me to sort only line by line…

Att1 : Rec11 Rec12 … Rec1n

Att2 : Rec21 Rec22 … Rec2n

Att3 : …

Y : Y1 Y2 … Yn

But I forgot that I can’t tranpose my matrix… because after a sort, my Y values would not be with her concerned Rec…

My solution would be to have a sort function as portable as qsort(), in wich I could give the size of my record (sizeof(float) * nb_attributes) and a function where I could define the specifics comparisons to do…

So I don’t see any other solutions… I must try to implement the ABIsort for my use case. :unsure:

Thank you for your help !

I’m trying to implement my own sort (king of bitonic sort).

At the moment, I have (512 elem) sub-matrixes sorted and I would like to merge them, would you know a parrallel merge function that takes two matrixes and fills a third one?

Ma : [ 1 2 4 5 ]
Mb : [ 2 5 6 7 ]
Mresult : [ 1 2 2 4 5 5 6 7]

Dear all,

I am trying to do radixsort with the following KeyValuePair:
typedef struct KeyValuePair {
float key;
float value;

or more precisely:
typedef struct KeyValuePair {
float key;
int value;

or even more complex:
typedef struct KeyValuePair {
float key;
int *value;

I looked at the options as suggested in this thread of post:

  1. radixsort in the ‘particles’ example’s KeyValuePair is:
    typedef struct KeyValuePair {
    uint key;
    uint value;
    radixsort is only for integers. I guess it won’t work with my case. Is there any other ways?

I don’t quite understand how radixsort works in this example. Why does it look so complex? Why does the element needs to round up to multiples of 3072?

  1. cudpp has a RADIXSORT. But it does not look like it supports KeyValuePair sorting. It will sort an array with a value.

I only need to sort less than 10000 such elements. Will it be worthy to use GPU for this task?

Thank you,

radix can be used for arbitrary key length. You should understand the mechanism CPU radix first before making your own GPU version, it is not too hard to make the change, but it is hard to make it fast. Good luck

There’s explanation inside the source code. There’s also related chapter in GPU Gems III

In that case the CPU sort will be tens time faster than GPUs. Normally, GPU sort is only faster with more than 1M records

Hey everyone, I’ve been trying to use the sort algorithm from the particles example inside my own project. After some modifications to work with positive floating point data, I’m not getting good performance results.

To give you an idea, here are some numbers:

number of keyvalue pairs ____ time (ms)
10.000 ___________________ 1.65
100.000 __________________ 6.94
1.000.000 ________________ 95.20

As far as I’ve checked, the results are correctly sorted. Anyone can tell me if I just messed up the original code? Or is this the performance I should expect anyway?

Btw, I’m testing on a 9800GX2, but i get similar results on a 8800GTX.