Sorting float2 and float arrays


I’m porting an evolutionary optimization from the CPU to GPU. Everything works fine and very fast so far. But now I’ve a problem when sorting the result arrays.

I’ve 2 arrays. The first array is float2 and has 130.000 Elements, the second one is float and has 10.000 elements

float2 *a // 130K elements

float *b  //	10K elements

Each element of array b belongs to 13 elements of array a.

(Array a contains 13 points of a Bezier based geometry, array b contains the calculatet area covert by this geometry - and I’ve to sort the bezier array by the best area values)

a[0]   - a[12] -> b[0]

a[13] - a[25] -> b[1]


Now I’ve to sort both according to the value size of array b

b[0] = 4000; 

a[0].x  = 4.7;


a[12].x = 7.8;

b[1] = 3000;

a[13].x  = 10.5;


a[25].x = 11.6;

b[2] = 5000;

a[26].x  = 15.3;


a[38].x = 16.9;

// ---- sortet

b[0] = 5000; 

a[0].x  = 15.3;


a[12].x = 16.9;

b[1] = 4000;

a[13].x  = 4.7;


a[25].x = 7.8;

b[2] = 3000;

a[26].x  = 10.5;


a[38].x = 11.6;

I hope u understand what I mean.

I looked at the examples of radix and bitonic sort…I also checked gpuqsort. But all algorithmes only sort one list with standard datatypes by value.

My idea was to sort the float array and only save the sortet indecies into an new list. But my method (by reduction) only does 1024 elements. And 130K or at least 10K elements are far to much for shared memory.

Any further ideas?

It would also be possible to save both list in to one float3 array, if this would help.


Quick and dirty, not the most efficient, but easy.

Use the fact that the radix sort is a stable sort. So two (or more) keys with the same value will keep their relative order.
So attach the key value b to EACH of your 13 a values. Then sort the whole mess.
Your values will now be in order, with the 13 values still sequentially enumerated. So this is a sort of 130,000 key-data pairs, even though your true problem is only a 10K key sort with a fat data payload.

It’s not as efficient as it could be because you’re using none of your special knowlege of the one key / 13 data setup, but it’d work and would be easy.
If the performance isn’t good enough, then you start thinking about customizing more.

Sorting this little data on the GPU isn’t as efficient as it is on the CPU, though. The GPU starts to do best at 1M+.

Thx for ur fast response.

I’m not sure how to use the keyvalues, but I’ll try it. It mustn’t be that ultra fast, till now the gpu ist about 1.700 times faster than a Phenom x4, some performance loose won’t harm that much. The main thing is, that it should be faster than converting to cpu, sort there and load it back to the GPU.

I will report here, if it works.


Ok, after almost 6 hours, I’ve to admit, that porting the radixsort sample of the sdk is to hard for me. Wheter I understand was is happening nor do I get what is important for my use. And copying the whole thing and just replace the given arrays with mine, doesn’t function either. So I tried CUDPPSORT. But as soon as I use CUDPPResult, all my cuda float2 arrays can’t be initialised anymore?! I can’t frigure out what exactly the problem is. But does anyone knows, if cudppsort can take float2 arrays? The documentation is very poor.

Is there any other way to get a simple radixsort, like kind of just one function to call?

Thx again.

Use thrust::sort_by_key.


Is there a way to get an already allocated and filled device array into thrust::device_vector? All examlples I found so far copy host vectors to the device.

Ah ok, I got it.

You’ve to use:

#include <thrust/sort.h>

float2 *a;

float *b;

cudaMalloc( (void**) &a, sizeof(float2)*numElements);

cudaMalloc( (void**) &b, sizeof(float)*numElements);

thrust::device_ptr<float2> thrust_a_pointer(a);

thrust::device_ptr<float> thrust_b_pointer(b);


thrust::sorting::radix_sort_by_key( thrust_b_pointer, thrust_b_pointer+numElements, thrust_a_pointer);

It’s not that fast, but I does what i need. Thx very much for the hint!