Because I don’t know how to use the CUDPP sort,
I’ve implemented my own bitonic merge sort in CUDA,
and it is about 3x faster than a standard CPU merge sort.

I wonder if there can be a faster version.
Has anyone gotten more than a 3x speedup for mergesort?

const int MAX_THREAD = 128;
int j, k;
// k: size of the bitonic sequences being merged; j: compare distance
for (k = 2; k <= total; k <<= 1)
    for (j = k >> 1; j > 0; j >>= 1)
        BitonicSort<<<total / MAX_THREAD, MAX_THREAD>>>(dvalues, j, k);

And the kernel function is:

__global__ static void BitonicSort(int *values, int j, int k)
{
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int ixj = i ^ j;

    if (ixj > i) {
        // Sort ascending when bit k of i is 0, descending when it is 1.
        if ((i & k) == 0 && values[i] > values[ixj]) {
            // exchange(i, ixj);
            int temp = values[i];
            values[i] = values[ixj];
            values[ixj] = temp;
        }
        if ((i & k) != 0 && values[i] < values[ixj]) {
            // exchange(i, ixj);
            int temp = values[i];
            values[i] = values[ixj];
            values[ixj] = temp;
        }
    }
}

This code can sort any power-of-two number of elements, unlike the example project shipped in the CUDA SDK. Some tuning would be needed to handle a truly arbitrary count.

It runs fairly fast. As far as I remember, it is even faster than CUDPP.