CUDA speedup for bitonic merge sort

Because I don’t know how to use the CUDPP sort, I implemented my own bitonic merge sort in CUDA, and it is about 3x faster than a standard CPU merge sort.

I wonder if there is a faster version. Has anyone gotten more than a 3x speedup for merge sort?
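For context, here is a minimal CPU merge sort of the kind I am comparing against. This is only a sketch of a typical baseline, not my actual benchmark code:

```c
#include <string.h>

/* Plain top-down merge sort over ints; tmp must be at least as large as a.
   This is a hypothetical stand-in for the "standard CPU merge sort" baseline. */
static void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;          /* runs of length 0 or 1 are sorted */
    int mid = (lo + hi) / 2;
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);
    /* merge the two sorted halves into tmp, then copy back */
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}
```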


You don’t need to write your own merge sort to implement bitonic sort.

This link describes several approaches to implementing bitonic sort: …itonic_sort.htm

The second method can be used in CUDA.

The host code looks like this:

const int MAX_THREAD = 128;

int j, k;
for (k = 2; k <= total; k <<= 1)
    for (j = k >> 1; j > 0; j >>= 1)
        BitonicSort<<<total / MAX_THREAD, MAX_THREAD>>>(dvalues, j, k);

And the kernel function is:

__global__ static void BitonicSort(int * values, int j, int k)
{
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int ixj = i ^ j;

    // Only the lower-indexed thread of each pair does the compare-and-swap.
    if (ixj > i) {
        // Ascending part of the bitonic network.
        if ((i & k) == 0 && values[i] > values[ixj]) {
            // exchange(i, ixj);
            int temp = values[i];
            values[i] = values[ixj];
            values[ixj] = temp;
        }
        // Descending part of the bitonic network.
        if ((i & k) != 0 && values[i] < values[ixj]) {
            // exchange(i, ixj);
            int temp = values[i];
            values[i] = values[ixj];
            values[ixj] = temp;
        }
    }
}

This code can sort any number of elements that is a power of 2, unlike the example project shipped in the CUDA SDK. Some tuning is needed to handle a truly arbitrary number of elements.

It runs fairly fast. As far as I remember, it is even faster than CUDPP.
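If you want to check the compare-and-swap logic before running it on the GPU, the same j/k network can be executed sequentially on the CPU, with the inner loop standing in for the thread grid. This is just a verification sketch, not the CUDA version:

```c
/* CPU sketch of the same bitonic network: the loop over i plays the role
   of one GPU thread per element. total must be a power of 2. */
static void bitonic_sort_cpu(int *values, int total) {
    for (int k = 2; k <= total; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            for (int i = 0; i < total; i++) {
                int ixj = i ^ j;
                if (ixj > i) {
                    if ((i & k) == 0 && values[i] > values[ixj]) {
                        int t = values[i]; values[i] = values[ixj]; values[ixj] = t;
                    }
                    if ((i & k) != 0 && values[i] < values[ixj]) {
                        int t = values[i]; values[i] = values[ixj]; values[ixj] = t;
                    }
                }
            }
        }
    }
}
```

Each (k, j) pass here corresponds to one kernel launch in the host loop above, so a mismatch between this function and the GPU result points at the launch configuration rather than the network itself.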