# fast sorting of smal arrays WinWidth*WinHeight arrays of size == 32

Hi. I have a large number of arrays (WinWidth*WinHeight exactly). Each array have fixed size == 32.

I need to sort them as fast as it possible.

I wrote buble sort, but it is too slow. What sorting algorithm will be better to use in my case?

btw. my bubble sort.

``````/////////////////////////////////////////////////////////////////////////

////

template

<

int ARR_SIZE, // array size. shold be less or eq that CTA_SIZE

class T,	  // type of array elements

template <class> class LESS  // LESS<T>::exec(A,B) return true if A < B

>

__device__ void BubleSort(T data[])

{

int idInChunck = threadIdx.x % ARR_SIZE;

for(int i=0;i<ARR_SIZE;i++)

{

int index = (idInChunck+i) % ARR_SIZE + chunckOffset;

if(threadIdx.x%2 == 0 && index+1 < ARR_SIZE + chunckOffset)

{

T A = data[index];

T B = data[index+1];

if(!LESS<T>::exec(A,B))

{

data[index] = B;

data[index+1] = A;

}

}

}

}
``````

tnanks.

i forgot to say that elements have size == 8 bytes. And there floting point data.

For small arrays you should be able to use a sorting network (see Knuth). Hopefully nvcc will convert min/max expressions into suitable branchless instruction sequences but I haven’t verified this.

http://en.wikipedia.org/wiki/Sorting_network

thanks, i’ll try to do that.

You want Bitonic sort, especially since it’s already in the CUDA examples.

Sorting networks for N=32 inputs are quite huge and require a lot of comparators (for N=24 it was around to 129 compare-and-swap operations already).

Things get simpler for small arrays: For N=8 inputs I’ve successfully implemented a sorting network in an OpenGL ARBfp1.0 fragment shader. ;)

Christian

so, you think that sorting network will be too huge, may be you have another idea?