fast sorting of smal arrays WinWidth*WinHeight arrays of size == 32

Hi. I have a large number of arrays (WinWidth*WinHeight exactly). Each array have fixed size == 32.

I need to sort them as fast as it possible.

I wrote buble sort, but it is too slow. What sorting algorithm will be better to use in my case?

btw. my bubble sort.

/////////////////////////////////////////////////////////////////////////

////

template

< 

  int ARR_SIZE, // array size. shold be less or eq that CTA_SIZE

  class T,	  // type of array elements

  template <class> class LESS  // LESS<T>::exec(A,B) return true if A < B

>

__device__ void BubleSort(T data[])

{

  int idInChunck = threadIdx.x % ARR_SIZE;

  int chunckOffset = (threadIdx.x/ARR_SIZE)*ARR_SIZE;

for(int i=0;i<ARR_SIZE;i++)

  {

	int index = (idInChunck+i) % ARR_SIZE + chunckOffset;

	if(threadIdx.x%2 == 0 && index+1 < ARR_SIZE + chunckOffset)

	{

	  T A = data[index];

	  T B = data[index+1];

	  if(!LESS<T>::exec(A,B))

	  {

		data[index] = B;

		data[index+1] = A;

	  }

	}

	__syncthreads();

  }

}

tnanks.

i forgot to say that elements have size == 8 bytes. And there floting point data.

For small arrays you should be able to use a sorting network (see Knuth). Hopefully nvcc will convert min/max expressions into suitable branchless instruction sequences but I haven’t verified this.

http://en.wikipedia.org/wiki/Sorting_network

thanks, i’ll try to do that.

You want Bitonic sort, especially since it’s already in the CUDA examples.

Sorting networks for N=32 inputs are quite huge and require a lot of comparators (for N=24 it was around to 129 compare-and-swap operations already).

Things get simpler for small arrays: For N=8 inputs I’ve successfully implemented a sorting network in an OpenGL ARBfp1.0 fragment shader. ;)

Christian

so, you think that sorting network will be too huge, may be you have another idea?