Hi. I have a large number of arrays (WinWidth*WinHeight exactly). Each array have fixed size == 32.

I need to sort them as fast as it possible.

I wrote buble sort, but it is too slow. What sorting algorithm will be better to use in my case?

btw. my bubble sort.

/////////////////////////////////////////////////////////////////////////
////
template
<
int ARR_SIZE, // array size. shold be less or eq that CTA_SIZE
class T, // type of array elements
template <class> class LESS // LESS<T>::exec(A,B) return true if A < B
>
__device__ void BubleSort(T data[])
{
int idInChunck = threadIdx.x % ARR_SIZE;
int chunckOffset = (threadIdx.x/ARR_SIZE)*ARR_SIZE;
for(int i=0;i<ARR_SIZE;i++)
{
int index = (idInChunck+i) % ARR_SIZE + chunckOffset;
if(threadIdx.x%2 == 0 && index+1 < ARR_SIZE + chunckOffset)
{
T A = data[index];
T B = data[index+1];
if(!LESS<T>::exec(A,B))
{
data[index] = B;
data[index+1] = A;
}
}
__syncthreads();
}
}

For small arrays you should be able to use a sorting network (see Knuth). Hopefully nvcc will convert min/max expressions into suitable branchless instruction sequences but I haven’t verified this.