Faulty sort algorithm. Please help! Odd-even sort

I’m trying to create an odd-even sort algorithm in CUDA, but the one I created only works partially.

Here’s the code:

__global__ void oddevensort(int *in, int *out, int size)
{
    bool oddeven = true;
    bool swappedodd = true;
    bool swappedeven = true;

    while (true)
    {
        if (oddeven == true)
        {
            __syncthreads();
            swappedodd = false;
            int idx = threadIdx.x;
            if (idx < (size/2))
            {
                if (in[2*idx] > in[2*idx+1])
                {
                    swap(in[2*idx], in[2*idx+1]);
                    swappedodd = true;
                }
            }
            __syncthreads();
        }
        else
        {
            __syncthreads();
            swappedeven = false;
            int idx = threadIdx.x;
            if (idx < (size/2)-1)
            {
                if (in[2*idx+1] > in[2*idx+2])
                {
                    swap(in[2*idx+1], in[2*idx+2]);
                    swappedeven = true;
                }
            }
            __syncthreads();
        }

        if (!(swappedodd || swappedeven))
            break;

        oddeven = !oddeven;  // switch mode of sorting
    }

    __syncthreads();

    int idx = threadIdx.y*blockDim.x + threadIdx.x;
    if (idx < size)
        out[idx] = in[idx];
}

I tried sorting an array of consecutive numbers in descending order and it worked.

However, when I tried random numbers in random order, the code did not function as expected.

Can anyone help me with this?

For more info on odd-even sort, look here:

http://en.wikipedia.org/wiki/Odd-even_sort

I am pretty new to this CUDA thing myself, so my answer may be off…

Initially, I assumed that you are scheduling more than 1 block simultaneously, i.e. your array is bigger than 512 elements. But on closer inspection of your code this is apparently not the case: I am not seeing any use of the blockIdx variable.

Your definition of idx differs inside and outside the while() loop.

int idx=threadIdx.x;

int idx=threadIdx.y*blockDim.x+threadIdx.x;

This confuses me. The second definition of idx suggests a 2-dimensional thread block. The only way I see this working is if your blockDim.y is exactly 2 and blockDim.x is half the number of array elements per thread block.

But for each threadIdx.x there are two threads running, one of which would not be required for sorting, just for producing the out array. Isn’t this wasting a lot of GPU resources?

Also there might be race conditions between the two threads with identical threadIdx.x (between the comparison and the actual swap). For example, both instances may initially find a reason to swap, and both threads then swap the values, leaving the pair unsorted or garbled. ;) This may be your sorting bug.

Then there’s one more problem that concerns your boolean state variables. How do you make sure that you only terminate when ALL threads report no swapping? Right now you set the swap states to false quite independently in each individual thread. This may cause quite some confusion. You need some means of determining that none of the threads swapped during the last iteration. Maybe some atomic counters instead of booleans, where you count the number of non-swap events? But my CUDA knowledge is not enough to help you with this. Also, it is my understanding that the termination variable must be in block-shared scope, not thread-local.

If swappedodd and swappedeven were shared variables AND you also did a __syncthreads() after setting them to false, this might work. Any expert opinions on this?

The following applies if you intend to work on large arrays of >512 elements that require sending many blocks to the GPU. Isn’t it that each thread block works separately and independently on its assigned Streaming Multiprocessor? Hence each block will independently evaluate the termination condition - that means each block would stop once its own elements are sorted, even though the ordering may still be wrong across block boundaries.

When two independent thread blocks actually try to compare values across the block boundary, this will not happen in a synchronized manner because the __syncthreads() intrinsic only synchronizes all the threads belonging to the same thread block - but never across blocks. So for example the neighbor block may not even have started sorting yet when you perform the comparison.

Termination of the entire kernel forces synchronization across all thread blocks, so the only solution I see is to make an entirely new invocation of the kernel. My take is that you will need an outer loop across several (possibly MANY) kernel invocations to correctly sort (and swap) element pairs that cross block boundaries. By offsetting the block boundaries on each kernel invocation (similar to the odd/even approach) you may get better results.

Here’s how to fix your bug (hopefully):

  • reduce the Y dimension of the thread block to 1. Make each thread copy 2 values into the out array instead.

  • find a way to only terminate when all threads find no swaps. Try my suggestion of making swappedeven and swappedodd shared variables, and also use __syncthreads() after setting them to false. In fact it would be enough if only one thread (say the one with threadIdx.x == 0) resets them to false.

Can you unroll the while() loop such that it always performs the odd and even passes in one iteration? This alleviates the need for the oddeven flag, and you will only need three __syncthreads() per iteration, each one right after setting the corresponding swap flag to false. One more __syncthreads() may be needed before checking the exit condition.

And now a few ideas for making this work with LARGE arrays of >512 elements, where you have to schedule more than 1 thread block simultaneously by using a grid. Using grids is also much better because more Streaming Multiprocessors get used - exploiting more parallelism. (A host-side sketch follows the list below.)

  • sort each block independently (i.e. don’t try to compare or swap elements across block boundaries)

  • you can call the sorting kernel multiple times, offsetting the element boundaries by half the block size on every second invocation of the kernel, e.g. increment idx by an offset that you specify in constant memory.

  • After each kernel invocation, your outer loop on the host CPU should check if the sorting condition is fulfilled across the block boundaries. If not, you have to repeat the kernel call.

  • In each thread block, copy the corresponding range of in values into shared memory before commencing sorting. Hence in the following iterations, the comparison and swap operations no longer need to access global memory, resulting (hopefully) in a huge speedup.
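To illustrate that outer loop, here’s a minimal host-side sketch. The kernel signature, buffer names and the boundary check are hypothetical placeholders (your kernel would read the offset from constant memory as suggested above, or take it as an argument):

#include <cuda_runtime.h>

// hypothetical host-side outer loop: re-launch the sort kernel, shifting
// the block boundaries by half a block on every second pass, until the
// ordering also holds across all block boundaries
__global__ void sortKernel(float *data, int size, int offset);

void oddEvenSortLarge(float *d_data, float *h_tmp, int size,
                      int blocks, int threads)
{
    int elemsPerBlock = 2 * threads;   // each thread handles 2 elements
    for (int pass = 0; ; pass++)
    {
        int offset = (pass & 1) ? elemsPerBlock / 2 : 0;
        sortKernel<<<blocks, threads>>>(d_data, size, offset);

        // copy the array back and verify the block boundaries on the host
        // (copying only the boundary elements would be cheaper)
        cudaMemcpy(h_tmp, d_data, size * sizeof(float),
                   cudaMemcpyDeviceToHost);
        bool sorted = true;
        for (int b = 1; b < blocks; b++)
        {
            int i = b * elemsPerBlock;
            if (i < size && h_tmp[i-1] > h_tmp[i]) { sorted = false; break; }
        }
        if (sorted) break;   // every boundary is in order - done
    }
}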

Note that there are sorting algorithms out there that map much better to the GPU than this approach. ;-)

My bugfixing suggestions actually worked. I used the Visual C++ 2005 project template example from the SDK, and I am sorting floats, not integers. I compare the output against a quicksort implementation on the CPU and it passes, meaning identical output.

I kept your original thread block dimension of 2 in the vertical dimension, but I avoided the race condition by checking for threadIdx.y == 0. Of course this is a huge performance drain because half the threads are idling.

I also appended a shared memory version of the code which seems to run faster. Note that you may need to change the kernel invocation to pass the array’s size in bytes as the third argument of the execution configuration. I haven’t checked for bank conflicts yet.
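For reference, passing the shared memory size happens in the third argument of the execution configuration; the sizes below are just example values, not the project’s actual launch code:

// the third <<<...>>> argument is the dynamically allocated shared memory
// size in bytes; the kernel accesses it via: extern __shared__ float sdata[];
dim3 grid(1, 1, 1);
dim3 block(256, 2, 1);                 // 512 threads, as in the kernel below
int smemBytes = 512 * sizeof(float);   // one float per array element
testKernel<<<grid, block, smemBytes>>>(d_in, d_out, 512);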

EDIT: I added two more updates, these are further optimized versions. Check the comments in the code.

Tomorrow we will dive into a grid implementation and sort arrays with 1M entries ;)

#define swap(a, b) \
{ \
    float tmp = a; \
    a = b; \
    b = tmp; \
}

__global__ void testKernel(float *in, float *out, int size)
{
    bool oddeven = true;
    __shared__ bool swappedodd;
    __shared__ bool swappedeven;
    swappedodd = true;
    swappedeven = true;

    while (true)
    {
        if (oddeven == true)
        {
            __syncthreads();
            swappedodd = false;
            __syncthreads();
            if (threadIdx.y == 0) {
                int idx = threadIdx.x;
                if (idx < (size/2))
                {
                    if (in[2*idx] > in[2*idx+1])
                    {
                        swap(in[2*idx], in[2*idx+1]);
                        swappedodd = true;
                    }
                }
            }
            __syncthreads();
        }
        else
        {
            __syncthreads();
            swappedeven = false;
            __syncthreads();
            if (threadIdx.y == 0) {
                int idx = threadIdx.x;
                if (idx < (size/2)-1)
                {
                    if (in[2*idx+1] > in[2*idx+2])
                    {
                        swap(in[2*idx+1], in[2*idx+2]);
                        swappedeven = true;
                    }
                }
            }
            __syncthreads();
        }

        if (!(swappedodd || swappedeven))
            break;

        oddeven = !oddeven;  // switch mode of sorting
    }

    __syncthreads();

    int idx = threadIdx.y*blockDim.x + threadIdx.x;
    if (idx < size)
        out[idx] = in[idx];
}

//============================================================
//
// Shared Memory version follows
//

#define SDATA( index)      CUT_BANK_CHECKER(sdata, index)

#define swap(a, b) \
{ \
    float tmp = a; \
    a = b; \
    b = tmp; \
}

__global__ void testKernel(float *in, float *out, int size)
{
    bool oddeven = true;
    __shared__ bool swappedodd;
    __shared__ bool swappedeven;

    // shared memory
    // the size in bytes is determined by the host application as
    // the third argument in kernel invocation
    extern __shared__ float sdata[];

    // copy input data to shared memory
    int idx = threadIdx.y*blockDim.x + threadIdx.x;
    if (idx < size)
        SDATA(idx) = in[idx];
    __syncthreads();

    swappedodd = true;
    swappedeven = true;

    while (true)
    {
        if (oddeven == true)
        {
            __syncthreads();
            swappedodd = false;
            __syncthreads();
            if (threadIdx.y == 0) {
                int idx = threadIdx.x;  // this shadows an existing idx variable
                if (idx < (size/2))
                {
                    if (SDATA(2*idx) > SDATA(2*idx+1))
                    {
                        swap(SDATA(2*idx), SDATA(2*idx+1));
                        swappedodd = true;
                    }
                }
            }
            __syncthreads();
        }
        else
        {
            __syncthreads();
            swappedeven = false;
            __syncthreads();
            if (threadIdx.y == 0) {
                int idx = threadIdx.x;  // this shadows an existing idx variable
                if (idx < (size/2)-1)
                {
                    if (SDATA(2*idx+1) > SDATA(2*idx+2))
                    {
                        swap(SDATA(2*idx+1), SDATA(2*idx+2));
                        swappedeven = true;
                    }
                }
            }
            __syncthreads();
        }

        if (!(swappedodd || swappedeven))
            break;

        oddeven = !oddeven;  // switch mode of sorting
    }

    __syncthreads();

    // copy data from shared memory to output
    if (idx < size)
        out[idx] = SDATA(idx);
}

//============================================================
//
// Shared Memory and unrolled while loop version follows
// I also made only one master thread reset the swap flags to false
//

#define SDATA( index)      CUT_BANK_CHECKER(sdata, index)

#define swap(a, b) \
{ \
    float tmp = a; \
    a = b; \
    b = tmp; \
}

__global__ void testKernel(float *in, float *out, int size)
{
    __shared__ bool swappedodd;
    __shared__ bool swappedeven;

    // shared memory
    // the size in bytes is determined by the host application as
    // the third argument in kernel invocation
    extern __shared__ float sdata[];

    // copy input data to shared memory
    int idx = threadIdx.y*blockDim.x + threadIdx.x;
    if (idx < size)
        SDATA(idx) = in[idx];
    __syncthreads();

    swappedodd = true;
    swappedeven = true;

    while (true)
    {
        // set swap flag to false for odd indices
        __syncthreads();
        if (threadIdx.x == 0 && threadIdx.y == 0) swappedodd = false;
        __syncthreads();

        if (threadIdx.y == 0) {
            int idx = threadIdx.x;  // this shadows an existing idx variable
            if (idx < (size/2)) {
                if (SDATA(2*idx) > SDATA(2*idx+1)) {
                    swap(SDATA(2*idx), SDATA(2*idx+1));
                    swappedodd = true;
                }
            }
        }

        // check exit condition
        __syncthreads();
        if (!(swappedodd || swappedeven)) break;

        // set swap flag to false for even indices
        __syncthreads();
        if (threadIdx.x == 0 && threadIdx.y == 0) swappedeven = false;
        __syncthreads();

        if (threadIdx.y == 0) {
            int idx = threadIdx.x;  // this shadows an existing idx variable
            if (idx < (size/2)-1) {
                if (SDATA(2*idx+1) > SDATA(2*idx+2)) {
                    swap(SDATA(2*idx+1), SDATA(2*idx+2));
                    swappedeven = true;
                }
            }
        }

        // check exit condition
        __syncthreads();
        if (!(swappedodd || swappedeven)) break;
    }

    __syncthreads();

    // copy data from shared memory to output
    if (idx < size)
        out[idx] = SDATA(idx);
}


//============================================================
//
// Final version for today... promised!
// This is like the above unrolled version, but the thread block is now
// a 1-dimensional thread block. Change your block Y dimension from 2 to 1!
// Also I precompute 2*idx and 2*idx+1 to save computation cycles.
// Really, the compiler should have optimized that, but.... no.
// More speed-up could be gained by dropping the checks against size, but
// that would require the input array size to always match the number of
// threads. In a GRID implementation with many blocks, we could achieve
// that simply by padding the final block's array at the end with the
// largest possible float numbers. These cannot interfere with the sort.

#define SDATA( index)      CUT_BANK_CHECKER(sdata, index)

#define swap(a, b) \
{ \
    float tmp = a; \
    a = b; \
    b = tmp; \
}

__global__ void testKernel(float *in, float *out, int size)
{
    __shared__ bool swappedodd;
    __shared__ bool swappedeven;

    // shared memory
    // the size in bytes is determined by the host application as
    // the third argument in kernel invocation
    extern __shared__ float sdata[];

    // get thread index
    int idx2   = 2*threadIdx.x;
    int idx2_1 = idx2+1;
    int idx2_2 = idx2+2;

    // copy input data to shared memory
    if (idx2   < size) SDATA(idx2  ) = in[idx2  ];
    if (idx2_1 < size) SDATA(idx2_1) = in[idx2_1];
    __syncthreads();

    if (threadIdx.x == 0) {
        swappedodd = true;
        swappedeven = true;
    }

    while (true)
    {
        // master thread sets swap flag for odd indices to false
        __syncthreads();
        if (threadIdx.x == 0) swappedodd = false;
        __syncthreads();

        // swap even with odd elements
        if (idx2_1 < size) {
            if (SDATA(idx2) > SDATA(idx2_1)) {
                swap(SDATA(idx2), SDATA(idx2_1));
                swappedodd = true;
            }
        }

        // check exit condition
        __syncthreads();
        if (!(swappedodd || swappedeven)) break;

        // master thread sets swap flag for even indices to false
        __syncthreads();
        if (threadIdx.x == 0) swappedeven = false;
        __syncthreads();

        // swap odd with even elements
        if (idx2_2 < size) {
            if (SDATA(idx2_1) > SDATA(idx2_2)) {
                swap(SDATA(idx2_1), SDATA(idx2_2));
                swappedeven = true;
            }
        }

        // check exit condition
        __syncthreads();
        if (!(swappedodd || swappedeven)) break;
    }

    __syncthreads();

    // copy data from shared memory to output
    if (idx2   < size) out[idx2]   = SDATA(idx2  );
    if (idx2_1 < size) out[idx2_1] = SDATA(idx2_1);
}

Haha! LOL

So I did the outer loop across kernel calls, and I am overlapping the blocks by half the block size in each second call, thus allowing array elements to traverse from one block to the next in 2 kernel calls. So in total we need 2*gridDim.x explicit calls of the kernel to allow an element to traverse the entire array in the worst case (i.e. when the input array was reverse-sorted).

In the CUDA kernel I dropped the comparisons against the array size; instead I pad the sort array with the largest floating point number according to IEEE 754, to provide enough elements to make the input a multiple of 1024 values. Each thread block processes exactly 1024 values with its 512 threads. I also managed to get rid of bank conflicts by splitting the odd and even elements apart while maintaining nice read and write coalescing. I am quite happy with the performance of the kernel now.
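As a minimal sketch, the padding could look like this on the host (FLT_MAX from <float.h> is the largest finite IEEE 754 single-precision value; the buffer names are made up):

#include <float.h>
#include <stdlib.h>
#include <string.h>

// round the element count up to the next multiple of 1024 and pad with
// FLT_MAX sentinels: they sort to the very end and never disturb real data
int paddedSize = ((size + 1023) / 1024) * 1024;
float *h_padded = (float*)malloc(paddedSize * sizeof(float));
memcpy(h_padded, h_data, size * sizeof(float));
for (int i = size; i < paddedSize; i++)
    h_padded[i] = FLT_MAX;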

For arrays of 100000 and more (that is one hundred thousand) elements, the grid version of the algorithm with an outer loop across the kernel calls develops the following abysmal performance:

Using device 0: Geforce 8800 GT

Number of elements sorted: 1000

Grid is 1 blocks wide, block has 512 threads

Processing time of odd-even sort on GPU: 1.282789 (ms)

Processing time of quicksort on host CPU: 0.160888 (ms)

GPU vs. CPU speedup factor: 0.125421

Test PASSED

Number of elements sorted: 10000

Grid is 10 blocks wide, block has 512 threads

Processing time of odd-even sort on GPU: 21.205631 (ms)

Processing time of quicksort on host CPU: 2.185392 (ms)

GPU vs. CPU speedup factor: 0.103057

Test PASSED

Number of elements sorted: 100000

Grid is 98 blocks wide, block has 512 threads

Processing time of odd-even sort on GPU: 1121.042236 (ms)

Processing time of quicksort on host CPU: 27.388626 (ms)

GPU vs. CPU speedup factor: 0.024431

Test PASSED

Number of elements sorted: 1000000

Grid is 977 blocks wide, block has 512 threads

Processing time of odd-even sort on GPU: 103628.375000 (ms)

Processing time of quicksort on host CPU: 250.517975 (ms)

GPU vs. CPU speedup factor: 0.002417

Test PASSED

Press ENTER to exit...

Well… at least it passed the test against quicksort concerning correctness.

For a million elements, you end up with about 100 seconds of GPU runtime, whereas the CPU ended up using 250ms. OUCH. Still correct results, but crazy run time.

So the lesson learnt here is:

  • First pick an appropriate algorithm that scales, only then start optimizing

I’ll post the complete source package tomorrow. It is quite well commented, I think.

Christian

Find the source code to this project here. VisualC++ 2005 Express project file included.

Best unzip to your CUDASDK\projects folder. Enjoy.

Christian
myproject1.zip (7.81 KB)

Thanks a lot!
I’m going to try to work it out right now.
=)

I’ve employed some more tricks like loop unrolling and getting rid of special cases (i.e. branch divergence)… Now I am down to 61 seconds for 1 million elements. I’ve attached this version as V2.

I believe a giant speedup could be obtained by sorting each warp independently, that is, 64 floats per warp, and by calling the kernel more often with an offset of WARPSIZE (32) elements.

This way each warp terminates independently from other warps (no more swap flags shared among warps are needed), we can get rid of most __syncthreads() calls, and the GPU will be much better utilized.

So now how do I generate warp-local variables? Argh ;)
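One way that might work (a sketch, assuming a 512-thread block and lockstep execution within a warp): give every warp its own slot in a shared array, indexed by the warp number. Within a warp no __syncthreads() is needed for the flag to be visible, but it should be volatile so it is not cached in a register:

#define WARPSIZE 32

// one swap flag per warp: the threads of a warp execute in lockstep, so a
// warp can reset, set and test its own flag without any __syncthreads()
__shared__ volatile bool warpSwapped[512 / WARPSIZE];

int warp = threadIdx.x / WARPSIZE;
warpSwapped[warp] = false;  // all threads of the warp write the same value
// ... the warp's compare-exchange step sets warpSwapped[warp] = true on a swap ...
if (!warpSwapped[warp]) {
    // this warp's 64-float bin is sorted; it can stop iterating
}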

Christian
myproject1_V2.zip (9.67 KB)

Down to 42 seconds for a million elements! Updated version (V3) appended.

This was achieved by making sure the writes to the swap flags in each half-warp (8 threads) cannot trigger bank conflicts. This means I need 8 separate swap flags. For checking the exit condition, I have to OR these flags together. That stripped 20 seconds from the tally. Impressive!
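Roughly, that arrangement might look like the fragment below; this is a reconstruction from the description above (names made up), the actual flag layout is in the attached project:

// inside the sort loop: 8 separate swap flags, laid out so the concurrent
// writes avoid shared memory bank conflicts (spacing as in the project)
__shared__ bool swapflag[8];

// ... each compare-exchange sets swapflag[threadIdx.x % 8] on a swap ...

// exit check: OR all eight flags together
__syncthreads();
if (!(swapflag[0] | swapflag[1] | swapflag[2] | swapflag[3] |
      swapflag[4] | swapflag[5] | swapflag[6] | swapflag[7]))
    break;   // no thread swapped during the last iteration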

More speedups could be gained by analyzing the PTX assembly code and optimizing the swap procedure, if possible. OK, it’s rather pointless trying to make this algorithm fast, but it’s a good learning experience…

I’ve also got a version that works completely without __syncthreads() by making each sort local to a warp. So 64 floats are locally sorted by 32 threads in completely separate bins. Then the outer loop on the host CPU offsets every 2nd kernel invocation by 32 floats relative to the original sort bins. Unfortunately the speed gain is negated by the much larger number of kernel invocations required in the outer loop. I am currently at 80 seconds for 1 million elements in this version.
myproject1_V3.zip (6.35 KB)

I was examining the bitonic sort code and wondering whether we could do away with the flags, but I found that it is rather impossible, unless we know for sure that after a certain number of passes the array will definitely be sorted (so that we can use a for loop, just like in the bitonic sort from the SDK).

Also, I was thinking that instead of synchronizing inside the kernel, why don’t we use 2 kernels (odd-even sort and even-odd sort), so that synchronization happens automatically as control is passed back to the CPU between launches. But I’ve heard that transferring data between the CPU and GPU takes up the most time, so I think this method would yield even worse results.
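For what it’s worth, the two-kernel idea would look roughly like the sketch below. Note that the array itself can stay in GPU global memory between launches, so the per-iteration cost is the kernel launch overhead rather than a CPU-GPU copy (kernel and variable names here are made up):

// one launch per pass: the implicit synchronization between kernel
// launches replaces __syncthreads(), and it also works across blocks
__global__ void oddPass(float *a, int n)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);    // pairs (0,1),(2,3),...
    if (i + 1 < n && a[i] > a[i+1]) {
        float t = a[i]; a[i] = a[i+1]; a[i+1] = t;
    }
}

__global__ void evenPass(float *a, int n)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + 1; // pairs (1,2),(3,4),...
    if (i + 1 < n && a[i] > a[i+1]) {
        float t = a[i]; a[i] = a[i+1]; a[i+1] = t;
    }
}

// host side: n passes always suffice for odd-even transposition sort,
// so (n+1)/2 iterations of the (odd, even) pair are guaranteed to finish
for (int it = 0; it < (n + 1) / 2; it++) {
    oddPass<<<blocks, threads>>>(d_a, n);    // blocks*threads >= n/2
    evenPass<<<blocks, threads>>>(d_a, n);
}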

Maybe instead of an odd-even bubble sort, we can try an odd-even mergesort using CUDA, so the number of passes can be determined in advance.
Got the idea from this site: http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/networks/oemen.htm
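For reference, Batcher’s odd-even mergesort has a fixed, data-independent number of compare-exchange phases (log2(n)·(log2(n)+1)/2 for n a power of two), so the loop bounds are known in advance. Here is the classic iterative formulation as a plain CPU sketch (not CUDA-tuned; each (p, k) phase consists of independent compare-exchanges, so each phase could map onto one kernel launch):

// Batcher's odd-even mergesort, iterative form; n must be a power of two
void oddEvenMergeSort(float *a, int n)
{
    for (int p = 1; p < n; p *= 2)
        for (int k = p; k >= 1; k /= 2)
            for (int j = k % p; j + k < n; j += 2 * k)
                for (int i = 0; i < k && i + j + k < n; i++)
                    if ((i + j) / (2 * p) == (i + j + k) / (2 * p))
                        if (a[i + j] > a[i + j + k]) {
                            // compare-exchange
                            float t = a[i + j];
                            a[i + j] = a[i + j + k];
                            a[i + j + k] = t;
                        }
}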

Anyway thanks.