Problem with my Bitonic Sort Program The device function gets stuck after a point

Hi all.

I have a problem with my “bitonic sort” program that I wrote. Its basically supposed to be able to sort arrays of powers-of-two size. As far as I can tell, the logic of the code is correct, but the code only manages to give correct answer for array size up to 4096.

Here are the functions that basically do the main sorting/merging job:

[codebox]__device__ void merge_bitonic_sub(float * array, int num_merge)

{

__shared__ float sarr[NUM_ELEMENTS];

int bx = blockIdx.x;

int tx = threadIdx.x;

int p = num_merge/1024;

int k, i, j;

for(;num_merge>1;) {

if((bx&p)==0)  { //INCREASING ORDER

   if(num_merge==1024) {

	   k = num_merge/2;

	   if(array[bx*1024 + (tx/k)*k + tx] > array[bx*1024 + (tx/k)*k + tx + k]) {

		  swap(array[bx*1024 + (tx/k)*k + tx], array[bx*1024 + (tx/k)*k + tx + k]);

	   }

   }   

   else if(num_merge<=512&&num_merge!=1024) {

	   k = num_merge/2;

	   if(num_merge==512) {

		  sarr[tx] = array[bx*BLOCK_SIZE + tx];

		  sarr[tx + 512] = array[bx*BLOCK_SIZE + tx + 512];

	   }

	   if(sarr[tx + (tx/k)*k] > sarr[tx + (tx/k)*k + k])  {

		   swap(sarr[tx + (tx/k)*k], sarr[tx + (tx/k)*k + k]);

	   }

   }

   else {

	   k = log2f(num_merge/1024);

	   i = (bx/(2*k))*((int)exp2f(10+k)) + (bx%(2*k))*512 + tx;

	   j = (bx/(2*k))*((int)exp2f(10+k)) + (bx%(2*k))*512 + tx + ((int)exp2f(9+k));

	   if(array[i] > array[j]) {

		   swap(array[i], array[j]);

	   }

   }

   __syncthreads(); 

}

else { //DECREASING ORDER

   if(num_merge==1024) {

	   k = num_merge/2;

	   if(array[bx*1024 + (tx/k)*k + tx] < array[bx*1024 + (tx/k)*k + tx + k]) {

		  swap(array[bx*1024 + (tx/k)*k + tx], array[bx*1024 + (tx/k)*k + tx + k]);

	   }

   }

   else if(num_merge<=512&&num_merge!=1024) {

	   k = num_merge/2;

	   if(num_merge==512) {

		  sarr[tx] = array[bx*BLOCK_SIZE + tx];

		  sarr[tx + 512] = array[bx*BLOCK_SIZE + tx + 512];

	   }

	   if(sarr[tx + (tx/k)*k] < sarr[tx + (tx/k)*k + k])  {

		   swap(sarr[tx + (tx/k)*k], sarr[tx + (tx/k)*k + k]);

	   }

   }  

   else {

	   k = log2f(num_merge/1024);

	   i = (bx/(2*k))*((int)exp2f(10+k)) + (bx%(2*k))*512 + tx;

	   j = (bx/(2*k))*((int)exp2f(10+k)) + (bx%(2*k))*512 + tx + ((int)exp2f(9+k));

	   if(array[i] < array[j]) {

		   swap(array[i], array[j]);

	   }

   }

   __syncthreads(); 

}

num_merge = num_merge/2;

}

array[bx*BLOCK_SIZE + tx] = sarr[tx];

array[bx*BLOCK_SIZE + tx + 512] = sarr[tx + 512];

}[/codebox]

Basically, I call the first function from main(), and it gets an input array of size “2^x” ( >=1024). Also, each thread block always processes 1024 nos.

Now, the input is such that every 1024 numbers in the array are bitonic sequences (i.e. the first 512 no.s increase, and next 512 no.s decrease). This is done by another function.

Now, it works fine for array size of up to 4096. But when I try an 8192 sized array, the 2nd function (merge_bitonic_sub) gets stuck in its last iteration, on device emulation (without emulation it will return control to main() but gives the wrong answer). I found this out by placing printfs in the functions and checking the values of threadIdx.x and BlockIdx.x. Turns out that the last few thread blocks don’t finish execution (i.e. they kind of hang or halt, and execution doesn’t proceed in emulation mode)

Without emulation mode, if I try to sort 8192 no.s, the program finishes execution but the output is same as input. I am guessing that this is because the subsequent cudaMemcpy doesn’t execute.

I don’t understand why the blocks don’t execute in the last iteration. Can anyone please tell me a way to get rid of this problem, I would be very glad.

System I am running >>

Ubuntu 8.04, Quadro FX 1400.

Please reply to my post. I am really stuck and unable to debug my code.