CUDA kernel only works up to a certain number of elements (array of N elements)


When I try to find a specific number in an array (the smallest one), the kernel only works for some array lengths, not for all of them.

For example, it works if N=3560 or less, but it does not work if N=4560 or more…

Are there any limits I am not respecting?

Note: on exit, the result (the smallest number) is returned in the first element of the array…

const int N=4560;

const int threadsperblock=512;

const int blocksPerGrid=((N/2)/threadsperblock)+1;

__global__ void minor(int *array)
{
	int tid=threadIdx.x + (blockIdx.x * blockDim.x);

	int nMiddle=(N/2)+1;

	while(nMiddle > 0)
	{
	   if((tid < nMiddle) && (tid+nMiddle < N))
	   {
	      if(array[tid+nMiddle] < array[tid])
	      {
	         ...
	      }
	   }

	   if(nMiddle >= 3)
	      nMiddle=(nMiddle/2) + 1;
	   ...
	}
}

for (int i=0; i<N; i++)
{
	a[i] = N-i;
	printf("%d ",N-i);
}


if(ret != cudaSuccess)
{
   printf("Error cudaMemcpy 1\n");
}




Env: Windows 7, NVIDIA Quadro FX 580



__syncthreads() synchronizes only within a block, not between blocks, so your algorithm does not work in general.

Look at the reduction example in the SDK for how to implement this properly.
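One way to keep the halving idea while avoiding the need for a barrier between blocks is to launch the kernel once per halving step, since kernels issued to the same stream run one after another. The following is only a minimal sketch under that assumption, not the SDK reduction sample; `foldMin` and `minByRepeatedLaunch` are hypothetical names, and the block size of 256 is an arbitrary choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One folding step over the first n elements: element i becomes
// min(array[i], array[i + half]). Writes go to indices below half and the
// partner reads come from indices at or above half, so no thread reads a
// value that another thread writes during the same launch.
__global__ void foldMin(int *array, int n, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half && i + half < n) {
        int other = array[i + half];
        if (other < array[i])
            array[i] = other;
    }
}

// Hypothetical host helper (not from the original post). Assumes n >= 1.
// Each launch finishes before the next one starts, so the launch boundary
// plays the role of a barrier across all blocks.
int minByRepeatedLaunch(const int *h_array, int n)
{
    const int threads = 256;  // assumed block size
    int *d_array = NULL;
    cudaMalloc((void **)&d_array, n * sizeof(int));
    cudaMemcpy(d_array, h_array, n * sizeof(int), cudaMemcpyHostToDevice);

    while (n > 1) {
        int half = (n + 1) / 2;                       // rounds up for odd n
        int blocks = (half + threads - 1) / threads;  // enough threads for half
        foldMin<<<blocks, threads>>>(d_array, n, half);
        n = half;                                     // survivors are in [0, half)
    }

    int result = 0;
    // A blocking copy on the default stream waits for the last kernel.
    cudaMemcpy(&result, d_array, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    return result;
}

int main()
{
    const int N = 4560;
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = N - i;  // same initialization as in the question
    printf("minimum = %d\n", minByRepeatedLaunch(a, N));
    return 0;
}
```

This costs roughly log2(N) launches and omits error checking, so it is likely slower than a shared-memory reduction, but it shows where the cross-block ordering has to come from.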

I assume your code is really meant to sort arrays, not just find the minimum.

What happens between N= 3560 and 4560?

When N=3560, it reports 1 as the minimum, which is correct because I initialize the array with the values 1 to N.
When N=4560 or more, it reports 1250 as the minimum. I printed the whole returned array, and 1250 is in the first element.

Yes, the algorithm sorts in ascending order, and then I print array[0] as the minimum value of the array…

Well, if __syncthreads() only synchronizes within each block, that may be the problem… Maybe I have to use a shared variable in each block, so that each block computes its own minimum and the CPU then takes the minimum of all the returned minimums, right?
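The two-stage idea described above could look roughly like the sketch below: each block reduces its slice in shared memory, where __syncthreads() is sufficient, and the host takes the minimum of the per-block results. The names `blockMin`, `findMin` and `THREADS` are hypothetical, and the code assumes the block size is a power of two and that n is at least 1.

```cuda
#include <cstdio>
#include <climits>
#include <vector>
#include <cuda_runtime.h>

const int THREADS = 256;  // assumed block size; must be a power of two

// Each block writes the minimum of its slice of `in` to blockResults[blockIdx.x].
// Must be launched with blockDim.x == THREADS.
__global__ void blockMin(const int *in, int *blockResults, int n)
{
    __shared__ int cache[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the array contribute a neutral value.
    cache[tid] = (i < n) ? in[i] : INT_MAX;
    __syncthreads();

    // Tree reduction inside the block. The barrier sits outside the
    // conditional so that every thread of the block reaches it.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride && cache[tid + stride] < cache[tid])
            cache[tid] = cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResults[blockIdx.x] = cache[0];
}

// Hypothetical host helper: runs the kernel once, then finishes on the CPU.
int findMin(const int *h_in, int n)
{
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = NULL;
    int *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    blockMin<<<blocks, THREADS>>>(d_in, d_out, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(&partial[0], d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // Final pass on the CPU over one value per block.
    int best = INT_MAX;
    for (int b = 0; b < blocks; b++)
        if (partial[b] < best)
            best = partial[b];
    return best;
}

int main()
{
    const int N = 4560;
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = N - i;  // same initialization as in the question
    printf("minimum = %d\n", findMin(a, N));
    return 0;
}
```

Unlike the in-place version in the question, this leaves the input array unchanged and only finds the minimum; it does not sort anything. For N=4560 the host loop only has to look at 18 partial results, so finishing on the CPU is cheap.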