CUDA kernel only works up to a specific maximum number of elements (array of N elements)

Hi,

When trying to find a specific number in an array (the smallest), my kernel only works for some array lengths, not for all of them.

For example, if N=3560 or less it works, but if N=4560 or bigger it does not work…

Are there any limits I am not respecting?

Note: I get the result (the smallest number) in the first element of the array on exit…

const int N=4560;
const int threadsperblock=512;
const int blocksPerGrid=((N/2)/threadsperblock)+1;

__global__ void minor(int *array)
{
    int tid=threadIdx.x + (blockIdx.x * blockDim.x);
    int nMiddle=(N/2)+1;

    while(nMiddle > 0)
    {
        if((tid < nMiddle) && (tid+nMiddle < N))
        {
            if(array[tid+nMiddle] < array[tid])
            {
                array[tid]=array[tid+nMiddle];
            }
        }
        __syncthreads();

        if(nMiddle >= 3)
            nMiddle=(nMiddle/2) + 1;
        else
            nMiddle=(nMiddle/2);
    }
}

...

for (int i=0; i<N; i++)
{
    a[i] = N-i;
    printf("%d ",N-i);
}

ret=cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
if(ret != cudaSuccess)
{
    printf("Error cudaMemcpy 1\n");
    exit(1);
}

minor<<<blocksPerGrid,threadsperblock>>>(dev_a);

...

Env: Windows 7, NVIDIA Quadro FX 580

Thanks

JoseBB

__syncthreads() synchronizes only within a block, not between blocks, so your algorithm does not work in general.

Look at the reduction example in the SDK for how to implement this properly.

I assume your code is really meant to sort arrays, not just find the minimum.

What happens between N= 3560 and 4560?

When N = 3560 it says that 1 is the minimum. That is correct, because the array is initialized with the values N down to 1.
When N = 4560 or more it says that the minimum is 1250. I printed the whole returned array, and 1250 is in the first element.

Yes, the algorithm is meant to sort in ascending order, and then I print array[0] as the minimum value of the array…

Well, if __syncthreads() only synchronizes inside each block, that may be the problem… Maybe I have to use a shared variable in each block, so that each block computes its own minimum, and then the CPU takes the minimum of all the returned minimums, right?

JoseBB
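
The idea proposed above (one minimum per block in shared memory, then a final pass on the CPU) could look roughly like the sketch below. It is not the SDK reduction sample, only a minimal illustration. The names blockMin, findMin and THREADS_PER_BLOCK are made up for this example, and it assumes n >= 1, a block size that is a power of two, and int data as in the original post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <climits>
#include <cuda_runtime.h>

// Assumed block size; the halving loop below needs a power of two.
const int THREADS_PER_BLOCK = 512;

// Each block reduces its own slice of `in` to a single minimum
// and stores it in out[blockIdx.x]. Blocks never touch each other's data.
__global__ void blockMin(const int *in, int *out, int n)
{
    __shared__ int cache[THREADS_PER_BLOCK];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Threads past the end of the array contribute INT_MAX (neutral for min).
    cache[threadIdx.x] = (tid < n) ? in[tid] : INT_MAX;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride &&
            cache[threadIdx.x + stride] < cache[threadIdx.x])
            cache[threadIdx.x] = cache[threadIdx.x + stride];
        // Safe here: `cache` is only shared by the threads of this block.
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

// Hypothetical host helper: returns the minimum of a[0..n-1], n >= 1.
int findMin(const int *a, int n)
{
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int *partial = (int *)malloc(blocks * sizeof(int));
    int *dev_in = NULL, *dev_out = NULL;

    cudaMalloc((void **)&dev_in, n * sizeof(int));
    cudaMalloc((void **)&dev_out, blocks * sizeof(int));
    cudaMemcpy(dev_in, a, n * sizeof(int), cudaMemcpyHostToDevice);

    blockMin<<<blocks, THREADS_PER_BLOCK>>>(dev_in, dev_out, n);

    // cudaGetLastError catches launch failures; the blocking copy
    // then waits for the kernel before reading the per-block results.
    cudaError_t ret = cudaGetLastError();
    if (ret == cudaSuccess)
        ret = cudaMemcpy(partial, dev_out, blocks * sizeof(int),
                         cudaMemcpyDeviceToHost);
    if (ret != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    // Final pass on the CPU over one value per block.
    int result = partial[0];
    for (int b = 1; b < blocks; b++)
        if (partial[b] < result)
            result = partial[b];

    cudaFree(dev_in);
    cudaFree(dev_out);
    free(partial);
    return result;
}
```

With the array filled as in the original post (a[i] = N-i), findMin(a, N) should return 1 for N=4560 as well, since no step relies on synchronization between blocks. The input array is left unmodified, so this only finds the minimum and does not sort anything.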