Correct setting of kernel parameters - problem with number of blocks

My GPU can run 512 threads per block and has a block dimension limit of 512 x 512 x 64. So I launch the kernel in one dimension like this:

<<<numblocks, threadsperblock>>>

The maximum value of numblocks is 512, right?

But my code runs with much higher values and still gives correct results. Why is this happening?

Below is the kernel code, in case it helps. It writes every row number in binary form into a 1D array (which represents a 2D matrix). I now run it with several threads per block, so id is computed as:

int id = threadIdx.x + blockDim.x * blockIdx.x;

but at the start (when I first noticed the problem) I ran it with one thread per block, so id was:

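int id = blockIdx.x;

__global__ void writeCubeInBinaryForm(bool *dataArray, int n, long long int twoPowerN, 
									  long long int arraySize)
{
	// Global thread index; with one thread per block this was just blockIdx.x.
	int id = threadIdx.x + blockDim.x * blockIdx.x;
	int idx, bufor = id;
	// Write the n lowest bits of id into row id, least significant bit first.
	for (int i = 0; i < n; i++)
	{
		idx = i + id * n;          // position of bit i within row id
		dataArray[idx] = bufor % 2;
		bufor /= 2;                // move on to the next bit
	}
}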

For what you are doing, try the following:

dim3 threadsperblock(512); // can be less than 512, recommend it be a multiple of 32
dim3 numblocks( x/threadsperblock.x ); // where x is the total number of threads you need

and to get a unique id for each thread, use the line:

int id = threadIdx.x + blockDim.x * blockIdx.x;
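
A note on the block count above: x/threadsperblock.x truncates, so it only covers every element when x is a multiple of the block size. A common pattern is to round the division up and guard the kernel body against the extra threads in the last block. The following is a minimal sketch of that pattern, not the original poster's code; the kernel name fillIds and the work size of 1000 are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its own global index.
__global__ void fillIds(int *out, int total)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < total)          // threads past the end of the array do nothing
        out[id] = id;
}

int main()
{
    const int total = 1000;  // assumed work size, deliberately not a multiple of 512
    dim3 threadsperblock(512);
    // Round up so the final, partially filled block is still launched.
    dim3 numblocks((total + threadsperblock.x - 1) / threadsperblock.x);

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, total * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    fillIds<<<numblocks, threadsperblock>>>(d_out, total);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Copy back the last element to confirm the tail of the array was written.
    int last = -1;
    cudaMemcpy(&last, d_out + total - 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("blocks = %u, last id = %d\n", numblocks.x, last);

    cudaFree(d_out);
    return 0;
}
```

With total = 1000 and 512 threads per block, the rounded-up division launches 2 blocks, whereas the truncating form would launch only 1 and leave the last 488 elements unwritten.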

Thanks for the reply, kbam. I do it this way now. But I'm still curious: why does my code run correctly when I launch it with the following settings?

dim3 threadsperblock(1);
dim3 numblocks(1024);
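
One way to investigate this is to print the device limits directly. The 512 x 512 x 64 figure has the shape of the per-block thread limit (maxThreadsDim in the runtime API), while the number of blocks in a launch is bounded by a separate grid limit (maxGridSize). On devices with a 512-thread block limit the grid limit is typically 65535 per dimension, in which case 1024 blocks is well within range. The sketch below assumes device 0 is the GPU in question and simply prints both limits so they can be compared.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Device index 0 is an assumption; adjust it on multi-GPU systems.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query CUDA device 0\n");
        return 1;
    }

    // Limits on the threads inside a single block.
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    // Limits on the number of blocks in a grid (this bounds numblocks).
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

If the first grid dimension printed is larger than 1024, a launch with numblocks(1024) and threadsperblock(1) is valid, although one thread per block leaves most of each warp idle.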