How to use dim3 threadsPerBlock and numBlocks when parallelizing loops

I have the following kernel:

__global__ void kernel(int *d_array) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 2000000 && y < 12500 && z < 100000) {
        /* do stuff */
    }
}

It compiles fine, but at runtime the launch fails with: invalid configuration argument.
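For reference, here is a minimal sketch of how I see the error, with an error check right after the launch (the main scaffolding and the cudaMalloc size are just placeholders):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *d_array) { /* as above */ }

int main() {
    int *d_array;
    cudaMalloc(&d_array, 1024 * sizeof(int));  // placeholder allocation

    dim3 threadsPerBlock(1024, 1024, 64);      // the configuration I tried
    dim3 numBlocks(1954, 13, 1563);
    kernel<<<numBlocks, threadsPerBlock>>>(d_array);

    // The launch is rejected before the kernel ever runs, so the error
    // is available immediately:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));  // prints "invalid configuration argument"

    cudaFree(d_array);
    return 0;
}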

In the end I found out that I can only use dim3 threadsPerBlock as follows:

dim3 threadsPerBlock(1, 32, 32);

The CUDA C Programming Guide says: “A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice.” That does not help here, and I could not find any information on how dim3 threadsPerBlock works when all three dimensions are used.

So why am I not able to use something like dim3 threadsPerBlock(1024, 1024, 64) together with dim3 numBlocks(1954, 13, 1563)?

Appendix H of the CUDA Programming Guide states that the maximum number of threads per block is 1024 across all currently supported GPU architectures. The per-dimension block limits (1024 in x, 1024 in y, 64 in z) apply individually, but the product of the three dimensions must also stay at or below 1024. threadsPerBlock(1024, 1024, 64) tries to configure a block of 1024 x 1024 x 64 ≈ 67 million threads, far beyond that limit, which is why the launch is rejected with invalid configuration argument.
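For the sizes in your kernel, a configuration along these lines stays within the limits. This is only a sketch: divUp is a hypothetical ceiling-division helper, and d_array is assumed to be allocated already.

// Hypothetical helper: ceiling division, so the grid covers the whole range.
unsigned int divUp(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

dim3 threadsPerBlock(1, 32, 32);  // 1 * 32 * 32 = 1024 threads, at the limit
dim3 numBlocks(divUp(2000000, threadsPerBlock.x),   // 2,000,000 blocks in x
               divUp(12500,   threadsPerBlock.y),   //       391 blocks in y
               divUp(100000,  threadsPerBlock.z));  //     3,125 blocks in z
kernel<<<numBlocks, threadsPerBlock>>>(d_array);

Each grid dimension is within its per-dimension limit (x up to 2^31 - 1, y and z up to 65535), and the if-guard in your kernel handles the overhang in y (391 * 32 = 12512 > 12500). You can also query the actual limits of your device at runtime instead of hard-coding them:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// prop.maxThreadsPerBlock  -> 1024 on current GPUs
// prop.maxThreadsDim[0..2] -> per-dimension block limits, e.g. 1024, 1024, 64
// prop.maxGridSize[0..2]   -> per-dimension grid limits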

Thanks, I missed that one. So that means threadsPerBlock(1, 32, 32) gives 1 x 32 x 32 = 1024 threads per block, right?

Thanks, I found a post by Robert_Crovella explaining that.