Does blockDim component order matter?

I’m running a realistic kernel and a trivial kernel on a GeForce GTX 960 with CUDA 7.5 on Linux. For the realistic kernel,
a launch with block dimensions dim3(1, 1, 512) fails with cudaErrorInvalidConfiguration, but it works with
dim3(512, 1, 1), or with a z-dimension of up to dim3(1, 1, 64). The trivial kernel appears to run with all of them, but if I save blockDim.z from inside the kernel, the value never exceeds 64.

Clearly, this is not an issue with my occupancy, since the total number of threads remains the same.

Why are the blockDim components treated asymmetrically?


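The maximum block z-dimension is limited by the hardware to 64, while the x- and y-dimensions can each go up to 1024. This is documented in the compute capability technical specifications table of the CUDA C Programming Guide.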
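
The per-dimension limits can also be read from the device at runtime, and a rejected launch can be detected by checking the error state right after the launch. The following is only a minimal sketch, not the code from the question: the kernel `recordBlockDimZ` and the helper `blockShapeFits` are hypothetical names made up for illustration, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical trivial kernel: one thread stores blockDim.z so the host can inspect it.
__global__ void recordBlockDimZ(unsigned int *out)
{
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
        *out = blockDim.z;
}

// Hypothetical helper: checks a block shape against the per-dimension limits
// and the total threads-per-block limit reported in cudaDeviceProp.
bool blockShapeFits(dim3 block, const cudaDeviceProp &prop)
{
    return block.x <= (unsigned int)prop.maxThreadsDim[0]
        && block.y <= (unsigned int)prop.maxThreadsDim[1]
        && block.z <= (unsigned int)prop.maxThreadsDim[2]
        && block.x * block.y * block.z <= (unsigned int)prop.maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsDim = (%d, %d, %d), maxThreadsPerBlock = %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2],
           prop.maxThreadsPerBlock);

    unsigned int *d_out = NULL;
    err = cudaMalloc((void **)&d_out, sizeof(unsigned int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The three block shapes mentioned in the question.
    dim3 shapes[3] = { dim3(512, 1, 1), dim3(1, 1, 64), dim3(1, 1, 512) };
    for (int i = 0; i < 3; ++i) {
        dim3 block = shapes[i];
        recordBlockDimZ<<<1, block>>>(d_out);
        // A launch with an invalid configuration never runs; the failure is
        // only visible if the error state is queried after the launch.
        err = cudaGetLastError();
        printf("dim3(%u, %u, %u): fits=%d, launch status: %s\n",
               block.x, block.y, block.z,
               (int)blockShapeFits(block, prop), cudaGetErrorString(err));
        cudaDeviceSynchronize();
    }

    cudaFree(d_out);
    return 0;
}
```

On a device that reports a z-limit of 64, the dim3(1, 1, 512) launch should be the one that reports an invalid configuration. A kernel launched without such a check can look as if it ran even though the launch was rejected.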

Silly me! Thanks!