Why is the z dimension limit smaller than the total thread block size limit?

Hi, this might not be exactly the same on all GPUs, but on my P1000 the max dimensions of a thread block are listed as (1024,1024,64), with the max number of threads per block being 1024.

I initially thought that 3D blocks are just an abstraction and can be treated as if they were laid out as a 1D array. For example, with a warp size of 32, a thread block of dimensions (16,2,1) would execute in 1 warp, the same as a thread block of dimensions (32,1,1). But if that’s the case, there shouldn’t be any reason why a thread block of dimensions (1,1,1024) wouldn’t be the same as one of dimensions (1,1024,1) or (1024,1,1). So why is this limitation in place?
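To make the 1D-layout assumption concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It uses the linearization given in the CUDA Programming Guide, id = x + y * Dx + z * Dx * Dy, and assumes a warp size of 32. The names Dim3, linearThreadId, warpIndex, and warpsPerBlock are hypothetical helpers made up for illustration, not part of the CUDA API.

```cpp
// Hypothetical host-side stand-in for CUDA's dim3 / threadIdx (illustration only).
struct Dim3 { unsigned x, y, z; };

// Linear thread ID inside a block: id = x + y * Dx + z * Dx * Dy.
constexpr unsigned linearThreadId(Dim3 idx, Dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Warp that a thread falls into, assuming consecutive linear IDs are grouped
// into warps of warpSize threads.
constexpr unsigned warpIndex(Dim3 idx, Dim3 dim, unsigned warpSize = 32) {
    return linearThreadId(idx, dim) / warpSize;
}

// Number of warps needed to cover the whole block (rounded up).
constexpr unsigned warpsPerBlock(Dim3 dim, unsigned warpSize = 32) {
    return (dim.x * dim.y * dim.z + warpSize - 1) / warpSize;
}

// A (16,2,1) block and a (32,1,1) block both fit in a single warp.
static_assert(warpsPerBlock(Dim3{16, 2, 1}) == 1, "(16,2,1) is one warp");
static_assert(warpsPerBlock(Dim3{32, 1, 1}) == 1, "(32,1,1) is one warp");

// The last thread of the (16,2,1) block has linear ID 31, still in warp 0.
static_assert(linearThreadId(Dim3{15, 1, 0}, Dim3{16, 2, 1}) == 31, "last thread");
static_assert(warpIndex(Dim3{15, 1, 0}, Dim3{16, 2, 1}) == 0, "same warp");
```

So as far as warp grouping goes, the two shapes behave the same under this sketch; the question is only about why the z extent has its own, smaller cap.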

It’s a hardware limit. You can think of a 3D block as an “abstraction” of a 1D block, but there is more to it than that. For example, the hardware supports retrieval of the thread indices, and that means retrieval of 3-dimensional thread indices. That is just one example of how the hardware interacts with the code in this case. So the hardware has limits on what it can support in each dimension.
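As a rough illustration of how the two kinds of limits combine, here is a small host-side C++ sketch that checks a block shape against both the per-dimension caps and the total thread cap. The numbers are the ones quoted in the question for the P1000. Dim3, BlockLimits, kLimits, and isValidBlock are hypothetical names for this example only; in a real program you would read the limits from cudaGetDeviceProperties (the maxThreadsDim and maxThreadsPerBlock fields of cudaDeviceProp) instead of hard-coding them.

```cpp
// Hypothetical host-side stand-in for CUDA's dim3 (illustration only).
struct Dim3 { unsigned x, y, z; };

// Per-dimension caps plus the cap on total threads per block.
struct BlockLimits { unsigned maxX, maxY, maxZ, maxThreads; };

// Values quoted in the question: (1024,1024,64) and 1024 threads per block.
// Assumption: hard-coded here; other devices may report different values.
constexpr BlockLimits kLimits{1024, 1024, 64, 1024};

// A block shape is launchable only if every dimension is at least 1, every
// dimension is within its own cap, AND the product is within the total cap.
constexpr bool isValidBlock(Dim3 d, BlockLimits lim = kLimits) {
    return d.x >= 1 && d.y >= 1 && d.z >= 1 &&
           d.x <= lim.maxX && d.y <= lim.maxY && d.z <= lim.maxZ &&
           static_cast<unsigned long long>(d.x) * d.y * d.z <= lim.maxThreads;
}

static_assert(isValidBlock(Dim3{1024, 1, 1}), "x may use all 1024 threads");
static_assert(isValidBlock(Dim3{1, 1024, 1}), "y may use all 1024 threads");
static_assert(!isValidBlock(Dim3{1, 1, 1024}), "z is capped at 64");
static_assert(isValidBlock(Dim3{16, 1, 64}), "z = 64 is fine, 1024 threads total");
static_assert(!isValidBlock(Dim3{32, 32, 2}), "2048 threads exceeds the total cap");
```

The two checks are independent: (1,1,1024) fails only the z cap, while (32,32,2) passes every per-dimension cap and fails only the total.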