Hi, this might not be the same on all GPUs, but on my P1000 the maximum thread block dimensions are reported as (1024, 1024, 64), with a maximum of 1024 threads per block.
I initially thought that 3D blocks are just an abstraction and can be treated as if they were laid out as a 1D array. For example, with a warp size of 32, a thread block of dimensions (16, 2, 1) would execute as 1 warp, the same as a thread block of dimensions (32, 1, 1). But if that’s the case, there shouldn’t be any reason why a thread block of dimensions (1, 1, 1024) wouldn’t behave the same as one of dimensions (1, 1024, 1) or (1024, 1, 1). So why is the z dimension limited to 64 when x and y can each go up to 1024?
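To make my assumption concrete, here is a small host-side C++ sketch (not device code) of the mapping I have in mind. It uses the linearization given in the CUDA programming guide, where thread (x, y, z) in a block of size (Dx, Dy, Dz) has thread ID x + y*Dx + z*Dx*Dy, and it assumes that consecutive thread IDs are grouped into warps of 32. The names `Dim3`, `linear_tid`, `warp_of`, `warps_per_block` and `kWarpSize` are made up for illustration only.

```cpp
// Host-side illustration only: this does not launch anything on the GPU.
// Dim3 is a hypothetical stand-in for CUDA's dim3 / threadIdx / blockDim.
struct Dim3 {
    unsigned x, y, z;
};

// Assumption: warp size of 32, as in the example in my question.
constexpr unsigned kWarpSize = 32;

// Linear thread ID of thread t inside a block of dimensions b:
// x varies fastest, then y, then z.
constexpr unsigned linear_tid(Dim3 t, Dim3 b) {
    return t.x + t.y * b.x + t.z * b.x * b.y;
}

// Index of the warp (within the block) that thread t would belong to,
// assuming warps are formed from consecutive linear thread IDs.
constexpr unsigned warp_of(Dim3 t, Dim3 b) {
    return linear_tid(t, b) / kWarpSize;
}

// Number of warps needed for a block, rounding up.
constexpr unsigned warps_per_block(Dim3 b) {
    return (b.x * b.y * b.z + kWarpSize - 1) / kWarpSize;
}

// A (16, 2, 1) block and a (32, 1, 1) block both fit into a single warp.
static_assert(warps_per_block(Dim3{16, 2, 1}) == 1, "one warp expected");
static_assert(warps_per_block(Dim3{32, 1, 1}) == 1, "one warp expected");

// The last thread of the (16, 2, 1) block gets linear ID 31, i.e. warp 0.
static_assert(linear_tid(Dim3{15, 1, 0}, Dim3{16, 2, 1}) == 31, "tid 31");
static_assert(warp_of(Dim3{15, 1, 0}, Dim3{16, 2, 1}) == 0, "warp 0");
```

Under this model, `warps_per_block` gives 32 for both (1024, 1, 1) and (1, 1024, 1), and the arithmetic would give the same answer for (1, 1, 1024), which is why the smaller z limit surprises me.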