3D thread blocks and arrays

Quick question: is it possible to have 3D thread blocks in CUDA? I assume the largest 3D thread block with equal-sized dimensions allowed by CUDA's constraints is 8*8*8 = 512 threads. Is that correct, and is it good practice?
Also, can we have 3D arrays (i.e. a[x][y][z]) in device code, in both global and shared memory? How is memory coalescing possible with this layout?

Thanks.

3D thread blocks: yes

Good practice: no. Most kernels compile to use too many registers for a 512-thread block to even launch.
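If you want to check this for your own kernel rather than guess, something like the following should work. It is only a sketch: scaleKernel and blockFits are made-up names for illustration, and the only runtime call used is cudaFuncGetAttributes, which reports the register count and the largest block the kernel can be launched with.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, here only so there is something to inspect.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns true if a block of bx*by*bz threads can be launched for
// scaleKernel on the current device, given its compiled register usage.
// Returns false if the attributes cannot be queried (e.g. no device).
bool blockFits(unsigned bx, unsigned by, unsigned bz)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scaleKernel) != cudaSuccess)
        return false;

    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    return bx * by * bz <= static_cast<unsigned>(attr.maxThreadsPerBlock);
}
```

Calling blockFits(8, 8, 8) before the launch tells you whether the 512-thread cube is even an option for that kernel.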

Shared memory: yes

Global memory: yes, if you want a huge mess of pointer handling, with pointer management on both the device and the host.

Coalescing with that layout: impossible. Just reading the intermediate pointers of a from global memory is going to be uncoalesced.

Just use 1D arrays and calculate the memory location from 3D indices in the normal way. Then you have control over the coalescing of the threads.
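A rough sketch of what I mean, assuming a row-major volume where x varies fastest. The names flatIndex and addOne are invented for the example, and the grid is assumed to be 2D with each block looping over z.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Offset of element (x, y, z) in an nx * ny * nz volume stored as a
// 1D array, with x varying fastest (row-major).
__host__ __device__ inline size_t flatIndex(int x, int y, int z, int nx, int ny)
{
    return (static_cast<size_t>(z) * ny + y) * nx + x;
}

// Hypothetical kernel: a 3D block on a 2D grid adds 1 to every element.
__global__ void addOne(float *a, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    // Each thread strides over z, so any nz is covered by a fixed blockDim.z.
    for (int z = threadIdx.z; z < nz; z += blockDim.z)
        a[flatIndex(x, y, z, nx, ny)] += 1.0f;
}
```

Because threads within a block are numbered with threadIdx.x fastest, neighbouring threads touch neighbouring addresses as long as x is the innermost index. A block that is wide in x (say 32x4x4 rather than 8x8x8) is probably the better choice for coalescing.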

I get a configuration error when I try to launch a kernel with a thread block size of 4*4*4 and a grid size of 8*8*8.
When I change to a 2D block and grid, the error goes away.

Does CUDA really support 3D blocks and grids? I’ve heard from some people that it is not a working feature …
Any comments?

Oh, you’ve just heard down the grapevine that it doesn’t work. Just read the manual or look at the output of the SDK's deviceQuery sample; both are more reliable. The documented grid dimension limit is 65535x65535x1 (i.e. grids are 2D only), which is why your 8*8*8 grid fails to launch while the 3D block itself is fine.
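If you would rather query the limits in code than run deviceQuery, and you still want a logical 3D grid, a sketch along these lines may help. The helpers foldGrid, unfoldBlock and printLaunchLimits are my own names, not part of CUDA; the idea is simply to stack the z layers of the grid along y and undo that inside the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fold a logical gx * gy * gz grid into a 2D launch grid by stacking
// the z layers along y.
inline dim3 foldGrid(unsigned gx, unsigned gy, unsigned gz)
{
    return dim3(gx, gy * gz, 1);
}

// Recover the logical 3D block index from the folded 2D one.
// In a kernel: unfoldBlock(blockIdx.x, blockIdx.y, logicalGy).
__host__ __device__ inline uint3 unfoldBlock(unsigned bx, unsigned foldedY,
                                             unsigned gy)
{
    return make_uint3(bx, foldedY % gy, foldedY / gy);
}

// Print the block and grid limits that deviceQuery also reports.
void printLaunchLimits(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        printf("could not query device %d\n", device);
        return;
    }
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
}
```

For the case in question you would launch with foldGrid(8, 8, 8), i.e. an 8x64 grid, keep the dim3(4, 4, 4) block, and pass 8 as the logical y extent so the kernel can unfold its block index.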