With the 2D limitation for grid layout, is there a standard method people have come up with for tiling 3D grids?

I’ve been playing around with grids of dimension [x y*z] and decomposing that second linear index, but I’ve run into trouble with dimensions of sizes that don’t divide cleanly, e.g. [32 32 17] over a [4 4 4] thread block or [32 17 23] over [4 4 4]. For dimensions of such arbitrary size, people typically calculate the grid using some sort of “divup” function: divide by the thread block size and add one more grid block if there was a remainder.

int divup(int a, int b)
{
    if (a % b)            /* does dividing a by b leave a remainder? */
        return a / b + 1; /* add one more block for the partial tile */
    else
        return a / b;     /* b divides a cleanly */
}

Correction: the code I posted does not work in the corner cases where remainders come into play, e.g. [4 9 5] won’t index the last block. Correcting the y-grid computation fixed that:
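The corrected code is not shown here. As a rough sketch only, and not the poster's actual fix, this is one way the y-grid computation could look: round each dimension up separately with divup, fold y and z into the grid's second dimension, then decompose that index and skip out-of-range threads. It is written as plain C that emulates the block and thread loops on the CPU so the coverage can be checked; the names covers_volume, block_idx_x and block_idx_y are made up for illustration (in a kernel they would be blockIdx.x and blockIdx.y).

```c
#include <stdlib.h>

/* Ceiling division for non-negative a and positive b. */
static int divup(int a, int b)
{
    return (a + b - 1) / b;
}

/* Hypothetical helper: emulates a launch over a 2D grid of size
   [gx, gy*gz] with thread blocks of size [bx by bz].
   Returns 1 if every element of the nx*ny*nz volume is visited
   exactly once, 0 otherwise. */
int covers_volume(int nx, int ny, int nz, int bx, int by, int bz)
{
    int gx = divup(nx, bx);  /* blocks along x */
    int gy = divup(ny, by);  /* blocks along y, rounded up on its own */
    int gz = divup(nz, bz);  /* blocks along z, rounded up on its own */
    size_t total = (size_t)nx * ny * nz;
    unsigned char *hits = calloc(total, 1);
    int ok = 1;

    if (!hits)
        return 0;

    for (int block_idx_y = 0; block_idx_y < gy * gz; block_idx_y++) {
        /* Decompose the folded second grid index. */
        int block_y = block_idx_y % gy;
        int block_z = block_idx_y / gy;
        for (int block_idx_x = 0; block_idx_x < gx; block_idx_x++) {
            for (int tz = 0; tz < bz; tz++)
            for (int ty = 0; ty < by; ty++)
            for (int tx = 0; tx < bx; tx++) {
                int x = block_idx_x * bx + tx;
                int y = block_y * by + ty;
                int z = block_z * bz + tz;
                /* Threads in a partial tile fall outside the volume. */
                if (x >= nx || y >= ny || z >= nz)
                    continue;
                hits[((size_t)z * ny + y) * nx + x]++;
            }
        }
    }

    for (size_t i = 0; i < total; i++)
        if (hits[i] != 1)
            ok = 0;

    free(hits);
    return ok;
}
```

The point of the sketch is that the grid's second dimension is divup(ny, by) * divup(nz, bz), not divup(ny * nz, by * bz); rounding the product instead of each factor is one way to lose the last block.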

Yeah, try to use just a bitwise AND instead of division and modulo. Pad your grid’s y dimension to a power of two. Launching a pad block and immediately returning out of each of its threads should cost little.
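A minimal sketch of that padding idea, in plain C so it can be checked on the host. It assumes the number of y blocks is rounded up to a power of two (gy_pad) and that its log2 (shift) is computed once on the host; the names next_pow2, BlockYZ and split_block_y are hypothetical.

```c
/* Smallest power of two >= v, valid for 1 <= v <= 2^31. */
unsigned next_pow2(unsigned v)
{
    unsigned p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

typedef struct {
    unsigned y; /* block coordinate along y (may land in the padding) */
    unsigned z; /* block coordinate along z */
} BlockYZ;

/* Split the folded second grid index into (y, z) block coordinates.
   gy_pad must be a power of two and shift must equal log2(gy_pad). */
BlockYZ split_block_y(unsigned block_idx_y, unsigned gy_pad, unsigned shift)
{
    BlockYZ b;
    b.y = block_idx_y & (gy_pad - 1); /* replaces block_idx_y % gy_pad */
    b.z = block_idx_y >> shift;       /* replaces block_idx_y / gy_pad */
    return b;
}
```

With 5 real blocks along y, gy_pad is 8 and shift is 3, so the grid's second dimension becomes 8 * gz. Any block whose decoded y is 5, 6 or 7 is a pad block and would return right away.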

EDIT: obviously, this makes sense if you’re going to be calculating your dimensions frequently. You might do this to save a few registers or because you have many device functions and don’t want to pass around parameters needlessly. Then again, you can still just store the block’s coordinates in shared memory (smem) and then offset by threadIdx on the spot whenever you need to. Yeah, that’d be best.

Uh, I can’t imagine anyone actually uses that code, unless they want to avoid an integer overflow issue (which I doubt you will have).

The standard formula for this, at least for constant b, is “(a + b - 1) / b”.

If you have to use the above code, “return a / b + !!(a % b)” will probably generate better code with many compilers, though that may not be worth the obfuscation.
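For comparison, here are both variants side by side as a small sketch, assuming a is non-negative and b is positive. The function names divup_add and divup_branchless are made up for illustration.

```c
/* Standard form: add b - 1 before dividing.
   a + b - 1 can overflow int when a is close to INT_MAX. */
int divup_add(int a, int b)
{
    return (a + b - 1) / b;
}

/* Branch-free form of the original function.
   !!(a % b) is 1 when there is a remainder and 0 otherwise,
   and nothing is added to a, so it cannot overflow. */
int divup_branchless(int a, int b)
{
    return a / b + !!(a % b);
}
```

For the sizes in this thread both give the same block counts, e.g. 17 over 4 gives 5 and 32 over 4 gives 8.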