"3D" grids Is there a standard method?

With the 2D limitation for grid layout, is there a standard method people have come up with for tiling 3D grids?

I’ve been playing around with grids of dimension [x y*z] and decomposing that second linear index, but I’ve run into trouble with dimensions of sizes that don’t divide cleanly, e.g. [32 32 17] over a [4 4 4] thread block or [32 17 23] over [4 4 4]. For dimensions of such arbitrary size, people typically calculate the grid using some sort of “divup” function: divide by the thread block size and add one more grid block if there was a remainder.

int divup(int a, int b)


    if (a % b)  /* does a divide b leaving a remainder? */

        return a / b + 1; /* add in additional block */


        return a / b; /* divides cleanly */


Anyone have a standard method they use?


My grids are 64 - 256 in the x and y directions and arbitrary in z so I do this:

x = threadIdx.x;

y = blockIdx.x;

z = blockIdx.y;

Here’s a solution that works. Any better techniques come to mind?

Here’s the kernel to write out each index:

static void __global__ ind2global(float *d_out, dim3 dims, int y_div)


    /* grid dimension is [w h*d] */

    int x = blockIdx.x * blockDim.x + threadIdx.x;

    int y = (blockIdx.y % y_div) * blockDim.y + threadIdx.y;

    int z = (blockIdx.y / y_div) * blockDim.z + threadIdx.z;

   if (x >= dims.x || y >= dims.y || z >= dims.z) return;

   int ind = z * dims.y * dims.x + x * dims.y + y;

    d_out[ind] = ind;


Here’s the general launch setup using the divup() function from above:

dim3 grid(divup(dims.x, BLOCK_LEN), divup(dims.y*dims.z,BLOCK_LEN*BLOCK_LEN));


int y_div = divup(dims.y, BLOCK_LEN);

ind2global<<<grid,threads>>>(d_out, dims, y_div);

Correction, that code I posted does not work in the corner cases where the remainders come into play, e.g. [4 9 5] won’t index the last block. Correcting the y-grid computation fixed that:

dim3 grid(divup(dims.x, BLOCK_LEN), divup(dims.y,BLOCK_LEN)*divup(dims.z,BLOCK_LEN));


You probably want to avoid the integer division/modulo operations if you can help it.

Yeah, try to use just bitwise-AND. Pad your grid’s y dimension to a Po2. It should be little overhead to start a pad block and immediately return out of each of its threads.

EDIT: obviously, this makes sense if you’re going to be calculating your dimensions frequently. You might do this to save a few registers or because you have many device functions and don’t want to pass around parameters needlessly. Then again, you can still just store the block’s coordinates in smem and then offset by threadIdx on the spot whenever you need to. Yeah, that’d be best.

Uh, I can’t imagine anyone actually uses that code, unless they want to avoid an integer overflow issue (which I doubt you will have).

The standard formula for this, at least for constant b, is “(a + b - 1) / b”.

If you have to use above code, “return a / b + !!(a % b)” will probably create better code with many compilers, though that may not be worth the obfuscation.