# How can I calculate blocks per grid?

Suppose I have a GPU that allows up to `MAX_THREAD` threads per block.

Also, suppose it allows up to `MAX_BLOCK_DIM` blocks per grid in each of the grid dimensions x, y, and z.

If `MAX_THREAD` = 1024, and if `dim3 threads_per_block` is set to [32, 8, 4], as 32*8*4=1024, how can I calculate each dimension of `dim3 blocks_per_grid` so that I can launch a kernel like the following?

``````
my_kernel<<<blocks_per_grid, threads_per_block>>>(... ... ...);
``````

For example,

``````
dim3 threads_per_block(x, y, z);
dim3 blocks_per_grid(xx, yy, zz);
``````

Can I calculate the values of `xx`, `yy`, and `zz` from `x`, `y`, and `z`, respectively?

If not, what is the proper way to do this?

Typically you would compute them as follows:

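``````
// extents of the problem (data) in each dimension
int dimx = ...;
int dimy = ...;
int dimz = ...;

dim3 block(32, 8, 4);
// round each quotient up so the grid covers every element
dim3 grid((dimx+block.x-1)/block.x, (dimy+block.y-1)/block.y, (dimz+block.z-1)/block.z);
``````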

And this assumes:

1. `dimx`, `dimy`, and `dimz` may not be evenly divisible by `block.x`, `block.y`, and `block.z` respectively, so you want to launch a grid of blocks that is large enough to cover your dimensions.
2. Your kernel contains a thread-check, so the extra threads that fall outside those dimensions do no work:

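``````
__global__ void k(..., int dimx, int dimy, int dimz){
  // global index of this thread in each dimension
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int idy = blockIdx.y * blockDim.y + threadIdx.y;
  int idz = blockIdx.z * blockDim.z + threadIdx.z;
  if (idx < dimx && idy < dimy && idz < dimz){  // thread-check
    // body of kernel code
  }
}
``````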
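To see how much of a rounded-up grid is actually used, the sizing can be reproduced on the host. The following is a minimal sketch, not part of the CUDA API: `Dim3` is a hypothetical stand-in for `dim3` so the code builds with a plain C++ compiler, and `covering_grid` and `idle_threads` are illustrative names.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without the toolkit.
struct Dim3 { unsigned int x, y, z; };

// Grid that covers `data` with blocks of size `block`
// (round-up division on each axis).
inline Dim3 covering_grid(Dim3 data, Dim3 block) {
    assert(block.x > 0 && block.y > 0 && block.z > 0);
    return Dim3{ (data.x + block.x - 1) / block.x,
                 (data.y + block.y - 1) / block.y,
                 (data.z + block.z - 1) / block.z };
}

// Number of launched threads that would fail the thread-check and do no work.
inline unsigned long long idle_threads(Dim3 data, Dim3 block) {
    Dim3 grid = covering_grid(data, block);
    unsigned long long launched =
        1ULL * grid.x * block.x * grid.y * block.y * grid.z * block.z;
    unsigned long long useful = 1ULL * data.x * data.y * data.z;
    return launched - useful;
}
```

For a 100 x 20 x 10 problem with a 32 x 8 x 4 block, `covering_grid` gives a 4 x 3 x 3 grid, and `idle_threads` reports 16864 of the 36864 launched threads as masked off by the thread-check. When every dimension is an exact multiple of the block size, no threads are idle.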

The `(a+b-1)/b` expression is a general formula for integer round-up division of `a/b`. Division of positive integers in C++ truncates; the formula relies on that truncation but yields the smallest integer greater than or equal to the exact value of `a/b`. For example, with `a = 100` and `b = 32`, `(100+31)/32 = 131/32` truncates to 4, which is 100/32 = 3.125 rounded up. Work through a few more examples until it is clear.
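The round-up formula can be wrapped in a small helper so that each grid dimension is computed the same way. This is a minimal sketch; `ceil_div` is a hypothetical name, not a CUDA function, and it assumes `b > 0` and that `a + b - 1` does not overflow `unsigned int`.

```cpp
#include <cassert>

// Hypothetical helper (not part of the CUDA API): integer round-up division.
// Returns the smallest q such that q * b >= a.
// Assumes b > 0 and that a + b - 1 does not overflow unsigned int.
inline unsigned int ceil_div(unsigned int a, unsigned int b) {
    assert(b > 0);
    return (a + b - 1) / b;  // truncating division of the padded numerator
}
```

With this helper, `ceil_div(100, 32)` is 4, `ceil_div(96, 32)` is 3 (already an exact multiple, so nothing is added), and `ceil_div(1, 32)` is 1. In host code it could be used as `dim3 grid(ceil_div(dimx, block.x), ceil_div(dimy, block.y), ceil_div(dimz, block.z));`.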

This online course covers these and other CUDA basics.


`dimx`, `dimy`, `dimz` seem to be the dimensions of the data structure.

What if my data structure is in 2D, but I want to use all the dimensions in the grid?

I.e., say, I want to run a matrix multiplication using all three grid dimensions.