How can I calculate blocks per grid?

Suppose I have a GPU that allows at most MAX_THREAD threads per block.

Also, suppose it allows at most MAX_BLOCK_DIM blocks per grid in each of the grid dimensions x, y, and z.

If MAX_THREAD = 1024, and if dim3 threads_per_block is set to [32, 8, 4], as 32*8*4=1024, how can I calculate each dimension of dim3 blocks_per_grid so that I can launch a kernel like the following?

my_kernel<<<blocks_per_grid, threads_per_block>>>(... ... ...);

For example,

dim3 threads_per_block(x, y, z);
dim3 blocks_per_grid(xx, yy, zz);

Can I calculate the values of xx, yy, and zz from x, y, and z, respectively?

If not, what is the proper way to do this?

Typically you would compute them as follows:

int dimx = ...;
int dimy = ...;
int dimz = ...;

dim3 block(32, 8, 4);
dim3 grid((dimx+block.x-1)/block.x, (dimy+block.y-1)/block.y, (dimz+block.z-1)/block.z);

And this assumes:

  1. dimx, dimy, and dimz may not be evenly divisible by block.x, block.y, and block.z respectively, so you want to launch a grid of blocks that is rounded up to be large enough to cover your dimensions.

  2. In your kernel you have an appropriate thread-check such as:

    __global__ void k(..., int dimx, int dimy, int dimz){
      // global index of this thread along each dimension
      int idx = threadIdx.x+blockDim.x*blockIdx.x;
      int idy = threadIdx.y+blockDim.y*blockIdx.y;
      int idz = threadIdx.z+blockDim.z*blockIdx.z;
      if (idx < dimx && idy < dimy && idz < dimz){  // thread-check: skip threads outside the data
        // body of kernel code
      }
    }
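
To check the grid sizing and the thread-check without a GPU, here is a minimal host-only C++ sketch of the same arithmetic. The names Dim3, make_grid, and active_along_axis are hypothetical and introduced only for illustration (Dim3 stands in for CUDA's dim3 so the code compiles without the CUDA toolkit). It is not a replacement for the real launch code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3 (assumption: only x, y, z are needed here).
struct Dim3 {
    unsigned x, y, z;
};

// Grid of blocks large enough to cover dimx * dimy * dimz elements,
// using the same (dim + block - 1) / block formula as the answer above.
inline Dim3 make_grid(unsigned dimx, unsigned dimy, unsigned dimz, Dim3 block) {
    assert(block.x > 0 && block.y > 0 && block.z > 0);
    return Dim3{(dimx + block.x - 1) / block.x,
                (dimy + block.y - 1) / block.y,
                (dimz + block.z - 1) / block.z};
}

// Counts the global indices along one axis that pass the `id < dim` thread-check.
// The index has the same form as threadIdx + blockDim * blockIdx in the kernel.
inline std::size_t active_along_axis(unsigned blocks, unsigned threads, unsigned dim) {
    std::size_t active = 0;
    for (unsigned b = 0; b < blocks; ++b) {
        for (unsigned t = 0; t < threads; ++t) {
            if (t + threads * b < dim) {
                ++active;
            }
        }
    }
    return active;
}
```

For example, with assumed sizes dimx = 100, dimy = 20, dimz = 10 and a block of (32, 8, 4), make_grid gives a grid of (4, 3, 3). That is 36 blocks and 36864 launched threads, of which 100 * 20 * 10 = 20000 pass the thread-check. The remaining threads are the padding that the thread-check exists to skip.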

(a+b-1)/b is a general formula for integer round-up division of a/b. Work through a few examples until it makes sense. Remember that division of positive integers in C++ truncates. The formula relies on this truncation, but yields the smallest integer greater than or equal to the actual value of a/b.
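
As a worked example of the round-up formula, here is a minimal C++ sketch. The helper names ceil_div and ceil_div_no_overflow are hypothetical, and both assume a >= 0 and b > 0. The second variant is an alternative that is not part of the answer above: it avoids the intermediate sum a + b - 1, which can overflow when a is close to the largest representable int.

```cpp
#include <cassert>

// Round-up integer division using the (a + b - 1) / b formula.
// Assumes a >= 0 and b > 0, and that a + b - 1 does not overflow.
inline int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Alternative with the same result for a >= 0, b > 0, but no intermediate sum:
// truncating division, plus one if there is a remainder.
inline int ceil_div_no_overflow(int a, int b) {
    assert(a >= 0 && b > 0);
    return a / b + (a % b != 0 ? 1 : 0);
}
```

With a = 10 and b = 4, plain division truncates to 10/4 = 2, while the formula gives (10+4-1)/4 = 13/4 = 3. With a = 8 and b = 4, where the division is exact, the formula gives (8+4-1)/4 = 11/4 = 2, so no extra block is added.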

This online course covers these and other CUDA basics.

dimx, dimy, dimz seem to be the dimensions of the data structure.

What if my data structure is 2D, but I want to use all three dimensions of the grid?

For example, say I want to run a matrix multiplication using all three grid dimensions.