Suppose I have a GPU that allows MAX_THREAD threads per block and MAX_BLOCK_DIM blocks per grid in each of the grid dimensions x, y, and z.
If MAX_THREAD = 1024 and dim3 threads_per_block is set to [32, 8, 4] (since 32*8*4 = 1024), how can I calculate each dimension of dim3 blocks_per_grid so that I can launch a kernel like the following?
my_kernel<<<blocks_per_grid, threads_per_block>>>(... ... ...);
For example,
dim3 threads_per_block(x, y, z);
dim3 blocks_per_grid(xx, yy, zz);
Can I calculate the values of xx, yy, and zz from x, y, and z, respectively?
If not, what is the proper way to do this?
Typically you would compute them as follows:
int dimx = ...;
int dimy = ...;
int dimz = ...;
dim3 block(32, 8, 4);
dim3 grid((dimx+block.x-1)/block.x, (dimy+block.y-1)/block.y, (dimz+block.z-1)/block.z);
And this assumes:

dimx, dimy, and dimz may not be whole-number divisible by block.x, block.y, and block.z, respectively. Therefore it is assumed that you want to launch a grid of blocks that is large enough to cover your dimensions.

In your kernel you have an appropriate thread check such as:
__global__ void k(..., int dimx, int dimy, int dimz){
  // global index of this thread along each axis
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;
  int idz = threadIdx.z+blockDim.z*blockIdx.z;
  ...
  if (idx < dimx && idy < dimy && idz < dimz){ // thread check
    // body of kernel code
  }
}
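To see why the thread check matters, here is a minimal host-only C++ sketch that emulates the index arithmetic on the CPU. It is an illustration only, not CUDA code: Dim3, make_grid, and count_in_range are hypothetical names introduced here, with Dim3 standing in for CUDA's dim3. It counts how many of the launched threads would pass the check.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 (assumption: illustration only).
struct Dim3 { unsigned int x, y, z; };

// Grid size that covers the data, using round-up division on each axis.
Dim3 make_grid(Dim3 data, Dim3 block) {
    assert(block.x > 0 && block.y > 0 && block.z > 0);
    return Dim3{ (data.x + block.x - 1) / block.x,
                 (data.y + block.y - 1) / block.y,
                 (data.z + block.z - 1) / block.z };
}

// Emulates every (blockIdx, threadIdx) pair of the launch on the CPU and
// counts the threads that pass the kernel's thread check.
unsigned long long count_in_range(Dim3 data, Dim3 block, Dim3 grid) {
    unsigned long long passed = 0;
    for (unsigned int bz = 0; bz < grid.z; ++bz)
    for (unsigned int by = 0; by < grid.y; ++by)
    for (unsigned int bx = 0; bx < grid.x; ++bx)
    for (unsigned int tz = 0; tz < block.z; ++tz)
    for (unsigned int ty = 0; ty < block.y; ++ty)
    for (unsigned int tx = 0; tx < block.x; ++tx) {
        // Same arithmetic as threadIdx + blockDim * blockIdx in the kernel.
        unsigned int idx = tx + block.x * bx;
        unsigned int idy = ty + block.y * by;
        unsigned int idz = tz + block.z * bz;
        if (idx < data.x && idy < data.y && idz < data.z) ++passed;
    }
    return passed;
}
```

For example, with data dimensions 100 x 20 x 10 and a 32 x 8 x 4 block, make_grid gives a 4 x 3 x 3 grid. That launch contains 128*24*12 = 36864 threads, but only 100*20*10 = 20000 of them pass the check. The rest fall outside the data and must do nothing.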
The (a+b-1)/b expression is a general formula for integer round-up division of a/b. Just work through examples until you understand it. Remember that division of positive integers in C++ truncates (rounds down). This formula relies on that truncation but yields the next integer greater than or equal to the actual value of a/b.
This online course covers these and other CUDA basics.
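A few worked cases may help with the round-up formula. The helper below is a small sketch; div_up is a name chosen here for illustration, and it assumes b > 0 and that a + b - 1 does not overflow.

```cpp
#include <cassert>

// Round-up integer division: the smallest integer >= a/b.
// Assumes b > 0 and that a + b - 1 fits in unsigned int.
unsigned int div_up(unsigned int a, unsigned int b) {
    assert(b > 0);
    return (a + b - 1) / b;  // truncating division of the padded numerator
}
```

Working through it by hand: div_up(100, 32) is 131/32 = 4, and 4 blocks of 32 cover 100 elements. div_up(64, 32) is 95/32 = 2, so an exact multiple does not get an extra block. div_up(65, 32) is 96/32 = 3, because a single leftover element still needs one more block.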
dimx, dimy, and dimz seem to be the dimensions of the data structure.
What if my data structure is in 2D, but I want to use all the dimensions in the grid?
For example, say I want to run a matrix multiplication using all three grid dimensions.