grid size, block size

I'm a beginner with CUDA, and I have a problem with dimGrid, dimBlock, etc. For example, if I multiply two matrices, how do I decide the grid and block sizes? There are many examples on the net, but I can't understand the logic behind them. I can't tell what the difference is between one grid and more grids, or between one block and more blocks inside a grid. One more thing I can't understand: why are blocks necessary at all? What would happen if there were only threads inside a grid, without blocks?

About my problem: say I have two matrices of size 5x5 that I want to multiply. I think each pair of elements to be multiplied would be a thread, but when I try to decide the grid size and block size I get stuck. How should I design Dg and Db, and why?


Without having seen something like CUDA before, all these things about threads, blocks, and grids are confusing. Once you understand them, though, it's pretty neat. Let me try to help.

A thread, as you probably already understand, is the smallest unit of work that CUDA handles. Since CUDA is a massively parallel computing platform, you have to have lots of threads that execute the same instructions on different data. So one of the most important CUDA design decisions is figuring out how to split your task into threads. You have to understand threads in the context of blocks, however.

A block is a group of threads that can communicate and synchronize with each other. This is the most important aspect of a block.

This characteristic comes from the fact that a block runs on a single multiprocessor. So if you have a bunch of threads that periodically need data from other threads, they have to be put into the same block. The only way (as far as I know) to get one block to talk to another block is to end the kernel at some point, so that all the blocks write their data out to global memory; at the next kernel invocation, every block can then read the data written by every other block.
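A minimal sketch of that two-kernel pattern (the kernel names and the computation here are hypothetical, just to show the structure):

```cuda
// First kernel: each block computes and writes its results to global memory.
__global__ void stepOne(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;          // each block writes its own partial result
}   // kernel returns: every block's writes are now visible in global memory

// Second kernel: any block may now safely read what other blocks wrote.
__global__ void stepTwo(const float *data, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = data[i] + data[0];        // reads a value another block produced
}
```

The kernel boundary between the two launches is what guarantees the ordering; inside a single launch there is no such guarantee across blocks.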

So you're probably wondering: why can't everything be one gigantic block in which every thread can talk to every other thread?

There are several reasons why Nvidia didn’t choose to make things that way.

One of the main reasons is that different GPUs have different numbers of multiprocessors, but CUDA programs need to run smoothly on any and all of them. The only way to do that is to introduce the concept of a block, so that on a GPU with only a few multiprocessors, blocks are run serially (that is, one after another), whereas on a powerful GPU, many blocks can run in parallel.

The other reason probably comes down to the scale of the circuitry. Threads in a block can do a number of things that threads in different blocks can't.

For example, threads in a block can access very fast shared memory simultaneously. For multiple threads to access the same shared memory simultaneously, there has to be a certain circuitry overhead, which would become too large if the block became too large.
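As a hedged sketch of what that buys you (the kernel name and data layout here are made up): threads in one block can stage data in shared memory and synchronize with `__syncthreads()` before reading each other's values.

```cuda
// Hypothetical example: each block stages its slice of the input in shared
// memory, then every thread reads a value written by a neighboring thread.
__global__ void blockLocal(const float *in, float *out)
{
    __shared__ float tile[25];               // visible to this block only
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                         // wait until the whole block has written
    out[blockIdx.x * blockDim.x + t] = tile[(t + 1) % blockDim.x];
}
```

Threads in *different* blocks have no equivalent of `__syncthreads()` and no shared memory in common, which is exactly the limitation described above.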

Now for grids. A grid is a collection of blocks. It enables multiple blocks to execute in one kernel invocation.

So if you have a big parallel problem, you break it into blocks and arrange them into a grid.

Taking your 5x5 matrix multiply problem: if I were you, I would assign one thread to multiplying one row of the left matrix by one column of the right matrix. There will be 25 threads, because there are 25 combinations of rows and columns, so your block will contain 25 threads.

If you only want to multiply one pair of 5x5 matrices, you need only one block. If you want to multiply 100 pairs of 5x5 matrices, you should make a grid of 100 blocks, each block containing 25 threads.
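A sketch of that layout (the kernel name and argument layout are my own assumptions; A, B, and C hold the 100 input pairs and outputs back to back in device memory):

```cuda
#define N 5

// One block per matrix pair; each of the block's 25 threads computes
// one output element: the dot product of one row of A with one column of B.
__global__ void matMul5x5(const float *A, const float *B, float *C)
{
    int pair = blockIdx.x;                // which 5x5 pair this block handles
    int row  = threadIdx.y;               // 0..4
    int col  = threadIdx.x;               // 0..4

    const float *a = A + pair * N * N;    // this pair's left matrix
    const float *b = B + pair * N * N;    // this pair's right matrix

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += a[row * N + k] * b[k * N + col];

    C[pair * N * N + row * N + col] = sum;
}

// Launch for 100 pairs: a grid of 100 blocks, each block 5x5 threads.
// dim3 dimBlock(N, N);
// matMul5x5<<<100, dimBlock>>>(dA, dB, dC);
```

For a single pair you would just launch `matMul5x5<<<1, dimBlock>>>(...)` instead.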

I hope this explanation makes sense.


Thanks for your answer; it made a lot of sense. Now one more question about grids: for the same problem (5x5 matrix multiplication), when would I need a second grid?

For example, in the programming guide, for matrix addition:

// Kernel invocation
dim3 dimBlock(16, 16);
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
             (N + dimBlock.y - 1) / dimBlock.y);
matAdd<<<dimGrid, dimBlock>>>(A, B, C);

And again, for the same problem:

// Kernel invocation
dim3 dimBlock(N, N);
matAdd<<<1, dimBlock>>>(A, B, C);

They are the same problem but with different grid dimensions. Why do we need a bigger grid for this problem?

Thanks again for your answer.