Block size and grid size


I just started learning CUDA and I’m confused about block and grid sizes. “Standard” C programming with pthreads or fork() is not very difficult for me, but I don’t really understand the CUDA architecture. So I have a few questions…

  1. I don’t understand how to choose the right block size, the right grid size and, of course, the right number of threads. Does this depend on the hardware, the application, or both?
  2. Can GPU threads be thought of as CPU threads? Or is the number of blocks more like the number of CPU threads?
  3. Do you have any information, papers or links comparing the CPU approach with the GPU approach?

My card is a GeForce 9300M GS 512MB.

I don’t have much of a mathematics background… :o)

(Yes, these are newbie questions =) )

Block and grid sizes depend partly on your algorithm and partly on hardware limits, such as the maximum number of threads per block and the registers and shared memory available on each multiprocessor. Beyond that, you want to launch enough blocks and threads to keep the GPU fully loaded.
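To make the hardware side of this concrete, here is a minimal sketch that asks the CUDA runtime for the limits of the installed card. It assumes device 0 is the GPU of interest, and the function name `print_launch_limits` is made up for illustration; the fields it reads are those of the runtime's `cudaDeviceProp` structure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the launch-related limits of device 0.
// Returns false if the runtime cannot query the device.
bool print_launch_limits()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }

    std::printf("Device:                  %s\n", prop.name);
    std::printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    std::printf("Warp size:               %d\n", prop.warpSize);
    std::printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("Max block dimensions:    %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("Max grid dimensions:     %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Registers per block:     %d\n", prop.regsPerBlock);
    return true;
}
```

Calling `print_launch_limits()` from `main` shows the upper bounds any block and grid size has to respect on that particular card; the algorithm then decides how to use them.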

A GPU and an x86 CPU are completely different architectures, so CPU threads aren’t really comparable to GPU threads.

Refer to the CUDA Programming Guide and the CUDA Technical Training.



I’m starting to understand the CUDA logic… but I have a question:

Let’s assume we want to process a 16x48 matrix (say, add 1 to each element), so the matrix has 768 elements. I also know, according to the CUDA Programming Guide, that a block can hold at most 512 threads and that the blocks of a grid are distributed across the multiprocessors (MPs).

So if I have only 1 MP, there is nothing to think about; I launch my kernel as follows:

func <<< 1, dim3(16,48) >>>(...)

However, if I have a card with 30 MPs, do I need to launch:

func <<<1, dim3(16,32) >>> (..)

for better performance? By better performance I mean a shorter run time. (I’m thinking of making blocks of 512 threads each.)

Am I right? (-:


A block is executed by exactly one MP, but that does not mean an MP can only process one block.


<<<dimGrid, dimBlock>>> describes the whole set of threads for the launch, which is then distributed among all MPs. It is not a per-MP grid.
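As a rough illustration of that point, here is a sketch for the 16x48 matrix from the question. A single 16x48 block would need 768 threads, which is above the 512-thread limit quoted earlier, so this version splits the work into three 16x16 blocks and describes the whole matrix with one grid. The kernel name `add_one_2d`, the helper names and the row-major layout are assumptions made for the example, not something from the original posts.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1 to one element of a row-major matrix.
__global__ void add_one_2d(float *m, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row handled by this thread
    if (x < width && y < height)                    // ignore threads outside the matrix
        m[y * width + x] += 1.0f;
}

// Small helper that reports a failed runtime call.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Runs the kernel on a zero-filled 16x48 matrix and checks every element became 1.
bool run_add_one_2d()
{
    const int width = 16, height = 48;              // 768 elements in total
    std::vector<float> host(width * height, 0.0f);
    const size_t bytes = host.size() * sizeof(float);

    float *dev = nullptr;
    if (!ok(cudaMalloc(&dev, bytes), "cudaMalloc"))
        return false;

    bool good = ok(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice),
                   "copy to device");
    if (good) {
        dim3 dimBlock(16, 16);                      // 256 threads per block
        dim3 dimGrid((width  + dimBlock.x - 1) / dimBlock.x,   // 1 block across
                     (height + dimBlock.y - 1) / dimBlock.y);  // 3 blocks down
        add_one_2d<<<dimGrid, dimBlock>>>(dev, width, height);
        good = ok(cudaGetLastError(), "kernel launch")
            && ok(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost),
                  "copy to host");
    }
    cudaFree(dev);

    for (float v : host)
        if (v != 1.0f)
            good = false;
    return good;
}
```

The launch is identical whether the card has 1 MP or 30: the hardware decides which MP runs which of the three blocks, and the code never mentions the MP count.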



OK, OK, OK… if I launch a kernel like this:

dim3 dimBlock(16,48);

func <<< 1, dimBlock >>> (....);

I’m creating 1 grid (one by one) containing X blocks of 16x48x1 threads, right? Where X is determined by… I don’t know =) How can I find out? Does it depend on the data to be processed?

I mean: if I want to process a 768-element array and I launch the kernel as follows:

dim3 dimBlock(16,32);

func <<< 1, dimBlock >>> (myArray);

does it create 2 blocks? (16x32 = 512 => 2x512 = 1024 > 768 => 2 blocks)

Help! =)

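You should think about it in 1D first…

Suppose you want to process 768 elements, each with its own thread. The first launch parameter is the number of blocks, and CUDA launches exactly that many; it never adds blocks to cover your data. So <<< 1, dimBlock >>> with a 16x32 block gives you one block of 512 threads, and the remaining 256 elements are not processed.

You could instead use 3 blocks of 256 threads each, something like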

int numBlocks = 3;

int numThreadsPerBlock = 256;

dim3 dimGrid(numBlocks);

dim3 dimBlock(numThreadsPerBlock);

func <<< dimGrid, dimBlock >>> (myArray);
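For completeness, here is a minimal sketch of what the kernel side of such a 1D launch might look like. The names `add_one`, `blocks_for` and `run_add_one` are hypothetical stand-ins (the thread only calls the kernel `func`), and the block count is computed with a rounded-up division instead of being hard-coded, which is an addition to the snippet above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element, adds 1 to it.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // threads past the end do nothing
        data[i] += 1.0f;
}

// Number of blocks needed to cover n elements, rounded up.
int blocks_for(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Runs add_one on n zero-filled elements and checks that each one became 1.
bool run_add_one(int n, int threadsPerBlock)
{
    std::vector<float> host(n, 0.0f);
    const size_t bytes = host.size() * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return false;

    bool good = cudaMemcpy(dev, host.data(), bytes,
                           cudaMemcpyHostToDevice) == cudaSuccess;
    if (good) {
        dim3 dimGrid(blocks_for(n, threadsPerBlock));
        dim3 dimBlock(threadsPerBlock);
        add_one<<<dimGrid, dimBlock>>>(dev, n);
        good = cudaGetLastError() == cudaSuccess
            && cudaMemcpy(host.data(), dev, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);

    for (float v : host)
        if (v != 1.0f)
            good = false;
    return good;
}
```

With these helpers, `run_add_one(768, 256)` launches 3 blocks of 256 threads, matching the numbers above. `run_add_one(768, 512)` launches 2 blocks of 512 threads, and the `i < n` guard keeps the last 256 threads of the second block from writing past the end of the array.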

Once you have an understanding of kernel execution parameters, GPU architecture and CUDA as a whole, then I’d suggest looking into changing your block/grid dimensions…