Threads, Blocks & Grid in CUDA

Hi All,

How are threads divided into blocks and grids, and how do I use these threads in a program's instructions?

For example, I have an array of 100 integers and I want to add 2 to each element.
This add function could be the CUDA kernel.

My understanding is that this kernel has to be launched with 100 threads, each handling one element.

How do I assign each array index to a CUDA thread?

The kernel instruction will be something like this (as seen in the documentation):

index = threadIdx.x + blockDim.x * blockIdx.x …

Can someone explain, in practical terms, how the 100 elements get assigned to threads and blocks?



CUDA C Programming Guide
CUDA C Best Practices Guide


Each thread gets a unique index. To process all the elements, you would launch something like this:

vector_add<<<1,100>>>

This would mean one block with 100 threads.

Or you could launch

vector_add<<<100,1>>>

This would be 100 blocks with 1 thread each.

In practice you launch it like this:

vector_add<<<N1,N2>>> with N1*N2 >= 100, and you tune N1 (the number of blocks) and N2 (the threads per block) for speed and efficiency.
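To make the launch above concrete, here is a minimal sketch of a complete program. The kernel name vector_add comes from this thread; the host helper add_two_on_gpu and the parameter threads_per_block are hypothetical names chosen for illustration, and CUDA error checking is left out for brevity. N1 is computed by rounding up, so N1*N2 may exceed the array length, which is why the kernel checks the index before touching memory.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds 2 to one element; threads whose index falls
// past the end of the array do nothing.
__global__ void vector_add(int *data, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n) {
        data[index] += 2;
    }
}

// Hypothetical host helper (not from the original post): copies the input
// to the GPU, launches the kernel, and returns the result.
std::vector<int> add_two_on_gpu(std::vector<int> host, int threads_per_block)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // N1 = number of blocks, rounded up so that N1 * N2 >= n.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    vector_add<<<blocks, threads_per_block>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main()
{
    std::vector<int> input(100);
    for (int i = 0; i < 100; ++i) input[i] = i;

    std::vector<int> out = add_two_on_gpu(input, 32);
    assert(out[0] == 2 && out[99] == 101);
    printf("out[0] = %d, out[99] = %d\n", out[0], out[99]);
    return 0;
}
```

Calling add_two_on_gpu(input, 100) would correspond to the one-block, 100-thread launch, and add_two_on_gpu(input, 1) to the 100-block, one-thread launch; the result should be the same in every case, only the speed differs.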

I suggest you read the book CUDA by Example. It is a very good book that takes 2-3 days to get through, and it explains all of this very well.


I think the fundamental flaw in your understanding is assuming a coupling between the data and the threads. Each grid consists of a number of blocks (arranged in up to 3 dimensions). Each block in turn is composed of threads (also arranged in up to 3 dimensions). You have to specify the grid and block dimensions when you invoke your kernel. As a word of advice, aim for block sizes that are multiples of 32. You might not care why yet, but it is because threads execute in warps, groups of 32 threads that the hardware schedules together. Inside your kernel, each thread executes your function and is independent of any data you pass in to it (i.e. there is no tight coupling between threads and data). Each thread can work out its own identity from its threadIdx and blockIdx values, and your function logic can use that value if it needs to.
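As an illustration of how thread identities map onto the 100 elements, here is a small sketch in which every thread simply writes down its own blockIdx.x and threadIdx.x at the global index it computes. The names record_ids and map_elements are made up for this example, the block size of 32 is just one possible choice, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread records which block it belongs to and its position
// inside that block, at the global index it is responsible for.
__global__ void record_ids(int *block_of, int *thread_of, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n) {
        block_of[index] = blockIdx.x;
        thread_of[index] = threadIdx.x;
    }
}

// Hypothetical host helper: assumes n > 0 and threads_per_block > 0.
void map_elements(int n, int threads_per_block,
                  std::vector<int> &block_of, std::vector<int> &thread_of)
{
    block_of.assign(n, -1);
    thread_of.assign(n, -1);

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *dev_block = nullptr;
    int *dev_thread = nullptr;
    cudaMalloc(&dev_block, bytes);
    cudaMalloc(&dev_thread, bytes);

    // Round the block count up so every element is covered.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    record_ids<<<blocks, threads_per_block>>>(dev_block, dev_thread, n);

    cudaMemcpy(block_of.data(), dev_block, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(thread_of.data(), dev_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_block);
    cudaFree(dev_thread);
}

int main()
{
    std::vector<int> block_of, thread_of;
    map_elements(100, 32, block_of, thread_of);

    const int samples[] = {0, 31, 32, 99};
    for (int i : samples) {
        printf("element %d -> block %d, thread %d\n",
               i, block_of[i], thread_of[i]);
    }
    assert(block_of[99] == 3 && thread_of[99] == 3);
    return 0;
}
```

With 32 threads per block, 100 elements need 4 blocks, so 128 threads are launched and the last 28 fail the index check and do nothing. Element 99 lands on thread 3 of block 3, since 3 * 32 + 3 = 99.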

Also, there is no requirement to have 1 thread per array element. Depending on your application (and yours does very little arithmetic per element), it might be more efficient NOT to have 1 thread per element.
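One common way to let a thread handle more than one element is a grid-stride loop. This is a technique I am bringing in for illustration, not something stated elsewhere in this thread. In the sketch below, the names vector_add_strided and add_two_strided are hypothetical, and error checking is omitted. Whether this is actually faster than one thread per element depends on the GPU and the problem size, so treat it as something to measure rather than a rule.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then
// jumps ahead by the total number of threads in the grid, so any
// launch configuration covers the whole array.
__global__ void vector_add_strided(int *data, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += stride) {
        data[i] += 2;
    }
}

// Hypothetical host helper: assumes blocks > 0 and threads_per_block > 0.
std::vector<int> add_two_strided(std::vector<int> host,
                                 int blocks, int threads_per_block)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    vector_add_strided<<<blocks, threads_per_block>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main()
{
    std::vector<int> input(100);
    for (int i = 0; i < 100; ++i) input[i] = i;

    // One block of 32 threads handles all 100 elements.
    std::vector<int> out = add_two_strided(input, 1, 32);
    for (int i = 0; i < 100; ++i) assert(out[i] == i + 2);
    printf("out[0] = %d, out[99] = %d\n", out[0], out[99]);
    return 0;
}
```

Here a single block of 32 threads covers the 100 elements: threads 0 to 3 each process 4 elements (for example, thread 0 handles indices 0, 32, 64 and 96) and threads 4 to 31 each process 3.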