I think the fundamental flaw in your understanding is assuming a coupling between the data and the threads. A grid consists of a number of blocks (which can have up to 3 dimensions), and each block is in turn composed of threads (which can also have up to 3 dimensions). You specify the grid and block dimensions when you launch your kernel. As a word of advice, aim for block sizes that are multiples of 32; you might not care why yet, but it's because warps (the smallest unit of threads the hardware schedules to execute your code) are groups of 32 threads. Inside your kernel, each thread executes your function independently of the data you pass in (i.e. there's no tight coupling between threads and data). Each thread can work out its own identity from its threadIdx and blockIdx values and use that in your function logic (if it needs it).
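To make that concrete, here's a minimal sketch (the kernel name, array, and sizes are my own illustrative choices, not anything from your code): each thread derives one global index from its blockIdx/threadIdx, and the launch configuration is chosen to cover the array.

```cuda
__global__ void scale(float *d_data, int n, float factor)
{
    // Each thread computes a unique global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the grid is rounded up, so guard against overrun
        d_data[i] *= factor;
}

void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;  // a multiple of 32, per the warp advice above
    // Round the grid size up so blocks * threads covers all n elements.
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, 2.0f);
}
```

Note the kernel itself never assumes any particular grid shape; the mapping from threads to data lives entirely in how *you* compute `i`.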
Also, there's no mandate that you need 1 thread per array element. Depending on your application (and yours has very low arithmetic intensity, i.e. very little computation per byte of memory traffic), it might be more efficient NOT to use 1 thread per element.
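The usual way to do that is a grid-stride loop: launch fewer threads than elements and have each thread walk the array with a stride equal to the total number of threads in the grid. A sketch (names like `numSMs` are hypothetical placeholders):

```cuda
__global__ void scale(float *d_data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;  // total threads launched
    // Each thread handles elements i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        d_data[i] *= factor;
}

// The launch configuration is now decoupled from n: you might size the
// grid to the hardware (e.g. some multiple of the SM count) instead.
// scale<<<numSMs * 4, 256>>>(d_data, n, 2.0f);
```

This also has the nice property that the same kernel works correctly for any n and any launch configuration, which makes it easy to tune block/grid sizes later.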