How exactly is a thread executed on a GPU?

Hey boys,

I have some problems understanding how CUDA works with threads.
Here is a picture of how I see things.

[url]http://i63.tinypic.com/i1grnp.png[/url]

Can somebody explain to me how exactly it works?
I don’t know if I am right, but here is how I see things…
Each grid is executed by an SM; after that, the warp scheduler takes the grid and divides it into warps of 32 threads, and each warp is executed one at a time.
I don’t know what exactly the SP does… does each SP execute a single thread from a warp?
Is it possible to execute more warps at a time?

Sorry for my English and bad explanation. I am new to CUDA and I try to learn as much as I can.
It would be great if someone could explain what happens when a grid is executed by an SM, from start to end.
Thank you very much for your help.

A better diagram is in the CUDA programming guide.

Think of unlayering each level of abstraction… they’re easier to understand separately.

You launch a kernel with a grid of blocks of work. The GPU executes your code on all the blocks and returns.

How does the GPU execute the blocks of the grid? It gives one or more blocks to each SM and tells it to work. When the SM finishes a block, the GPU gives it another to work on. When all blocks are done, the GPU returns.

How does the SM execute a block? Each block consists of one or more warps. There may be more than one block resident, but the SM basically makes a big queue of all the warps from all of its blocks. It takes one warp from the queue every tick of the clock and executes it for one clock. The next tick, it executes the next warp in the queue (it doesn’t have to wait for the first warp to finish!), and on the next tick, another warp, and so on. Warps that finish their one tick of computation get put back onto the queue to wait for their chance to evaluate their next instruction. A warp can take many ticks of latency, even hundreds, to finish an instruction, especially if it’s waiting for memory. When all the warps from one block are done, the SM reports the block as finished and may get a new one.

How is a warp executed? A warp is 32 threads wide. The warp is executed for one instruction (well, it could be two from dual-issue, but ignore that). Say the instruction is a “C=A+B”. Then the 32 threads each read “A” and “B” from registers, and the 32 SPs are given those 32 A and B values, and do the add. So the SPs are “doing the work”… they all perform the same instruction on each thread’s data.
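To make that concrete, here is a minimal sketch (the names and sizes are just for illustration) of a vector-add kernel: every thread computes one element of C = A + B, and when a warp of 32 consecutive threads reaches the add instruction, the 32 SPs carry it out in lockstep.

// Minimal sketch: each thread owns one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        C[i] = A[i] + B[i];                         // the "C = A + B" instruction
}

// Launch: a grid of blocks, 256 threads (8 warps) per block.
// vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);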

Very nice explanation :D
Also, let’s say I have 192 SPs and each warp needs a single clock tick to execute.
Does that mean the other 160 SPs sit idle and don’t execute instructions? Will it use just 32 SPs? :)
Or is it possible to run more warps at once?

Opening up the abstraction a bit, the SM does not have one queue of warps, it has four queues. In Maxwell and Pascal each queue has its own 32 SPs. Each queue runs independently and schedules one of its warps to its own SPs in isolation. All 4 queues run concurrently so all 128 SPs are usually busy.

Kepler had a more complex system where pairs of queues could “share” an extra set of 32 SPs, so there were 192 SPs per SM. This was not as efficient as Maxwell/Pascal’s simpler SM architecture, mostly since there was not enough register bandwidth to feed all the SPs 3-argument operations (like FMA) every clock.
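If you want to check these figures on your own card, here is a small sketch using the runtime API (it assumes device 0) that prints the SM count, warp size, and how many threads an SM can hold:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("SMs: %d, warp size: %d, max resident threads per SM: %d\n",
           prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerMultiProcessor);
    return 0;
}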

That’s nice. I understand the idea now.
I have spent a lot of time reading the documentation and I have more questions than answers about how the GPU works. :)) I hope you won’t be mad at these stupid questions.
Why do threads need to have 3 dimensions? And blocks and grids as well?
Why do I need to use something like this:

dim3 threads(16, 16, 1);
kernel<<<nr_block, threads>>>(...);

And not just simply

int threads = 16 * 16;
kernel<<<nr_block, threads>>>(...);

Indeed there is no strict “need” for multidimensional blocks/grids.
However, since the GPUs provide the feature anyway, it is exposed in CUDA for convenience and to save a few potentially expensive division/modulo operations.
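For example, with a 16x16 image tile a 2D block gives you both coordinates directly from threadIdx, while a flat 256-thread block has to recover them with a division and a modulo. A sketch (the image/tile names are just for illustration):

// 2D block: the hardware supplies both coordinates for free.
__global__ void tile2D(float *img, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    img[y * width + x] = 0.0f;
}

// Flat 1D block of 256 threads (grid still 2D): same mapping,
// but each thread pays a division and a modulo.
__global__ void tile1D(float *img, int width)
{
    int x = blockIdx.x * 16 + threadIdx.x % 16;
    int y = blockIdx.y * 16 + threadIdx.x / 16;
    img[y * width + x] = 0.0f;
}

// dim3 threads(16, 16, 1);   // 256 threads arranged as a 16x16 tile
// tile2D<<<blocks, threads>>>(d_img, width);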

One possible reason is that having more dimensions allows more threads or blocks to be allocated, since each dimension has its own maximum limit ([url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities[/url], Table 13, Technical Specifications per Compute Capability). For example, if you want to launch more than 1024 threads per block, you have to launch a two-dimensional block, since the maximum x- or y-dimension of a block is 1024. Likewise, if you want to launch more than 65535 thread blocks (e.g., 65536) on a compute capability 2.x GPU, you have to launch a two-dimensional grid, since the maximum number of thread blocks in each of the x, y, and z dimensions is 65535.
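On such an older GPU, one workaround sketch for the grid limit (the block counts here are only for illustration) is to spread the blocks over the x and y grid dimensions and rebuild a flat index inside the kernel:

// 131072 blocks exceed the 65535 per-dimension limit on CC 2.x,
// so use a 256 x 512 grid and flatten the block index in the kernel.
dim3 grid(256, 512);
dim3 block(256);
// kernel<<<grid, block>>>(...);
// Inside the kernel:
// int flatBlock = blockIdx.y * gridDim.x + blockIdx.x;
// int i = flatBlock * blockDim.x + threadIdx.x;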

The maximum number of threads per block is 1024 on all current GPUs. Using 2D thread bounds won’t help.

Thanks to SPWorley for pointing out the error. I edited my previous comment.

That’s very nice. Now it’s much clearer how things work.
I will continue reading, and if I have other questions, “I will be back” :))
Thank you for your help, guys :)