How exactly is a thread executed on a GPU?

Hey boys,

I have some problems understanding how CUDA works with threads.
Here is a picture of how I see things.

[url]http://i63.tinypic.com/i1grnp.png[/url]

Can somebody explain to me how exactly it works?
I don’t know if I am right, but here is how I see things…
Each grid is executed by an SM; after that, the warp scheduler takes the grid and divides it into warps of 32 threads, and each warp is executed one at a time.
I don’t know what exactly the SP does… does each SP execute a single thread from a warp?
Is it possible to execute more warps at a time?

Sorry for my English and bad explanation. I am new to CUDA and I try to learn as much as I can.
It would be great if someone could explain what happens when a grid is executed by an SM, from start to end.
Thank you very much for your help.

A better diagram is in the CUDA programming guide.

Think of unlayering each level of abstraction… they’re easier to understand separately.

You launch a kernel with a grid of blocks of work. The GPU executes your code on all the blocks and returns.

How does the GPU execute the blocks of the grid? It gives one or more blocks to each SM and tells it to work. When the SM finishes a block, the GPU gives it another to work on. When all blocks are done, the GPU returns.

How does the SM execute a block? Each block consists of one or more warps. There may be more than one block resident, but the SM basically makes a big queue of all the warps from all of its blocks. It takes one warp from the queue every tick of the clock and executes it for one clock. The next tick, it executes the next warp in the queue (it doesn’t have to wait for the first warp to finish!), and on the next tick, another warp, and so on. Warps that finish their one tick of computation get put back onto the queue to wait for their chance to evaluate their next instruction. A warp can take many ticks of latency, even hundreds, to finish an instruction, especially if it’s waiting for memory. When all the warps from one block are done, the SM reports the block as finished and may get a new one.

How is a warp executed? A warp is 32 threads wide. The warp is executed for one instruction (well, it could be two from dual-issue, but ignore that). Say the instruction is a “C=A+B”. Then the 32 threads each read “A” and “B” from registers, and the 32 SPs are given those 32 A and B values, and do the add. So the SPs are “doing the work”… they all perform the same instruction on each thread’s data.
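To make that concrete, here is a minimal sketch (the names and sizes are just for illustration) of a vector-add kernel: every thread computes one element of C = A + B, and when a warp of 32 consecutive threads reaches the add instruction, the 32 SPs carry it out in lockstep.

// Minimal sketch: each thread owns one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        C[i] = A[i] + B[i];                         // the "C = A + B" instruction
}

// Launch: a grid of blocks, 256 threads (8 warps) per block.
// vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);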

Very nice explanation :D
Also, let’s say I have 192 SPs and each warp needs a single clock tick to execute.
Does that mean the other 160 SPs sit idle and don’t execute instructions? Will it use just 32 SPs? :)
Or is it possible to run more warps at once?

Opening up the abstraction a bit, the SM does not have one queue of warps, it has four queues. In Maxwell and Pascal each queue has its own 32 SPs. Each queue runs independently and schedules one of its warps to its own SPs in isolation. All 4 queues run concurrently so all 128 SPs are usually busy.

Kepler had a more complex system where pairs of queues could “share” an extra set of 32 SPs, so there were 192 SPs per SM. This was not as efficient as Maxwell/Pascal’s simpler SM architecture, mostly since there was not enough register bandwidth to feed all the SPs 3-argument operations (like FMA) every clock.
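If you want to check these figures on your own card, here is a small sketch using the runtime API (it assumes device 0) that prints the SM count, warp size, and how many threads an SM can hold:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("SMs: %d, warp size: %d, max resident threads per SM: %d\n",
           prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerMultiProcessor);
    return 0;
}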

That’s nice. I understand the idea now.
I have spent a lot of time reading the documentation and I have more questions than answers about how the GPU works. :)) I hope you won’t be mad at these stupid questions.
Why do threads need to have 3 dimensions? And blocks and grids as well?
Why do I need to use something like this:

dim3 threads(16, 16, 1);
kernel<<<nr_block, threads>>>(...);

And not just simply

int threads = 16 * 16;
kernel<<<nr_block, threads>>>(...);

Indeed there is no strict “need” for multidimensional blocks/grids.
However, since the GPUs provide the feature anyway, it is exposed in CUDA for convenience and to save a few potentially expensive division/modulo operations.
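For example, with a 16x16 image tile a 2D block gives you both coordinates directly from threadIdx, while a flat 256-thread block has to recover them with a division and a modulo. A sketch (the image/tile names are just for illustration):

// 2D block: the hardware supplies both coordinates for free.
__global__ void tile2D(float *img, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    img[y * width + x] = 0.0f;
}

// Flat 1D block of 256 threads (grid still 2D): same mapping,
// but each thread pays a division and a modulo.
__global__ void tile1D(float *img, int width)
{
    int x = blockIdx.x * 16 + threadIdx.x % 16;
    int y = blockIdx.y * 16 + threadIdx.x / 16;
    img[y * width + x] = 0.0f;
}

// dim3 threads(16, 16, 1);   // 256 threads arranged as a 16x16 tile
// tile2D<<<blocks, threads>>>(d_img, width);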

One possible reason is that having more dimensions allows more threads or blocks to be allocated, since each dimension has its own maximum limit ([url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities[/url], Table 13, Technical Specifications per Compute Capability). For example, if you want to launch more than 1024 threads per block, you have to launch a two-dimensional block, since the maximum x- or y-dimension of a block is 1024. Likewise, if you want to launch more than 65535 thread blocks (e.g., 65536) on a compute capability 2.x GPU, you have to launch a two-dimensional grid, since the maximum number of thread blocks in each of the x, y, and z dimensions is 65535.
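On such an older GPU, one workaround sketch for the grid limit (the block counts here are only for illustration) is to spread the blocks over the x and y grid dimensions and rebuild a flat index inside the kernel:

// 131072 blocks exceed the 65535 per-dimension limit on CC 2.x,
// so use a 256 x 512 grid and flatten the block index in the kernel.
dim3 grid(256, 512);
dim3 block(256);
// kernel<<<grid, block>>>(...);
// Inside the kernel:
// int flatBlock = blockIdx.y * gridDim.x + blockIdx.x;
// int i = flatBlock * blockDim.x + threadIdx.x;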

The maximum number of threads per block is 1024 on all current GPUs. Using 2D thread bounds won’t help.

Thanks to SPWorley for pointing out the error. I edited my previous comment.

That’s very nice. Now it’s much clearer how things work.
I will continue reading, and if I have other questions, “I will be back” :))
Thank you for your help, guys :)