Simple summary of CUDA execution model
An attempt to simplify and summarize various sources on execution

I’ve been studying the NVIDIA documentation and journal publications to build a simple picture in my mind of how execution actually proceeds on a GPU using CUDA. I’ve run some timing experiments to test my ideas, but I do still have some questions (listed below) which I hope others might help with. At any rate, I offer this to others learning the system and I welcome corrections.

Execution summary

[indent]SM - streaming multiprocessor; SP - streaming processor, or core.
CC - compute capability.
SFU - special function unit.
DP - double precision unit.[/indent]
Blocks are the first level of execution granularity. A block is assigned to a single SM. An SM executes up to 8 blocks in parallel, depending on other constraints. Blocks are not assigned or executed in any specified order.

The warp is the next level of execution granularity. Blocks are split into warps of 32 threads each; one warp in a block can have fewer than 32 threads if the block size is not an even multiple of 32. An SM can manage a pool of 32 warps (CC >= 1.2). The threads of a warp are distributed across all of the SPs in the SM. All threads of a warp execute the same instruction in parallel, so it is useful to think of the warp as the SM’s actual execution unit, and it is warps which are managed by the SM’s scheduler.
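To make the warp granularity concrete, here is a minimal sketch of how a thread could compute its warp and lane within its block, assuming the linearized thread numbering described in the programming guide (the kernel and array names are made up for illustration, and a single-block launch is assumed):

[code]
// Illustrative sketch only. Assumes the linear thread numbering from the
// programming guide and a launch with a single block.
__global__ void warpInfo(int *warpOf, int *laneOf)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;   // linearized thread ID
    warpOf[tid] = tid / 32;   // which warp within the block
    laneOf[tid] = tid % 32;   // position (lane) within that warp, 0..31
}
[/code]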

Since a given SP is usually responsible for 4 of a warp’s threads, each instruction execution consumes 4 cycles (assuming no resource availability issues). In conditional branches the instruction stream is still uniform, but threads which have not met the branch criterion sit idle rather than execute the instructions until the branch completes. When one or more threads of a warp are waiting on an external resource (usually memory, but also possibly the SFU or DP), the entire warp is marked as waiting and does not execute until all of its threads are again ready.
[indent]• It is not specified, as far as I know, but threads in a warp appear to have contiguous thread IDs – can this be depended upon?
• Is the order of assignment of threads within a warp to SPs specified? Breadth first or depth first? (This has bearing on the next question.)
• If all SPs contain fewer than 4 threads, does the instruction execution necessarily take 4 cycles? For example, if a block contains 1 thread, can the instruction execution take 1 cycle?
• If all threads of a warp follow the same path of an n-way conditional branch, are the instruction streams for the other n-1 paths transmitted to the warp? (This would seem to be a necessity for the first optimization rule listed below.) [/indent]

(A half-warp is important from the point of view of optimal shared memory access, but otherwise has no special place in the execution hierarchy.)

The lowest level of execution granularity is the thread. It is the thread which, from the programmer’s point of view, actually executes the kernel. The thread identifies itself by reference to the [font=“Courier New”]blockIdx[/font] and [font=“Courier New”]threadIdx[/font] structures. These structures specify a 1- to 5-dimensional coordinate space: up to 3 dimensions partition the (at most) 512 threads in any given block, and the other 2 dimensions locate the block in the grid of blocks. The programmer typically uses the thread’s location in this space to determine the thread’s behavior, memory inputs and memory outputs. Threads within a block may communicate with each other via shared memory, and their places in the kernel’s instruction stream may be synchronized.
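As a minimal sketch of how this coordinate space is typically used (the kernel, array names, and block size here are just illustrative, not from any particular source), each thread maps its indices to one data element:

[code]
// Illustrative only: each thread computes a global index from blockIdx and
// threadIdx and handles exactly one element. Names (a, b, c, n) are made up.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        c[i] = a[i] + b[i];
}

// Host side: one thread per element, 256 threads per block (within the 512 limit).
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
[/code]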

Some of the most important performance optimizations are determined by kernel code which (a rough sketch of a couple of these rules follows the list):
[indent]• Allows all threads in a warp to follow the same branch paths;
• Results in global memory accesses which can be coalesced into aligned, contiguous blocks of appropriate size;
• Limits global and local(!) memory usage, using shared memory as much as possible;
• Prefers computation to memory access;
• Accesses shared memory without bank conflicts.

[/indent]
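Here is the rough sketch promised above, illustrating the coalescing and bank-conflict rules in one common pattern (the kernel name, the 256-thread tile, and the block-local reversal are assumptions chosen only for illustration):

[code]
// Illustrative pattern only: consecutive threads read consecutive global
// addresses (coalesced), stage a tile in shared memory, synchronize, then
// consume it with a stride-1 shared-memory access pattern (no bank conflicts).
// Launch with 256 threads per block; one tile per block.
__global__ void reverseTiles(const float *in, float *out)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // coalesced global read
    tile[threadIdx.x] = in[i];

    __syncthreads();                                 // wait until the tile is loaded

    // Each block writes its tile back out reversed. The writes are still
    // contiguous per warp, so they also coalesce.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
[/code]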

Answering your four questions in order:

  1. Yes.
  2. Assuming I’m understanding your question correctly, breadth first. All warps must be resident on the SM simultaneously to guarantee __syncthreads()’ behavior. Depth first would basically prevent shared memory from being used except as a per-warp scratchpad.
  3. It takes four cycles regardless.
  4. I’m not sure I understand your question. If all threads in a warp take the same branch at a conditional, there’s no performance penalty besides the cost of the jump.

Thanks for the very quick reply. WRT item 2: I wanted to know if, say, you had a block size of 15 – would 7 SPs be responsible for 2 threads, and 1 for 1 thread, or would 3 SPs handle 4 threads, 1 handle 3 threads, and 4 handle none? The issue is moot (for performance) given your answer to item 3, but I’m still curious.

You did answer my question on item 4.

Since warps are guaranteed to have contiguous thread IDs, I assume there’s a fixed mapping between a lane in a warp and which SP is used in which cycle for a given operation.

One comment on this discussion. It’s not correct to say, without qualification, that instructions “take” 4 cycles, and doing so can lead to misunderstandings. The instruction issue latency is 4 cycles – in other words, it takes 4 cycles to issue an instruction for a 32-thread warp (that’s because there are 8 SPs per SM).

However, like any modern processor, the SPs are pipelined. The execution latency of this pipeline depends on the instruction, but in general you can assume it is significantly longer than the 4 cycle issue latency. The MAD/ADD/MUL instructions, for example, have a 20-24 cycle latency, depending on the number of register operands. This is why you need at least 6 warps per SM to cover the arithmetic pipeline latency (6 warps * 4 cycles issue latency/warp = 24 cycles of “issuing” to cover up to 24 cycles of pipeline latency).
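Just to restate that arithmetic explicitly (treating the ~24-cycle pipeline and the 4-cycle issue latency as the assumed inputs):

[code]
// Back-of-the-envelope check of the figure above. The latencies are the
// assumptions stated in the post, not queried from hardware.
int warpsToHideLatency(int pipelineLatencyCycles, int issueLatencyCycles)
{
    // ceiling division: how many warps must be issuing back-to-back
    return (pipelineLatencyCycles + issueLatencyCycles - 1) / issueLatencyCycles;
}
// warpsToHideLatency(24, 4) == 6, matching the "at least 6 warps per SM" rule.
[/code]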

In general for data-parallel processors like GPUs, the most appropriate performance metric to talk about in this context is throughput, not latency. (They are sometimes called throughput processors.) The MAD throughput of an SM is 1 32-thread instruction every 4 cycles, or 8 MADs per cycle.

In the latest CUDA programming guide versions, we have changed to focus on instruction throughput for this reason. However, it is important to understand that the SP pipelining means that low-occupancy SMs (fewer than 6 warps) can suffer from pipeline stalls, reducing achieved throughput.

Mark

PS, remember: higher occupancy does not necessarily equate to higher performance, but low-occupancy SMs cannot adequately hide latency.

Could you please confirm that a block is assigned to a single SM rather than to a single SP? I am confused because of your next statement

I probably do not understand something…

Multiple blocks can be resident on an SM at the same time, depending on the resources required per block. They won’t execute in parallel in the same clock, but the SM will context switch between blocks.

I see. Thank you. I think I understand now how it works.