Simple summary of CUDA execution model
An attempt to simplify and summarize various sources on execution

I’ve been studying the NVIDIA documentation and journal publications to build a simple picture in my mind of how execution actually proceeds on a GPU using CUDA. I’ve run some timing experiments to test my ideas, but I do still have some questions (listed below) which I hope others might help with. At any rate, I offer this to others learning the system and I welcome corrections.

Execution summary

[indent]SM - streaming multiprocessor; SP - streaming processor, or core.
CC - compute capability.
SFU - special function unit.
DP - double precision unit.[/indent]
Blocks are the first level of execution granularity. A block is assigned to a single SM. An SM executes up to 8 blocks in parallel, depending on other constraints. Blocks are not assigned or executed in any specified order.

The warp is the next level of execution granularity. Blocks are split into warps of 32 threads each; one warp in a block can have fewer than 32 threads if the block size is not an even multiple of 32. An SM can manage a pool of 32 warps (CC >= 1.2). The threads of a warp are distributed across all of the SPs in the SM. All threads of a warp execute the same instruction in parallel, so it is useful to think of the warp as the SM’s actual execution unit, and it is warps which are managed by the SM’s scheduler.
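To make the warp granularity concrete, here is a minimal sketch of how a thread could compute its warp and lane within its block, assuming the linearized thread numbering described in the programming guide (the kernel and array names are made up for illustration, and a single-block launch is assumed):

[code]
// Illustrative sketch only. Assumes the linear thread numbering from the
// programming guide and a launch with a single block.
__global__ void warpInfo(int *warpOf, int *laneOf)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;   // linearized thread ID
    warpOf[tid] = tid / 32;   // which warp within the block
    laneOf[tid] = tid % 32;   // position (lane) within that warp, 0..31
}
[/code]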

Since a given SP is usually responsible for 4 of a warp’s threads, each instruction execution consumes 4 cycles (assuming no resource availability issues). In conditional branches the instruction stream is still uniform, but threads which have not met the branch criterion sit idle rather than execute the instructions until the branch completes. When one or more threads of a warp are waiting on an external resource (usually memory, but also possibly the SFU or DP), the entire warp is marked as waiting and does not execute until all of its threads are again ready.
[indent]• It is not specified, as far as I know, but threads in a warp appear to have contiguous thread IDs – can this be depended upon?
• Is the order of assignment of threads within a warp to SPs specified? Breadth first or depth first? (This has bearing on the next question.)
• If all SPs contain fewer than 4 threads, does the instruction execution necessarily take 4 cycles? For example, if a block contains 1 thread, can the instruction execution take 1 cycle?
• If all threads of a warp follow the same path of an n-way conditional branch, are the instruction streams for the other n-1 paths transmitted to the warp? (This would seem to be a necessity for the first optimization rule listed below.) [/indent]

(A half-warp is important from the point of view of optimal shared memory access, but otherwise has no special place in the execution hierarchy.)

The lowest level of execution granularity is the thread. It is the thread which, from the programmer’s point of view, actually executes the kernel. The thread identifies itself by reference to the [font=“Courier New”]blockIdx[/font] and [font=“Courier New”]threadIdx[/font] structures. These structures specify a 1- to 5-dimensional coordinate space: up to 3 dimensions partition the (at most) 512 threads in any given block, and the other 2 dimensions locate the block in the grid of blocks. The programmer typically uses the thread’s location in this space to determine the thread’s behavior, memory inputs and memory outputs. Threads within a block may communicate with each other via shared memory, and their places in the kernel’s instruction stream may be synchronized.
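As a minimal sketch of how this coordinate space is typically used (the kernel, array names, and block size here are just illustrative, not from any particular source), each thread maps its indices to one data element:

[code]
// Illustrative only: each thread computes a global index from blockIdx and
// threadIdx and handles exactly one element. Names (a, b, c, n) are made up.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        c[i] = a[i] + b[i];
}

// Host side: one thread per element, 256 threads per block (within the 512 limit).
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
[/code]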

Some of the most important performance optimizations are determined by kernel code which (a rough sketch of a couple of these rules follows the list):
[indent]• Allows all threads in a warp to follow the same branch paths;
• Results in global memory accesses which can be coalesced into aligned, contiguous blocks of appropriate size;
• Limits global and local(!) memory usage, using shared memory as much as possible;
• Prefers computation to memory access;
• Accesses shared memory without bank conflicts.

[/indent]
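Here is the rough sketch promised above, illustrating the coalescing and bank-conflict rules in one common pattern (the kernel name, the 256-thread tile, and the block-local reversal are assumptions chosen only for illustration):

[code]
// Illustrative pattern only: consecutive threads read consecutive global
// addresses (coalesced), stage a tile in shared memory, synchronize, then
// consume it with a stride-1 shared-memory access pattern (no bank conflicts).
// Launch with 256 threads per block; one tile per block.
__global__ void reverseTiles(const float *in, float *out)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // coalesced global read
    tile[threadIdx.x] = in[i];

    __syncthreads();                                 // wait until the tile is loaded

    // Each block writes its tile back out reversed. The writes are still
    // contiguous per warp, so they also coalesce.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
[/code]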

Answering your four questions in order:

  1. Yes.
  2. Assuming I’m understanding your question correctly, breadth first. All warps must be resident on the SM simultaneously to guarantee __syncthreads()’ behavior. Depth first would basically prevent shared memory from being used except as a per-warp scratchpad.
  3. It takes four cycles regardless.
  4. I’m not sure I understand your question. If all threads in a warp take the same branch at a conditional, there’s no performance penalty besides the cost of the jump.

Thanks for the very quick reply. WRT item 2: I wanted to know if, say, you had a block size of 15 – would 7 SPs be responsible for 2 threads, and 1 for 1 thread, or would 3 SPs handle 4 threads, 1 handle 3 threads, and 4 handle none? The issue is moot (for performance) given your answer to item 3, but I’m still curious.

You did answer my question on item 4.

Since warps are guaranteed to have contiguous thread IDs, I assume there’s a fixed mapping between a lane in a warp and which SP is used in which cycle for a given operation.

One comment on this discussion. It’s not correct to say, without qualification, that instructions “take” 4 cycles, and doing so can lead to misunderstandings. The instruction issue latency is 4 cycles – in other words, it takes 4 cycles to issue an instruction for a 32-thread warp (that’s because there are 8 SPs per SM).

However, like any modern processor, the SPs are pipelined. The execution latency of this pipeline depends on the instruction, but in general you can assume it is significantly longer than the 4 cycle issue latency. The MAD/ADD/MUL instructions, for example, have a 20-24 cycle latency, depending on the number of register operands. This is why you need at least 6 warps per SM to cover the arithmetic pipeline latency (6 warps * 4 cycles issue latency/warp = 24 cycles of “issuing” to cover up to 24 cycles of pipeline latency).
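Just to restate that arithmetic explicitly (treating the ~24-cycle pipeline and the 4-cycle issue latency as the assumed inputs):

[code]
// Back-of-the-envelope check of the figure above. The latencies are the
// assumptions stated in the post, not queried from hardware.
int warpsToHideLatency(int pipelineLatencyCycles, int issueLatencyCycles)
{
    // ceiling division: how many warps must be issuing back-to-back
    return (pipelineLatencyCycles + issueLatencyCycles - 1) / issueLatencyCycles;
}
// warpsToHideLatency(24, 4) == 6, matching the "at least 6 warps per SM" rule.
[/code]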

In general for data-parallel processors like GPUs, the most appropriate performance metric to talk about in this context is throughput, not latency. (They are sometimes called throughput processors.) The MAD throughput of an SM is 1 32-thread instruction every 4 cycles, or 8 MADs per cycle.

In the latest CUDA programming guide versions, we have changed to focus on instruction throughput for this reason. However, it is important to understand that the SP pipelining means that low-occupancy SMs (fewer than 6 warps) can suffer from pipeline stalls, reducing achieved throughput.

Mark

PS, remember: higher occupancy does not necessarily equate to higher performance, but low-occupancy SMs cannot adequately hide latency.

Could you please confirm that a block is assigned to a single SM rather than to a single SP? I am confused because of your next statement

I probably do not understand something…

Multiple blocks can be resident on an SM at the same time, depending on the resources required per block. They won’t execute in parallel in the same clock, but the SM will context switch between blocks.

I see. Thank you. I think I understand now how it works.