I’ve been studying the NVIDIA documentation and journal publications to build a simple picture in my mind of how execution actually proceeds on a GPU using CUDA. I’ve run some timing experiments to test my ideas, but I do still have some questions (bulleted below) which I hope others might help with. At any rate, I offer this to others learning the system, and I welcome corrections.
[indent]SM is streaming multiprocessor; SP is streaming processor, or core.
CC is compute capability.
SFU is special function unit.
DP is double precision unit.[/indent]
Blocks are the first level of execution granularity. A block is assigned to a single SM, and an SM can execute up to 8 blocks concurrently, depending on other resource constraints. Blocks are not assigned or executed in any specified order.
The warp is the next level of execution granularity. Blocks are split into warps of 32 threads each; one warp in a block can have fewer than 32 threads if the block size is not a multiple of 32. An SM can manage a pool of 32 warps (CC >= 1.2). The threads of a warp are distributed across all of the SPs in the SM. All threads of a warp execute the same instruction in parallel, so it is useful to think of the warp as the SM’s actual execution unit, and it is warps which are managed by the SM’s scheduler.
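To make the block-to-warp split concrete, here is a sketch (a hypothetical kernel, not from any NVIDIA sample) of how a thread can compute which warp of its block it belongs to, assuming the usual linearization in which threadIdx.x varies fastest:

```
// Sketch: derive a thread's linear index within its block, then its warp
// number and its lane (position) within that warp. Assumes threads are
// linearized with threadIdx.x varying fastest, which is how the hardware
// appears to group threads into warps.
__global__ void whoAmI(int *warpOut, int *laneOut)
{
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;

    int warp = linear / 32;   // which warp of the block this thread is in
    int lane = linear % 32;   // position of this thread within its warp

    warpOut[linear] = warp;
    laneOut[linear] = lane;
}
```

For example, in a 1-D block of 128 threads, threads 0–31 would form warp 0, threads 32–63 warp 1, and so on.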
Since a given SP will usually have 4 threads, each instruction execution consumes 4 cycles (assuming no resource availability issues). In conditional branches the instruction stream is still uniform, but threads which have not met the branch criterion are masked off and do not execute the instructions until the paths reconverge. When one or more threads of a warp are waiting on an external resource (usually memory, but also possibly the SFU or DP) the entire warp is marked as waiting and does not execute until all of its threads are again ready.
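A small illustration of the masking behavior described above (an invented kernel for demonstration): when threads of the same warp take different sides of a branch, the warp pays for both paths, whereas a branch that splits only between warps costs nothing extra.

```
// Sketch of warp divergence. Lane parity splits each warp down the middle,
// so the warp must issue BOTH branch bodies, masking off the threads that
// did not take each path.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i % 2 == 0)
        data[i] *= 2.0f;   // even lanes active; odd lanes masked off
    else
        data[i] += 1.0f;   // odd lanes active; even lanes masked off

    // By contrast, a branch on the warp index diverges only BETWEEN warps,
    // so every warp still follows a single path:
    // if ((i / 32) % 2 == 0) { ... } else { ... }
}
```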
[indent]• It is not specified, as far as I know, but threads in a warp appear to have contiguous thread IDs – can this be depended upon?
• Is the order of assignment of threads within a warp to SPs specified? Breadth first or depth first? (This has bearing on the next question.)
• If all SPs contain fewer than 4 threads, does the instruction execution necessarily take 4 cycles? For example, if a block contains 1 thread, can the instruction execution take 1 cycle?
• If all threads of a warp follow the same path of an n-way conditional branch, are the instruction streams for the other n-1 paths transmitted to the warp? (This would seem to be a necessity for the first optimization rule listed below.)[/indent]
(A half-warp is important from the point of view of optimal shared memory access, but otherwise has no special place in the execution hierarchy.)
The lowest level of execution granularity is the thread. It is the thread which, from the programmer’s point of view, actually executes the kernel. The thread identifies itself by reference to the [font=“Courier New”]blockIdx[/font] and [font=“Courier New”]threadIdx[/font] structures. These structures specify a 1- to 5-dimensional coordinate space, of which up to 3 dimensions partition the up to 512 possible threads in any given block, and the other 2 dimensions locate the block in the grid of blocks; the programmer typically uses the thread’s location in the given space to determine the thread’s behavior, memory inputs and memory outputs. Threads within a block may communicate with each other via shared memory, and their places in the kernel’s instruction stream may be synchronized.
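The typical use of those structures looks something like the following (an illustrative kernel of my own, not from the documentation): each thread combines blockIdx, blockDim and threadIdx into a unique global element index and works on that element.

```
// Sketch: each thread computes one element of out = a * in.
__global__ void scale(float *out, const float *in, float a, int n)
{
    // Unique global index from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)             // guard: the grid is rounded up past n below
        out[i] = a * in[i];
}

// Host side (sketch): round the grid size up so every element is covered.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// scale<<<blocks, threads>>>(d_out, d_in, 2.0f, n);
```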
Some of the most important performance optimizations are determined by kernel code which:
[indent]• Allows all threads in a warp to follow the same branch paths;
• Results in global memory accesses which can be coalesced into aligned, contiguous blocks of appropriate size;
• Limits global and local(!) memory usage, using shared memory as much as possible;
• Prefers computation to memory access;
• Accesses shared memory without bank conflicts.[/indent]
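To illustrate the coalescing rule, here are two contrasting copy kernels (invented for this post). In the first, consecutive threads of a half-warp touch consecutive 4-byte words, so the accesses can merge into aligned transactions; in the second, a stride breaks that pattern.

```
// Coalesced: thread i reads and writes word i, so a half-warp touches one
// aligned, contiguous region and its accesses merge into few transactions.
__global__ void copyGood(float *dst, const float *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Uncoalesced: consecutive threads touch words `stride` apart, so the
// half-warp's loads cannot be merged and are serviced separately.
__global__ void copyBad(float *dst, const float *src, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i * stride];
}
```

Timing these two against each other is an easy way to see the coalescing penalty directly.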