Why only half-warp?

Hello.

I think it is a silly question, but I couldn’t figure out the answer. Why is only the half-warp considered when measuring coalesced/uncoalesced memory access, divergent branching, and so on?

No one has fully explained this, but we assume it is because (in the pre-Fermi world) the half-warp is some kind of low-level scheduling unit in the hardware. At the software level, we have to treat the warp as the unit of scheduling, but the hardware is free to use a slightly finer granularity if that makes sense for the implementation.

At any given instant of time only 16 threads (in a warp) will get to execute on the SMs. This is the main reason the half-warp is treated as the unit for memory accesses. (This applies only to pre-Fermi hardware.)
Now, if each of those 16 threads accesses 4 B of data (which is common), the total access size is 16 * 4 = 64 B. Internally, the hardware is smart enough to look at all 16 accesses and combine them into fewer transactions if they conform to “certain patterns”.

Hi teju,

I didn’t understand this: “At any given instant of time only 16 threads (in a warp) will get to execute on the SMs.”.

The SM has 8 SPs, so isn’t it capable of executing only 8 threads at the same time?

The answer to this question will help me a lot!

The stream processors are pipelined, so in fact many warps are in various stages of execution at any given time. The job of the scheduler on the multiprocessor is to grab warps that are not waiting on global memory reads and stuff them into the pipeline to begin executing their next instruction. Although a multiprocessor can complete an entire warp instruction (with some exceptions) every 4 clock cycles, it in fact takes many more than 4 clock cycles for a given warp instruction to go from beginning to end.

Every modern CPU works this way, except single-threaded code is much more likely to have “pipeline hazards”, where the next instruction in the thread depends on the one before it in such a way that you can’t stuff it into the pipeline next. By encouraging large numbers of independent instructions (i.e., threads don’t usually talk to each other), a CUDA device can keep pipelines full without all the instruction reordering fanciness (and therefore transistor cost) of a CPU.
