Why only half-warp?

The stream processors are pipelined, so many warps are in various stages of execution at any given time. The job of the scheduler on the multiprocessor is to grab warps that are not waiting on global memory reads and stuff them into the pipeline to begin executing their next instruction. Although a multiprocessor can complete an entire warp instruction (with some exceptions) every 4 clock cycles, any given warp instruction takes many more than 4 clock cycles to travel from the beginning of the pipeline to the end.
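To put rough numbers on that, here is a back-of-the-envelope sketch. The 4-cycle issue interval is the figure from the paragraph above; the 24-cycle start-to-finish latency is only an assumed value for illustration, and the function name is made up for this post.

```python
import math

THREADS_PER_WARP = 32

def warps_to_hide_latency(latency_cycles, issue_cycles=4):
    """Warps the scheduler must have ready so that one can issue every
    issue_cycles while earlier instructions are still in the pipeline."""
    return math.ceil(latency_cycles / issue_cycles)

# Assumed latency, chosen for illustration only (not a measured figure).
latency = 24
warps = warps_to_hide_latency(latency)
print(warps, "warps =", warps * THREADS_PER_WARP, "threads")
```

Under those assumptions the scheduler needs about 6 warps in flight per multiprocessor before the pipeline stops having bubbles. With a longer latency the count scales up proportionally.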

Every modern CPU works this way, except that single-threaded code is much more likely to hit “pipeline hazards”, where the next instruction in the thread depends on the one before it in such a way that you can’t stuff it into the pipeline right behind it. By encouraging large numbers of independent instructions (i.e., threads that don’t usually talk to each other), a CUDA device can keep its pipelines full without all the instruction reordering fanciness (and therefore transistor cost) of a CPU.
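A toy model makes the hazard point concrete. This is a simplified sketch, not how the hardware scheduler actually works: it assumes one instruction can issue per cycle, every instruction takes `depth` cycles to finish, and each thread's next instruction depends on its previous one. All names are hypothetical.

```python
def cycles_to_finish(num_threads, instrs_per_thread, depth):
    """Simulate a single pipeline that issues at most one instruction per
    cycle. A thread may issue its next instruction only after its previous
    one has finished (a dependency on every instruction)."""
    ready_at = [0] * num_threads               # earliest cycle each thread may issue
    remaining = [instrs_per_thread] * num_threads
    cycle = 0
    last_done = 0
    while any(remaining):
        for t in range(num_threads):
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + depth    # result available after `depth` cycles
                last_done = cycle + depth
                break                          # only one issue slot per cycle
        cycle += 1
    return last_done

one = cycles_to_finish(num_threads=1, instrs_per_thread=4, depth=6)
many = cycles_to_finish(num_threads=6, instrs_per_thread=4, depth=6)
print("1 thread: 4 instructions in", one, "cycles")
print("6 threads: 24 instructions in", many, "cycles")
```

In this model a single dependent thread leaves the pipeline mostly empty, while six independent threads do six times the work in only a few more cycles, with no reordering logic needed.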