"Half-warps", scheduling, and branch divergence

No, divergence is still a problem if you branch at the half-warp level. When a warp instruction is executed on compute capability 2.x, the entire warp is sent to a group of 16 CUDA cores. CUDA cores are pipelined (as are all modern CPUs), so what happens is that the 32 threads are queued up in two consecutive pipeline stages in those 16 CUDA cores. Both halves are still executing the same instruction, so if they take different branches, the warp has to run both paths one after the other. Each instruction takes something like 16 to 24 clock ticks to propagate through the pipeline, but because many instructions are moving through the pipeline at once, that group of 16 CUDA cores will complete one warp instruction every two clock ticks.
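
To make the two points above concrete, here is a rough host-side sketch in plain C++ (not device code, and not a model of the real scheduler). The function names `count_divergent_warps` and `ticks_to_finish` are made up for illustration. The first counts how many warps contain threads that disagree on a branch condition, assuming threads are grouped into warps by consecutive index. The second is just the pipeline arithmetic: a fixed latency for the first warp instruction, plus one issue interval for each one after it.

```cpp
#include <cassert>
#include <functional>

// Count the warps in which not all threads agree on a branch predicate.
// Assumption: warps are formed from consecutive thread indices
// (threads 0..warp_size-1 form warp 0, and so on).
int count_divergent_warps(int num_threads, int warp_size,
                          const std::function<bool(int)>& takes_branch) {
    assert(warp_size > 0 && num_threads % warp_size == 0);
    int divergent = 0;
    for (int base = 0; base < num_threads; base += warp_size) {
        const bool first = takes_branch(base);
        for (int lane = 1; lane < warp_size; ++lane) {
            if (takes_branch(base + lane) != first) {
                ++divergent;  // at least two lanes disagree: this warp diverges
                break;
            }
        }
    }
    return divergent;
}

// Clock ticks until the last of num_warps back-to-back warp instructions
// leaves the pipeline: the first one pays the full latency, and each later
// one completes issue_interval ticks after its predecessor.
int ticks_to_finish(int num_warps, int latency, int issue_interval) {
    assert(num_warps > 0);
    return latency + (num_warps - 1) * issue_interval;
}
```

With 128 threads and a warp size of 32, a condition like `(tid % 32) < 16` splits every warp down the middle, so all 4 warps diverge, while a warp-aligned condition like `(tid / 32) % 2 == 0` gives 0 divergent warps. For the pipeline, if you assume a latency of 20 ticks (picked from the 16 to 24 range mentioned above) and one warp instruction completing every 2 ticks, then 10 independent warp instructions finish after 20 + 9 * 2 = 38 ticks, not 10 * 20.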

If you aren’t familiar with pipelining in computer architecture, this is a reasonable summary:

[url]http://en.wikipedia.org/wiki/Pipeline_(computing)[/url]