"Half-warps", scheduling, and branch divergence

MatthewMP · February 24, 2013, 5:39pm

On compute capability 2.x architecture, the Programming Guide says:

“A warp scheduler can issue an instruction to only half of the CUDA cores. To execute
an instruction for all threads of a warp, a warp scheduler must therefore issue the
instruction over two clock cycles for an integer or floating-point arithmetic instruction.”

Simple scenario:

Suppose that for a given warp, I assign one task to the first 16 threads, and another to the second 16 threads. e.g.

if (threadIdx.x % 32 < 16)
{  funcA();  }
else
{  funcB();  }

Question:

Will this still result in warp divergence due to branching? Does the Programming Guide quote above, about issuing instructions to only half of the cores at a time, mean that the scheduler can give one set of instructions (“funcA”) to one half-warp and another set (“funcB”) to the second half-warp simultaneously, i.e. “instruction parallelism by half-warp”?

seibert · February 24, 2013, 6:06pm

No, divergence is still a problem if you branch at the half-warp level. When a warp instruction is executed on compute capability 2.x, the entire warp is sent to 16 of the CUDA cores. CUDA cores are pipelined (as are all modern CPUs), so the 32 threads are queued up in two consecutive pipeline stages in those 16 CUDA cores. The instruction takes something like 16 to 24 clock ticks to propagate through the pipeline, but because many instructions are moving through the pipeline at once, that group of 16 CUDA cores will complete one warp instruction every two clock ticks.

If you aren’t familiar with pipelining in computer architecture, this is a reasonable summary:

http://en.wikipedia.org/wiki/Pipeline_(computing)
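
If you want to see the cost of half-warp branching on your own hardware, a rough timing sketch along these lines may help. Everything here is illustrative: funcA and funcB are hypothetical stand-ins for the two tasks, branchKernel and meanCycles are names made up for this example, and clock64() readings are only indicative because the compiler and scheduler are free to reorder work. On most devices I would expect the per-half-warp number to come out noticeably larger, since each warp has to run both paths one after the other, but treat that as something to measure rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two tasks: dependent arithmetic chains
// with deliberately different bodies.
__device__ float funcA(float x) {
    for (int i = 0; i < 1000; ++i) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float funcB(float x) {
    for (int i = 0; i < 1000; ++i) x = 0.5f * x + 1.0f / (x * x + 1.0f);
    return x;
}

// perHalfWarp == true : lower half of each warp calls funcA, upper half funcB.
// perHalfWarp == false: even warps call funcA, odd warps call funcB.
__global__ void branchKernel(float *out, long long *cycles, bool perHalfWarp) {
    int t = threadIdx.x;
    bool takeA = perHalfWarp ? (t % warpSize < warpSize / 2)
                             : ((t / warpSize) % 2 == 0);
    long long start = clock64();
    float r;
    if (takeA) r = funcA((float)t);
    else       r = funcB((float)t);
    out[t] = r;                        // keep the result live
    cycles[t] = clock64() - start;     // per-thread elapsed clock ticks
}

// Launches one block of two warps and returns the mean elapsed ticks per thread.
long long meanCycles(bool perHalfWarp) {
    const int threads = 64;
    float *dOut = nullptr;
    long long *dCycles = nullptr;
    cudaMalloc(&dOut, threads * sizeof(float));
    cudaMalloc(&dCycles, threads * sizeof(long long));
    branchKernel<<<1, threads>>>(dOut, dCycles, perHalfWarp);
    long long h[threads] = {0};
    cudaMemcpy(h, dCycles, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    cudaFree(dCycles);
    long long sum = 0;
    for (int i = 0; i < threads; ++i) sum += h[i];
    return sum / threads;
}

int main() {
    meanCycles(false);  // warm-up launch, result discarded
    printf("branch per half-warp: %lld ticks\n", meanCycles(true));
    printf("branch per warp:      %lld ticks\n", meanCycles(false));
    return 0;
}
```

The absolute numbers depend on the device and compiler version, so only the comparison between the two lines is meaningful.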

MatthewMP · February 24, 2013, 9:07pm

Thank you, seibert. So the half-warp talk is only a detail of computing/architecture. Thus, from the developer’s perspective, the detail about scheduling half-warps, etc., is not a way to get additional parallelism. In fact, it should have no impact whatsoever on the code you write. Right?

seibert · February 24, 2013, 10:48pm

Correct. The scheduler works with entire warps, and so that is how you should think about your code.
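
As a sketch of what thinking in whole warps can look like, here is one way to split two tasks so that every thread in a warp takes the same path. The names funcA, funcB, warpAlignedTasks, and countMismatches are hypothetical, the two tasks are trivial placeholders, and the kernel assumes the block size is a multiple of the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int funcA(int x) { return 2 * x; }    // hypothetical task A
__device__ int funcB(int x) { return x + 100; }  // hypothetical task B

// The condition depends only on the warp index, so all threads of a warp
// agree on it and no warp has to execute both paths.
__global__ void warpAlignedTasks(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int warpInBlock = threadIdx.x / warpSize;
    if (warpInBlock % 2 == 0) out[tid] = funcA(tid);
    else                      out[tid] = funcB(tid);
}

// Runs the kernel on one block and returns how many outputs differ from a
// host-side reference (or -1 if a CUDA call fails).
int countMismatches() {
    const int n = 128;
    int ws = 0;
    if (cudaDeviceGetAttribute(&ws, cudaDevAttrWarpSize, 0) != cudaSuccess) return -1;
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return -1;
    warpAlignedTasks<<<1, n>>>(dOut, n);
    int h[n];
    cudaError_t err = cudaMemcpy(h, dOut, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;
    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expected = ((i / ws) % 2 == 0) ? 2 * i : i + 100;
        if (h[i] != expected) ++bad;
    }
    return bad;
}

int main() {
    printf("mismatches: %d\n", countMismatches());
    return 0;
}
```

The trade-off is that the data layout has to cooperate: each task's work items need to be grouped in warp-sized chunks.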

One thing to keep in mind is that the warp size on all CUDA devices so far is 32, but NVIDIA has said that it may change in the future. If you need to use the size of a warp in your device code somewhere, then you should use the integer variable “warpSize,” which is provided by the compiler.
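
For example, a minimal sketch of deriving a thread's lane and warp index from warpSize instead of a hard-coded 32 might look like this. The kernel name laneAndWarp and the helper checkLaneAndWarp are made up for illustration; the host side reads the same value from the warpSize field of cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records its position inside its warp and its warp's index
// inside the block, using the built-in warpSize variable.
__global__ void laneAndWarp(int *lane, int *warp) {
    int t = threadIdx.x;
    lane[t] = t % warpSize;
    warp[t] = t / warpSize;
}

// Launches two warps and checks the last thread's lane and warp index
// against the warp size reported by the runtime.
bool checkLaneAndWarp() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    const int threads = 2 * prop.warpSize;
    int *dLane = nullptr, *dWarp = nullptr;
    if (cudaMalloc(&dLane, threads * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dWarp, threads * sizeof(int)) != cudaSuccess) {
        cudaFree(dLane);
        return false;
    }
    laneAndWarp<<<1, threads>>>(dLane, dWarp);
    int lastLane = -1, lastWarp = -1;
    cudaMemcpy(&lastLane, dLane + threads - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastWarp, dWarp + threads - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dLane);
    cudaFree(dWarp);
    printf("warp size %d: last thread is lane %d of warp %d\n",
           prop.warpSize, lastLane, lastWarp);
    return lastLane == prop.warpSize - 1 && lastWarp == 1;
}

int main() {
    return checkLaneAndWarp() ? 0 : 1;
}
```

Note that warpSize is a run-time value in device code, not a compile-time constant, so it cannot be used for things like static array sizes.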