Questions about control structure

I have a question about control structures, such as if statements.

I am aware of the issues with divergent branching; that is not what my question is about.

My question is: how much time does the branch itself take? Here is a simple and well-known example:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

// label 1

if ( idx < 512 )

// label 2

It seems obvious that the comparison is evaluated on each core in one or a few cycles. What about the resulting branch? Does the warp scheduler take one cycle to perform the branch for each core, or does it serialize it? For a warp of 32 threads, how many cycles elapse between label 1 and label 2?

Maybe this provides some insight:

http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

My take is:

  • The branch instruction (bra in PTX) is executed in parallel, like the comparison instruction. The bra then updates the instruction pointer of each thread in the warp for which the branch condition is met (possibly leaving non-uniform instruction pointers across the warp, i.e. a divergent warp). In your example, since 512 is a multiple of 32, no warp diverges, provided blockDim.x is also a multiple of 32 so that warp boundaries fall on multiples of 32. The branch is then uniform: the instruction pointers in a warp are either all updated or all left unchanged.

  • When the warp is scheduled for execution, it executes the instruction at one of the 32 instruction pointers. The results are discarded for every thread whose instruction pointer differs (this happens only for divergent branches, not in your case).

  • A per-warp stack is used to manage the reconvergence of divergent warps (each nested if needs one stack level); see the PDF linked above.
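
To check the "no divergent warp" claim for a particular launch configuration, here is a rough host-side sketch (plain C++, not device code). It assumes a 1-D launch and a warp size of 32, and the function name count_divergent_warps is made up for illustration. It only models which threads take the branch; it says nothing about cycle counts.

```cpp
// Host-side model: counts the warps in which the predicate `idx < limit`
// is true for some active threads and false for others.
// Assumptions: 1-D grid and block, warp size 32, hypothetical function name.
int count_divergent_warps(int grid_dim, int block_dim, int limit) {
    const int warp_size = 32;
    int divergent = 0;
    for (int block = 0; block < grid_dim; ++block) {
        // Warps are formed from consecutive threadIdx.x values within a block.
        for (int warp_start = 0; warp_start < block_dim; warp_start += warp_size) {
            int active = 0;  // threads that exist in this warp
            int taken = 0;   // threads for which idx < limit holds
            for (int lane = 0; lane < warp_size; ++lane) {
                int tid = warp_start + lane;
                if (tid >= block_dim) break;  // partially filled last warp
                int idx = block * block_dim + tid;
                ++active;
                if (idx < limit) ++taken;
            }
            // The warp diverges only if the predicate is mixed within it.
            if (taken != 0 && taken != active) ++divergent;
        }
    }
    return divergent;
}
```

With blockDim.x = 128 and limit 512, every warp starts on a multiple of 32, so the model reports no divergent warp. With blockDim.x = 100, the first warp of block 5 covers idx 500 to 531 and straddles 512, so that one warp diverges.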