Conditional code and the instruction-count threshold for replacing branches with predicated instructions

Hello everybody.

I have a question for people familiar with how the multiprocessor executes divergent warps at the lowest, hardware level. In the NVIDIA CUDA Programming Guide we read:

I wonder where this threshold value comes from. I've searched a lot of internet resources, but so far I haven't found anything very useful. I would like you to tell me whether my conjectures are correct. Before I present them, let's consider the following snippet of PTX code that computes the roots of a quadratic equation. For brevity, we'll restrict ourselves to the G80 architecture.

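mul.f32 r0, b, b;            // r0 = b * b
mul.f32 r1, a, c;            // r1 = a * c
mad.f32 r0, r1, -4.0, r0;    // r0 = delta = b * b - 4 * a * c
setp.lt.f32 p0, r0, 0.0;     // p0 = (delta < 0)
@p0 bra label0;              // no real roots: skip the computation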

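neg.f32 r1, b;               // r1 = -b
rsqrt.approx.f32 r0, r0;     // r0 = 1 / sqrt(delta)
rcp.approx.f32 r0, r0;       // r0 = sqrt(delta)
rcp.approx.f32 r2, a;        // r2 = 1 / a
mul.f32 r2, r2, 0.5;         // r2 = 1 / (2 * a)
sub.f32 x1, r1, r0;          // x1 = -b - sqrt(delta)
mul.f32 x1, x1, r2;          // x1 = (-b - sqrt(delta)) / (2 * a)
add.f32 x2, r1, r0;          // x2 = -b + sqrt(delta)
mul.f32 x2, x2, r2;          // x2 = (-b + sqrt(delta)) / (2 * a)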

label0:
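
For reference, this is roughly what the snippet is meant to compute per thread, written as a plain Python sketch. The function name `roots` is made up for illustration, and `math.sqrt` stands in for the rsqrt/rcp pair, so the results are not bit-identical to what the approximate hardware instructions would give.

```python
import math

def roots(a, b, c):
    """Per-thread equivalent of the PTX snippet (hypothetical helper)."""
    delta = b * b - 4.0 * a * c               # mul, mul, mad
    if delta < 0.0:                           # setp.lt + @p0 bra label0
        return None                           # branch taken: x1, x2 are not written
    neg_b = -b                                # neg
    sqrt_delta = math.sqrt(delta)             # rsqrt followed by rcp
    half_inv_a = (1.0 / a) * 0.5              # rcp, mul
    x1 = (neg_b - sqrt_delta) * half_inv_a    # sub, mul
    x2 = (neg_b + sqrt_delta) * half_inv_a    # add, mul
    return x1, x2

print(roots(1.0, -3.0, 2.0))  # (1.0, 2.0)
```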

Now, suppose that delta >= 0 for exactly 8 threads of the warp and delta < 0 for the remaining 24. Here are my conjectures:

  • Because G80 has only 8 ALU cores and 2 SFUs per multiprocessor, and only 8 threads have delta >= 0, the instructions on the execution path between the bra instruction and label0 need 18 cycles to be issued by the warp dispatcher, plus some overhead from serializing the execution paths. The predicated counterpart needs exactly 4 times more (i.e. 72 cycles), because the instructions have to be dispatched for all 32 threads of the warp (although, of course, 24 of them neither execute nor write their results): 4 cycles for one ALU instruction (1 cycle per 8 threads) and 16 cycles for one SFU instruction (1 cycle per 2 threads). For the 8 active threads on the branched path, the count is:
    neg.f32 r1, b; (ALU - 1 cycle)
    rsqrt.approx.f32 r0, r0; (SFU - 4 cycles)
    rcp.approx.f32 r0, r0; (SFU - 4 cycles)
    rcp.approx.f32 r2, a; (SFU - 4 cycles)
    mul.f32 r2, r2, 0.5; (ALU - 1 cycle)
    sub.f32 x1, r1, r0; (ALU - 1 cycle)
    mul.f32 x1, x1, r2; (ALU - 1 cycle)
    add.f32 x2, r1, r0; (ALU - 1 cycle)
    mul.f32 x2, x2, r2; (ALU - 1 cycle)
    
  • The warp dispatcher issues each ALU instruction of the execution path in 4 cycles and each SFU instruction in 16 cycles, no matter how many threads are actually active, and the overhead of the predicated counterpart arises from the fact that a predicated instruction is either translated into more instructions at the SASS level (2.1) or simply needs more cycles to be issued (2.2).
  • Or maybe there is some other reason?
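
To make the arithmetic behind the first two conjectures concrete, here is a toy model in Python. It is only a sketch of my assumptions (8 ALU lanes, 2 SFU lanes, issue cost counted in whole cycles), not a description of how the hardware really schedules anything. The names `issue_cycles` and `BODY` are made up, and branch overhead and any extra SASS instructions are ignored.

```python
import math

WARP_SIZE = 32
ALU_LANES = 8   # ALU cores per G80 multiprocessor, as assumed above
SFU_LANES = 2   # SFUs per multiprocessor, as assumed above

# Unit used by each of the 9 instructions between the bra and label0.
BODY = ["ALU", "SFU", "SFU", "SFU", "ALU", "ALU", "ALU", "ALU", "ALU"]

def issue_cycles(threads, body=BODY):
    """Cycles to issue `body` when the cost scales with the threads issued."""
    lanes = {"ALU": ALU_LANES, "SFU": SFU_LANES}
    return sum(math.ceil(threads / lanes[unit]) for unit in body)

active = 8  # threads with delta >= 0

# Conjecture 1: the branched path is issued only for the active threads,
# the predicated path for the whole warp.
conj1_branch = issue_cycles(active)
conj1_predicated = issue_cycles(WARP_SIZE)

# Conjecture 2: every instruction is issued for the whole warp either way,
# so this model shows no difference and the overhead must come from elsewhere.
conj2_either = issue_cycles(WARP_SIZE)

print(conj1_branch, conj1_predicated, conj2_either)  # 18 72 72
```

Under the first conjecture the model reproduces the 18 versus 72 cycles; under the second, both variants cost 72 cycles here, and the difference would have to come from (2.1) or (2.2).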

    Thanks in advance.