Conditional code and the instructions number threshold when replacing with predicated instructions

LowLevelKB · March 27, 2014, 1:54pm

Hello everybody.

I have a question for people familiar with how the multiprocessor executes divergent warps at the lowest, hardware level. In the NVIDIA CUDA Programming Guide we read:

I wonder where this value comes from. I've searched through a lot of internet resources, but so far I haven't found anything very useful. I would like you to tell me whether my conjectures are correct. Before I present them, let's consider the following snippet of PTX code computing the roots of a quadratic equation. For brevity, we'll restrict ourselves to the G80 architecture.

mul.f32 r0, b, b;
mul.f32 r1, a, c;
mad.f32 r0, r1, -4.0, r0;
setp.lt.f32 p0, r0, 0.0;
@p0 bra label0;

neg.f32 r1, b;
rsqrt.approx.f32 r0, r0;
rcp.approx.f32 r0, r0;
rcp.approx.f32 r2, a;
mul.f32 r2, r2, 0.5;
sub.f32 x1, r1, r0;
mul.f32 x1, x1, r2;
add.f32 x2, r1, r0;
mul.f32 x2, x2, r2;

label0:
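To make the data flow easier to follow, here is a rough host-side sketch in Python of what the snippet is meant to compute. The function name `quadratic_roots` and the variable names are mine, not part of the PTX, and the sketch assumes a != 0. It uses an exact square root where the PTX uses the approximate rsqrt/rcp pair, so results can differ in the last bits.

```python
import math

def quadratic_roots(a, b, c):
    """Mirror of the PTX snippet: returns (x1, x2), or None when delta < 0.

    Assumes a != 0 (the PTX does not check this either).
    """
    delta = b * b - 4.0 * (a * c)      # mul, mul, mad
    if delta < 0.0:                    # setp.lt + @p0 bra label0
        return None
    neg_b = -b                         # neg
    root = math.sqrt(delta)            # rsqrt.approx followed by rcp.approx in the PTX
    half_inv_a = (1.0 / a) * 0.5       # rcp, mul
    x1 = (neg_b - root) * half_inv_a   # sub, mul
    x2 = (neg_b + root) * half_inv_a   # add, mul
    return x1, x2

print(quadratic_roots(1.0, -3.0, 2.0))  # (1.0, 2.0)
print(quadratic_roots(1.0, 0.0, 1.0))   # None
```

The early return corresponds to the threads that take the branch to label0; everything after it corresponds to the nine instructions whose cost is discussed here.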

Now suppose that the discriminant (delta, held in r0) is >= 0 for exactly 8 threads of the warp and < 0 for the remaining 24. Here are my conjectures:

1. Because G80 has only 8 ALU cores and 2 SFUs per multiprocessor, and only 8 threads have delta >= 0, the instructions on the execution path between the bra instruction and label0 need 18 cycles to be issued by the warp dispatcher, plus some overhead for serializing the execution paths: 1 cycle per ALU instruction (8 threads on 8 ALUs) and 4 cycles per SFU instruction (8 threads on 2 SFUs). The predicated counterpart needs exactly 4 times more (i.e. 72 cycles), because the instructions have to be dispatched for all 32 threads of the warp (although, of course, 24 of them don't execute or write their results): 4 cycles per ALU instruction (1 cycle per 8 threads) and 16 cycles per SFU instruction (1 cycle per 2 threads).

neg.f32 r1, b; (ALU - 1 cycle)
rsqrt.approx.f32 r0, r0; (SFU - 4 cycles)
rcp.approx.f32 r0, r0; (SFU - 4 cycles)
rcp.approx.f32 r2, a; (SFU - 4 cycles)
mul.f32 r2, r2, 0.5; (ALU - 1 cycle)
sub.f32 x1, r1, r0; (ALU - 1 cycle)
mul.f32 x1, x1, r2; (ALU - 1 cycle)
add.f32 x2, r1, r0; (ALU - 1 cycle)
mul.f32 x2, x2, r2; (ALU - 1 cycle)
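The arithmetic behind the first conjecture can be checked with a small Python sketch. This only models the conjecture (issue cost proportional to the number of threads dispatched); it is not a description of how the hardware actually behaves. The unit counts are the ones stated above, and the names `LANES`, `PATH` and `issue_cycles` are made up for illustration.

```python
import math

# Unit counts as stated in the post for one G80 multiprocessor.
LANES = {"ALU": 8, "SFU": 2}
WARP_SIZE = 32

# Unit used by each instruction between the bra and label0, in program order.
PATH = ["ALU", "SFU", "SFU", "SFU", "ALU", "ALU", "ALU", "ALU", "ALU"]

def issue_cycles(path, threads):
    """Cycles to issue `path` if the cost scales with the threads dispatched."""
    return sum(math.ceil(threads / LANES[unit]) for unit in path)

branched = issue_cycles(PATH, 8)            # only the 8 threads with delta >= 0
predicated = issue_cycles(PATH, WARP_SIZE)  # all 32 threads are dispatched
print(branched, predicated)                 # 18 72
```

The 6 ALU instructions contribute 6 x 1 = 6 cycles and the 3 SFU instructions 3 x 4 = 12 cycles on the branched path, giving 18; for the full warp the same instructions cost 6 x 4 + 3 x 16 = 72.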

2. The warp dispatcher issues each ALU instruction of the execution path in 4 cycles and each SFU instruction in 16 cycles, no matter how many threads are actually active, and the overhead of the predicated counterpart arises from the fact that a predicated instruction is either translated into more instructions at the SASS level (2.1) or simply needs more cycles to be issued (2.2).
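The second conjecture can be sketched the same way. Here every instruction is issued for the whole warp regardless of how many threads are active, and the predicated version pays an extra cost on top. The parameters `extra_instructions` (variant 2.1) and `extra_cycles_per_instruction` (variant 2.2) are hypothetical knobs, and the values 2 and 1 used below are placeholders, not measurements.

```python
WARP_SIZE = 32
LANES = {"ALU": 8, "SFU": 2}   # unit counts as stated in the post
PATH = ["ALU", "SFU", "SFU", "SFU", "ALU", "ALU", "ALU", "ALU", "ALU"]

def full_warp_cycles(path, extra_instructions=0, extra_cycles_per_instruction=0):
    """Issue cycles when every instruction is issued for the whole warp.

    extra_instructions: hypothetical additional ALU instructions (2.1)
    extra_cycles_per_instruction: hypothetical additional issue cycles (2.2)
    """
    units = list(path) + ["ALU"] * extra_instructions
    return sum(WARP_SIZE // LANES[u] + extra_cycles_per_instruction for u in units)

print(full_warp_cycles(PATH))                                  # branched path: 72
print(full_warp_cycles(PATH, extra_instructions=2))            # variant 2.1: 80
print(full_warp_cycles(PATH, extra_cycles_per_instruction=1))  # variant 2.2: 81
```

Under this model the branched path never gets cheaper when few threads are active, so any threshold would have to come entirely from the extra cost of predication.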

Or maybe there is some other reason?

Thanks in advance.
