Hello everybody.
I have a question for people familiar with how the multiprocessor executes divergent warps at the lowest, hardware level. In the NVIDIA CUDA Programming Guide we read:
I wonder where this value comes from. I have searched a lot of internet resources, but so far I haven't found anything very useful. I would like you to tell me whether my conjectures are correct. Before I present them, let's consider the following snippet of PTX code that computes the roots of a quadratic equation. For brevity, we'll restrict ourselves to the G80 architecture.
mul.f32 r0, b, b;
mul.f32 r1, a, c;
mad.f32 r0, r1, -4.0, r0;
setp.lt.f32 p0, r0, 0.0;
@p0 bra label0;
neg.f32 r1, b;
rsqrt.approx.f32 r0, r0;
rcp.approx.f32 r0, r0;
rcp.approx.f32 r2, a;
mul.f32 r2, r2, 0.5;
sub.f32 x1, r1, r0;
mul.f32 x1, x1, r2;
add.f32 x2, r1, r0;
mul.f32 x2, x2, r2;
label0:
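For readers less used to PTX, here is a rough scalar C++ mirror of what one thread does in the snippet above. It is only a sketch: the function name solve_quadratic and its signature are made up for illustration, it assumes the last two multiplies are meant to scale x1 and x2 by 1/(2a), and it uses exact host arithmetic where the GPU uses the .approx instructions.

```cpp
#include <cmath>

// Hypothetical per-thread mirror of the PTX snippet (names are illustrative).
// Returns false when the thread would branch to label0 (delta < 0);
// in that case x1 and x2 are left untouched, as in the PTX.
bool solve_quadratic(float a, float b, float c, float& x1, float& x2) {
    float r0 = b * b;              // mul.f32 r0, b, b
    float r1 = a * c;              // mul.f32 r1, a, c
    r0 = r1 * -4.0f + r0;          // mad.f32 r0, r1, -4.0, r0   (delta = b*b - 4*a*c)
    if (r0 < 0.0f) {               // setp.lt.f32 p0, r0, 0.0  /  @p0 bra label0
        return false;
    }
    r1 = -b;                       // neg.f32 r1, b
    r0 = 1.0f / std::sqrt(r0);     // rsqrt.approx.f32 (exact on the host, approximate on the GPU)
    r0 = 1.0f / r0;                // rcp.approx.f32 -> sqrt(delta)
    float r2 = 1.0f / a;           // rcp.approx.f32 r2, a
    r2 = r2 * 0.5f;                // mul.f32 -> 1 / (2a)
    x1 = (r1 - r0) * r2;           // sub.f32 + mul.f32
    x2 = (r1 + r0) * r2;           // add.f32 + mul.f32
    return true;
}
```

For example, a = 1, b = -3, c = 2 gives delta = 1, so the branch is not taken and the roots come out as x1 = 1 and x2 = 2. With a = 1, b = 0, c = 1 the delta is negative and the thread skips straight to label0.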
Now suppose that delta >= 0 for exactly 8 threads of the warp and delta < 0 for the remaining 24. Here are my conjectures:
neg.f32 r1, b; (ALU - 1 cycle)
rsqrt.approx.f32 r0, r0; (SFU - 4 cycles)
rcp.approx.f32 r0, r0; (SFU - 4 cycles)
rcp.approx.f32 r2, a; (SFU - 4 cycles)
mul.f32 r2, r2, 0.5; (ALU - 1 cycle)
sub.f32 x1, r1, r0; (ALU - 1 cycle)
mul.f32 x1, x1, r2; (ALU - 1 cycle)
add.f32 x2, r1, r0; (ALU - 1 cycle)
mul.f32 x2, x2, r2; (ALU - 1 cycle)
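To make the arithmetic behind the list above explicit, here is a small C++ tally. It is a sketch under my own assumptions, not a statement about the real hardware: the 1-cycle ALU and 4-cycle SFU figures are just the conjectured values from the list, and the model assumes the warp issues every instruction of the non-branching path as long as at least one of its threads is active, with the other threads masked off. All names (Unit, conjectured_cycles, path_cycles, warp_path_cycles) are made up for illustration.

```cpp
#include <array>

enum class Unit { ALU, SFU };

// Conjectured cost per instruction, taken from the list above.
// These are assumptions of the post, not measured G80 numbers.
inline int conjectured_cycles(Unit u) {
    return u == Unit::ALU ? 1 : 4;
}

// Execution unit of each of the nine instructions between the branch and label0.
const std::array<Unit, 9> kPath = {
    Unit::ALU,  // neg.f32
    Unit::SFU,  // rsqrt.approx.f32
    Unit::SFU,  // rcp.approx.f32
    Unit::SFU,  // rcp.approx.f32
    Unit::ALU,  // mul.f32
    Unit::ALU,  // sub.f32
    Unit::ALU,  // mul.f32
    Unit::ALU,  // add.f32
    Unit::ALU,  // mul.f32
};

// Sum of the conjectured costs along the path: 6 * 1 + 3 * 4.
inline int path_cycles() {
    int total = 0;
    for (Unit u : kPath) {
        total += conjectured_cycles(u);
    }
    return total;
}

// Assumed model: the cost of the path does not depend on how many threads
// are active, only on whether any thread is active at all.
inline int warp_path_cycles(int active_threads) {
    return active_threads > 0 ? path_cycles() : 0;
}
```

Under these assumptions the path costs 18 cycles, and warp_path_cycles(8) equals warp_path_cycles(32): the 8 threads with delta >= 0 would cost the warp as much as a fully active warp, while a warp in which every thread has delta < 0 would skip the path entirely.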
Or is there some other reason?
Thanks in advance.