According to the CUDA documentation, "Floating-point square root is implemented as a reciprocal square root followed by a reciprocal."
This surprises me, since sqrt(x) is usually implemented as rsqrt(x)*x, not 1/rsqrt(x).
I first thought this was a mistake in the documentation, but disassembling G80 code with Decuda (great tool, by the way; thanks, wumpus) shows that it really is implemented that way.
Since a reciprocal is both more expensive and less accurate than a multiplication, I was wondering why it is done this way. The only reasons I can think of are:
- if the argument x of sqrt is not used afterwards, it saves a register;
- in an application doing a lot of muls and adds that are independent of the sqrt call, the hardware can overlap them with the reciprocal computation, making the reciprocal essentially free.
However, in the few test cases I tried, rsqrt(x) * x was always at least 1 cycle faster than sqrt(x) (and up to 16 cycles faster in some cases).
Does anyone have another explanation?
Or are there real-world applications where 1/rsqrt(x) is faster?