Implementation of sqrt

Hello everyone,

According to the CUDA documentation, “Floating-point square root is implemented as a reciprocal square root followed by a reciprocal”.
This surprises me, since sqrt(x) is usually implemented as rsqrt(x)*x, not 1/rsqrt(x).
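
To make the comparison concrete, here is a minimal sketch of the two formulations I have in mind (these are illustrative device functions, not the actual library code):

    // rsqrt followed by a multiply (what I expected)
    __device__ float sqrt_via_mul(float x)   { return x * rsqrtf(x); }

    // rsqrt followed by a reciprocal (what the documentation describes)
    __device__ float sqrt_via_recip(float x) { return 1.0f / rsqrtf(x); }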

I first thought it was a mistake in the documentation, but analysis of the G80 code using Decuda (great tool by the way, thanks wumpus) reveals that it is actually implemented that way.

As a reciprocal is both more expensive and less accurate than a multiplication, I was wondering why it is done like this. The only reasons I could think of are:

  • if the parameter x of sqrt is not used afterwards, it saves a register;
  • in an application doing a lot of muls and adds that are independent of the call to sqrt, the hardware can overlap these with the reciprocal computation, making the reciprocal basically free.

However, in the few test cases I tried, rsqrt(x) * x was always faster than sqrt(x) by at least 1 cycle (and 16 cycles in some cases).

Does someone have another explanation?
Or are there real-world applications where 1/rsqrt(x) is faster?

Thanks,
Sylvain

You need to think like a graphics programmer. In graphics, you often need to normalize a 3-vector to a length of 1, even in a pixel shader. This operation is done so often that NVIDIA implements rsqrt as a native instruction, and probably saves transistors by not implementing a sqrt at the hardware level.
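
For instance, normalization needs only one reciprocal square root and a few multiplies. A minimal sketch of that pattern (illustrative code, not NVIDIA's implementation):

    // Normalize a 3-vector with one rsqrtf and three multiplies;
    // no sqrt and no divide are needed.
    __device__ float3 normalize3(float3 v)
    {
        float inv_len = rsqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        return make_float3(v.x * inv_len, v.y * inv_len, v.z * inv_len);
    }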

Edit: sorry, I read your question too quickly and didn’t really answer what you asked. I’m not sure why sqrt() was done that way. I’ll leave the original post up for anyone wondering why there isn’t a real sqrt() in the first place.

CUDA’s sqrtf() is implemented as 1.0f/rsqrtf(x) for correctness.
Consider sqrtf(0.0f):

1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f

Using the approach x*rsqrtf(x):

0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN

The same problem exists for sqrtf(infinity). Handling inputs of
zero and infinity separately would be slower than going through
the reciprocal, which gives the correct answer easily.
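
A small test harness (hypothetical, just to demonstrate the special-value behaviour described above) shows the difference for x = 0:

    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    __global__ void sqrt_variants(float x, float *out)
    {
        float r = rsqrtf(x);   // rsqrtf(0.0f) = +infinity
        out[0] = 1.0f / r;     // 1 / infinity = 0.0f  (correct sqrt(0))
        out[1] = x * r;        // 0 * infinity = NaN   (wrong)
    }

    int main()
    {
        float *d_out, h_out[2];
        cudaMalloc(&d_out, 2 * sizeof(float));
        sqrt_variants<<<1, 1>>>(0.0f, d_out);
        cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("1/rsqrtf(0) = %f\n", h_out[0]);                    // expected 0.000000
        printf("0*rsqrtf(0) = %f (isnan=%d)\n", h_out[1],
               std::isnan(h_out[1]) ? 1 : 0);                      // expected NaN
        cudaFree(d_out);
        return 0;
    }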

This makes sense.
Thank you for the explanation.

Actually, this solution might turn out to be faster than using a multiplication in many cases, if instruction scheduling in ptxas were improved.

It looks like the hardware can overlap special-function computations with ALU instructions, but not enough in the case of two successive dependent special instructions (as in sqrtf, __exp2f, __sinf, __cosf…). In my tests, at least 5 cycles/warp per sqrtf call are lost compared to an optimal instruction scheduling.

My guess is that interleaving ALU instructions with special instructions would improve the effective throughput. Doing this optimization after register allocation should not cause any negative side effects.
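
As a hypothetical illustration of what I mean (the two MADs below are independent of the sqrt chain, so in principle they could issue on the ALU pipe while the special-function unit works on the rsqrt and the reciprocal; whether ptxas actually orders the code this way is exactly the question):

    __global__ void overlap_example(const float *in, float *out, float a, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];
        float s = sqrtf(x);     // expands to rsqrt followed by a reciprocal
        float y = a * x + b;    // independent MAD, could overlap with the sqrt
        float z = a * y + b;    // another MAD, still independent of s
        out[i] = s + y + z;
    }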

Of course, I do not know how much of an improvement it would make in real code.