Implementation of sqrt

Hello everyone,

According to the CUDA documentation, “Floating-point square root is implemented as a reciprocal square root followed by a reciprocal”.
This surprises me, since sqrt(x) is usually implemented as rsqrt(x)*x, not 1/rsqrt(x).
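
To make the comparison concrete, here is a minimal sketch of the two formulations I have in mind (these are illustrative device functions, not the actual library code):

    // rsqrt followed by a multiply (what I expected)
    __device__ float sqrt_via_mul(float x)   { return x * rsqrtf(x); }

    // rsqrt followed by a reciprocal (what the documentation describes)
    __device__ float sqrt_via_recip(float x) { return 1.0f / rsqrtf(x); }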

I first thought it was a mistake in the documentation, but analysis of the G80 code using Decuda (great tool by the way, thanks wumpus) reveals that it is actually implemented that way.

As a reciprocal is both more expensive and less accurate than a multiplication, I was wondering why it is done like this. The only reasons I could think of are:

  • if the parameter x of sqrt is not used afterwards, it saves a register;
  • in an application doing a lot of muls and adds that are independent of the call to sqrt, the hardware can overlap these with the reciprocal computation, making the reciprocal basically free.

However, in the few test cases I tried, rsqrt(x) * x was always faster than sqrt(x) by at least 1 cycle (and 16 cycles in some cases).

Does someone have another explanation?
Or are there real-world applications where 1/rsqrt(x) is faster?

Thanks,
Sylvain

You need to think like a graphics programmer. In graphics, you often need to normalize a 3-vector to a length of 1, even in a pixel shader. This operation is done so often that NVIDIA implements rsqrt as a native instruction, and probably saves transistors by not implementing a sqrt at the hardware level.
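
For instance, normalization needs only one reciprocal square root and a few multiplies. A minimal sketch of that pattern (illustrative code, not NVIDIA's implementation):

    // Normalize a 3-vector with one rsqrtf and three multiplies;
    // no sqrt and no divide are needed.
    __device__ float3 normalize3(float3 v)
    {
        float inv_len = rsqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        return make_float3(v.x * inv_len, v.y * inv_len, v.z * inv_len);
    }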

Edit: sorry, I read your question too quickly and didn’t really answer what you asked. I’m not sure why sqrt() was done that way. I’ll leave the original post up for anyone wondering why there isn’t a real sqrt() in the first place.

CUDA’s sqrtf() is implemented as 1.0f/rsqrtf(x) for correctness.
Consider sqrtf(0.0f):

1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f

Using the approach x*rsqrtf(x):

0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN

The same problem exists for sqrtf(infinity). Handling inputs of
zero and infinity separately would be slower than going through
the reciprocal, which gives the correct answer easily.
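
A small test harness (hypothetical, just to demonstrate the special-value behaviour described above) shows the difference for x = 0:

    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    __global__ void sqrt_variants(float x, float *out)
    {
        float r = rsqrtf(x);   // rsqrtf(0.0f) = +infinity
        out[0] = 1.0f / r;     // 1 / infinity = 0.0f  (correct sqrt(0))
        out[1] = x * r;        // 0 * infinity = NaN   (wrong)
    }

    int main()
    {
        float *d_out, h_out[2];
        cudaMalloc(&d_out, 2 * sizeof(float));
        sqrt_variants<<<1, 1>>>(0.0f, d_out);
        cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("1/rsqrtf(0) = %f\n", h_out[0]);                    // expected 0.000000
        printf("0*rsqrtf(0) = %f (isnan=%d)\n", h_out[1],
               std::isnan(h_out[1]) ? 1 : 0);                      // expected NaN
        cudaFree(d_out);
        return 0;
    }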

This makes sense.
Thank you for the explanation.

Actually, this solution might turn out to be faster than using a multiplication in many cases, if instruction scheduling in ptxas were improved.

It looks like the hardware can overlap special-function computations with ALU instructions, but not enough in the case of two successive dependent special instructions (as in sqrtf, __exp2f, __sinf, __cosf…). In my tests, at least 5 cycles/warp per sqrtf call are lost compared to an optimal instruction scheduling.

My guess is that interleaving ALU instructions with special instructions would improve the effective throughput. Doing this optimization after register allocation should not cause any negative side effects.
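
As a hypothetical illustration of what I mean (the two MADs below are independent of the sqrt chain, so in principle they could issue on the ALU pipe while the special-function unit works on the rsqrt and the reciprocal; whether ptxas actually orders the code this way is exactly the question):

    __global__ void overlap_example(const float *in, float *out, float a, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];
        float s = sqrtf(x);     // expands to rsqrt followed by a reciprocal
        float y = a * x + b;    // independent MAD, could overlap with the sqrt
        float z = a * y + b;    // another MAD, still independent of s
        out[i] = s + y + z;
    }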

Of course, I do not know how much of an improvement it would make in real code.