I’m doing some performance testing to evaluate different functions in CUDA, and I have come across the functions that calculate the square root: the standard ‘sqrtf’ and the intrinsic ‘__fsqrt_rn’.
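The original code didn’t survive here; a minimal sketch of the kind of kernels being compared might look like the following (kernel names and launch setup are my own, not the original poster’s):

```
// Hypothetical benchmark kernels, for illustration only.
__global__ void sqrt_library(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // sqrtf() may compile to an approximate or an IEEE-rounded
        // implementation, depending on compiler flags
        out[i] = sqrtf(in[i]);
    }
}

__global__ void sqrt_intrinsic(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __fsqrt_rn() always produces the IEEE-754
        // round-to-nearest-or-even result
        out[i] = __fsqrt_rn(in[i]);
    }
}
```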

The second is approximately three times slower. Is the only difference numerical accuracy, or am I misreading the CUDA C Programming Guide?

I ran the tests on a GTX 480 using CUDA Toolkit 4.0.

sqrtf() is a single-precision square root function that can map either to an approximate square root implementation, or to one that rounds to nearest-or-even according to the IEEE-754 standard.

On sm_1x devices, sqrtf() always maps to the approximate square root implementation. On sm_2x and sm_3x devices the mapping is controlled by the compiler flag -prec-sqrt={true|false}. The default setting is “true”. When -prec-sqrt=false is specified, sqrtf() maps to the approximate square root implementation, with -prec-sqrt=true it maps to the IEEE-rounded one. -use_fast_math implies -prec-sqrt=false.
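To illustrate, the same sqrtf() call site can compile to either implementation depending only on the build flags. A sketch, assuming a source file named kernel.cu:

```
// kernel.cu -- one source, two possible translations of sqrtf() on sm_2x:
//
//   nvcc -arch=sm_20 -prec-sqrt=true  kernel.cu   // IEEE-rounded sqrtf() (the default)
//   nvcc -arch=sm_20 -prec-sqrt=false kernel.cu   // approximate sqrtf()
//   nvcc -arch=sm_20 -use_fast_math   kernel.cu   // implies -prec-sqrt=false
__global__ void apply_sqrt(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = sqrtf(data[i]);  // mapping is chosen at compile time
}
```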

__fsqrt_rn() always maps to an implementation that rounds to nearest-or-even according to the IEEE-754 standard. It is quite slow on sm_1x devices, since that hardware does not support the single-precision FMA (fused multiply-add) operation which is crucial to high-performance implementations of correctly rounded square root.
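This also means the intrinsic can be used to pin down the rounding behavior at a particular call site even when the rest of the code is built with -use_fast_math. A sketch of mixing the two in one kernel:

```
// sqrtf() follows -prec-sqrt; the intrinsic is unaffected by compiler flags.
__global__ void mixed_sqrt(const float *in, float *fast, float *exact, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]  = sqrtf(in[i]);       // approximate under -prec-sqrt=false
        exact[i] = __fsqrt_rn(in[i]);  // always IEEE-754 round-to-nearest
    }
}
```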

Even on sm_2x and sm_3x devices, significant performance differences between the approximate and IEEE-rounded versions can be observed; this is simply a consequence of the additional work necessary to guarantee a standard-compliant result. Over successive generations of CUDA, a lot of work has gone into providing optimized implementations of such correctly rounded mathematical primitives.