I’ve done some tests with sqrt() and sqrtf(), and there didn’t seem to be any performance difference between them. With the -use_fast_math flag off, they were both the same speed, and with the -use_fast_math flag on, they both sped up by the same amount. What is the difference supposed to be between sqrt() and sqrtf(), and is it possible to specify in the code which one you want to use, regardless of the -use_fast_math flag?
I am also a little confused that the -use_fast_math flag is in the “Host” settings for the CUDA compiler. Isn’t this specifically controlling the code that is generated to run on the Device?
sqrt(double) is not affected by -use_fast_math, as this flag only applies to single-precision computation. Due to C++ function overloading, sqrt(float) and sqrtf(float) are one and the same function and thus are equally affected by -use_fast_math.
The motivation behind -use_fast_math was primarily that it provides a mode in which functional semantics and performance are similar to what programmers were used to from NVIDIA’s Cg shader language. In the initial phase of CUDA much thought was spent on how to ease the transition to CUDA, in order to develop “critical mass” among developers quickly.
I do not understand your comment about -use_fast_math being a “Host” setting. The flag affects device code, and to my knowledge there are no effects on host code. I checked the output of nvcc --help and found -use_fast_math in the following generic section that does not seem to be specific to either host or device:
Options for steering cuda compilation
=====================================
[...]
--use_fast_math (-use_fast_math)
Make use of fast math library. --use_fast_math implies --ftz=true --prec-div=false
--prec-sqrt=false --fmad=true.
In my code, there are some places where performance is critical but accuracy is not, so I would like to use a fast sqrt() there; in other places I would like the more accurate version, even though it is slower. In both cases, I am working with floats, not doubles. I guess I could cast the float to a double to force the more accurate version, but I’m assuming that sqrt(double) is another performance step slower than the non-fast-math version of sqrt(float), right?
I had thought that sqrtf() was a fast version of sqrt(), but if they are the same function, then I am mistaken. Is there a fast version of sqrt() that I can specifically call, regardless of the -use_fast_math flag?
As for the host setting, I’m just talking about the location in the visual studio cuda compiler settings. You have to go to Solution Properties, Configuration Properties, CUDA C/C++, Host, and the fast math setting is there. I would have expected it to be under Solution Properties, Configuration Properties, CUDA C/C++, Device. But you have confirmed for me that it only affects device code, so I know I am looking at the right setting, even though it is in a confusing place.
The sqrtf() vs. sqrt() overloading behavior comes straight from C++ and is not specific to CUDA; in fact, CUDA relies on the host’s math.h for the standard prototypes. I am not aware of any specific intrinsic that always provides an approximate square root. There is, however, an __fsqrt_rn() intrinsic that always provides an IEEE-rounded single-precision square root. So I see two possible approaches:
(1) Compile your code with -use_fast_math, and call the __fsqrt_rn() intrinsic wherever you need an accurate square root.
(2) Build your own fast single-precision square root (for example x*rsqrtf(x); note: this will not give the desired result for x=0). Compile the code with default settings, which provides the accurate square root by default, and call your own function wherever you want the fast approximate version.
I am ignorant of the Visual Studio IDE, a cause of much trouble for many programmers that I like to stay away from. I am a makefile / command-line kind of guy. You may want to file an enhancement request with NVIDIA if you think -use_fast_math is listed in a misleading section of the IDE.