I’ve done some tests with sqrt() and sqrtf(), and there didn’t seem to be any performance difference between them. With the -use_fast_math flag off, they were both the same speed, and with the -use_fast_math flag on, they both sped up by the same amount. What is the difference supposed to be between sqrt() and sqrtf(), and is it possible to specify in the code which one you want to use, regardless of the -use_fast_math flag?
I am also a little confused that the -use_fast_math flag is in the “Host” settings for the CUDA compiler. Isn’t this specifically controlling the code that is generated to run on the Device?
sqrt(double) is not affected by -use_fast_math, as this flag only applies to single-precision computation. Due to C++ function overloading, sqrt(float) and sqrtf(float) are one and the same function and thus are equally affected by -use_fast_math.
The motivation behind -use_fast_math was primarily that it provides a mode in which functional semantics and performance are similar to what programmers were used to from NVIDIA’s Cg shader language. In the initial phase of CUDA much thought was spent on how to ease the transition to CUDA, in order to develop “critical mass” among developers quickly.
I do not understand your comment about -use_fast_math being a “Host” setting. The flag affects device code, and to my knowledge there are no effects on host code. I checked the output of nvcc --help and found -use_fast_math in the following generic section that does not seem to be specific to either host or device:
Options for steering cuda compilation
=====================================
[...]
--use_fast_math (-use_fast_math)
Make use of fast math library. --use_fast_math implies --ftz=true --prec-div=false
--prec-sqrt=false --fmad=true.
In my code, there are some places where performance is critical but accuracy is not, so I would like to use a fast sqrt() there; in other places I would like the more accurate version, even though it is slower. In both cases, I am working with floats, not doubles. I guess I could cast the float to a double to force the more accurate version, but I’m assuming that sqrt(double) is another performance step slower than the non-fast-math version of sqrt(float), right?
I had thought that sqrtf() was a fast version of sqrt(), but if they are the same function, then I am mistaken. Is there a fast version of sqrt() that I can specifically call, regardless of the -use_fast_math flag?
As for the host setting, I’m just talking about the location in the visual studio cuda compiler settings. You have to go to Solution Properties, Configuration Properties, CUDA C/C++, Host, and the fast math setting is there. I would have expected it to be under Solution Properties, Configuration Properties, CUDA C/C++, Device. But you have confirmed for me that it only affects device code, so I know I am looking at the right setting, even though it is in a confusing place.
The sqrtf() vs. sqrt() overloading behavior comes straight from C++ and is not specific to CUDA; in fact, CUDA relies on the host’s math.h for the standard prototypes. I am not aware of any specific intrinsic that always provides an approximate square root. There is, however, an __fsqrt_rn() intrinsic that always provides an IEEE-rounded single-precision square root. So I see two possible approaches:
(1) Compile your code with -use_fast_math, and call the __fsqrt_rn() intrinsic wherever you need an accurate square root.
(2) Build your own fast single-precision square root (for example x*rsqrtf(x); note: this will not give the desired result for x=0). Compile the code with default settings, which provides the accurate square root by default, and call your own function wherever you want the fast approximate version.
I am ignorant of the Visual Studio IDE, a cause of much trouble for many programmers that I like to stay away from. I am a makefile / command-line kind of guy. You may want to file an enhancement request with NVIDIA if you think -use_fast_math is listed in a misleading section of the IDE.