Using fast_math used to be much faster on arch 2.0 and 3.0 but is now even slower on arch 3.5 and up !

Hi folks,

Strange problem :

I have a (single precision) Nbody code which is about 3 times faster when I run it on arch 2.0 or 3.0 and compile with --use_fast_math. However, on arch 3.5 and 5.0 it is actually SLOWER than without --use_fast_math!!

I really don’t understand what the reason is for this. It seems to be slower on all 3.5 and 5.0 cards : I used a small K620 and a very big Titan X and both have the same problem.

Does anybody have any suggestions? What changes in arch 3.5 or 5.0 could be responsible for this, and how can I get the same performance improvement on arch 3.5 and up as I had on 3.0?

Best regards,
Kees Lemmens, Delft.

Extremely unusual (meaning I have never come across a case like this in 10+ years of CUDA programming), so I am not even going to speculate. Seems impossible to diagnose without access to a minimal, complete, and verifiable example that reproduces the issue: http://stackoverflow.com/help/mcve

Given that it is an Nbody-type application, I assume there is no iterative solver involved whose convergence depends on the accuracy of intermediate results (less accurate results with --use_fast_math → more iterations → longer execution time).

Hi Njuffa,

It is a simple O(N^2) method that I have been using for years in my Cuda courses.
Here is the inner loop: each particle is assigned to a thread and computes the force contributions of the other particles in the same block, whose particle data is kept in shared memory.
Computations are all in single precision.

typedef float real_t;
#define LAMBDA (real_t)6.6730e-11 /* gravitational constant */

real_t ac,r,r2;

{
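/* d, r2 and massm[c] refer to the current particle c in the enclosing loop
   (not shown): d holds the component distances, r2 the squared distance,
   and 'at' accumulates the contributions for this thread's particle */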
r = sqrtf(r2);
#ifdef USEFASTMATH
ac = __fdividef(LAMBDA * massm[c], r2);
at.x += __fdividef(ac * d.x , r);
at.y += __fdividef(ac * d.y , r);
#else
ac = LAMBDA * massm[c] / r2;
at.x += ac * d.x / r;
at.y += ac * d.y / r;
#endif
}

On all Cuda devices so far (I started about 8 years ago so you are 2+ years ahead :-) ) using --use_fast_math would give a factor 2-3 improvement on the standard code (so without USEFASTMATH).

However, we recently replaced the cards in the lab room with K620s, and now fast math is suddenly slightly slower! The same happens on our Titan X system (all arch 5.0).

I changed the code to the part above (using the USEFASTMATH macro) and that gave a slight improvement compared to standard code but nowhere near the factor 2-3 we had before.

I also tried explicitly “-prec-div=false -ftz=false -prec-sqrt=false --fmad=true” in the Makefile, but to no avail. The arch 5.0 cards aren’t faster with fast math than without, while the arch 2.0 and 3.0 cards are still 2-3 times faster.

Compiler output:
nvcc -DDEBUG=1 -Xcompiler “-Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function” -D_DEBUG -G -I. -I/opt64/cuda_6.5.14/include -I/opt64/cuda_6.5.14/samples/common/inc -DUNIX -D_GLIBCXX_GCC_GTHR_POSIX_H -g -Xptxas -v -O -prec-div=false -ftz=false -prec-sqrt=false --fmad=true -DUSEFASTMATH=1 -arch=sm_20 -o parbody_card-psm1.o -dc parbody_card-psm1.cu
ptxas info : 128 bytes gmem, 48 bytes cmem[14]
ptxas info : Function properties for _Z8ldebug_diPKci
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function ‘_Z13parbodyKernelP8_paramsdP3CrdS2_PfPc’ for ‘sm_20’
ptxas info : Function properties for _Z13parbodyKernelP8_paramsdP3CrdS2_PfPc
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 16016 bytes smem, 72 bytes cmem[0]
nvcc -o parbody-psm1-fastmath parbody_card-psm1.o parbody_host-pl.o -L/opt64/cuda_6.5.14/lib64 -L/opt64/cuda_6.5.14/lib -L/opt64/cuda_6.5.14/samples/lib -L/opt64/cuda_6.5.14/samples/common/lib -lcudart -lGL -lGLU -lglut -L. -ljgraphlib_x86_64 -arch=sm_20

K600 result on a small problem of 1024 particles :

~/Parallel/Cuda/Nbody-GPU 244 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 2.995042 secs for 400 iterations.

~/Parallel/Cuda/Nbody-GPU 245 % ./parbody-psm1-nofastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 9.913860 secs for 400 iterations.

K620 result on a small problem of 1024 particles :

~/Parallel/Cuda/Nbody-GPU 247 % ./parbody-psm1-nofastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 3.370911 secs for 400 iterations.

prz03:~/Parallel/Cuda/Nbody-GPU 248 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 2.746861 secs for 400 iterations.

============================================================

What is remarkable is that if I do NOT use -DUSEFASTMATH=1 (but still compile with --use_fast_math) it is even slower, while one would expect it to make no difference, as the --use_fast_math flag should make exactly the same replacements as the -DUSEFASTMATH=1 macro does.

~/Parallel/Cuda/Nbody-GPU 251 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 3.051782 secs for 400 iterations.

BTW: I also recompiled for sm_35 and sm_50, but that doesn’t change the fastmath/nofastmath ratio (although the code seems to be slightly faster in general).

Hope this gives you a little more info.

Best,
Kees

Note: Don’t compile with -arch=sm_20 when you intend to run on sm_35 or sm_50 devices. That imposes unnecessary restrictions on the compiler, and it exposes you to the possibility of JIT compilation overhead at application run-time. Instead, specify all particular architectures required to run on all relevant target devices (i.e. build a fat binary). See CUDA documentation.
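
For example, a single fat binary covering the cards mentioned in this thread could be built along these lines (adjust to your CUDA version and targets; the last -gencode entry also embeds PTX so future devices can JIT-compile it):

nvcc ... -gencode arch=compute_20,code=sm_20 \
         -gencode arch=compute_35,code=sm_35 \
         -gencode arch=compute_50,code=sm_50 \
         -gencode arch=compute_50,code=compute_50 ...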

Without code that I can compile (to look at the generated machine code) and run (to check with a profiler where time is spent), I am afraid I am unable to assist, and the additional information doesn’t help as long as I cannot reproduce your observations (I have an sm_50 platform with CUDA 7.5 here, so with buildable code, I should be able to do that). For all I know, the bottleneck may not be where you think it is.

Code generated for the device function intrinsics used in fast math is very unlikely to have changed. Non-fast-math code paths (such as IEEE-compliant division) likely benefit from additional code optimizations introduced over the years, although I can’t readily see a scenario where this would result in a 3x performance increase. Also, the device-independent LLVM-derived front-end of the CUDA compiler incorporates many clever code transformations, and there could be a scenario where it can optimize through well-known standardized IEEE math operations (e.g. when doing constant propagation) but cannot do so for device-specific intrinsics it knows nothing about.

The above is all wild speculation, of course, and I have no idea how it would combine to result in a 3x performance difference.

Unless there were recent changes, --use_fast_math does precisely three things: (1) set -ftz=true, (2) set -prec-div=false -prec-sqrt=false, (3) replace certain math operations with their device intrinsic counterparts (see the list in the CUDA documentation). You can inspect the intermediate PTX representation and also the SASS machine code with cuobjdump [--dump-ptx | --dump-sass] to see what happens to your code when it is compiled.
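
For example, using the object file name from the build log above, something along these lines should dump the relevant PTX and SASS:

cuobjdump --dump-ptx parbody_card-psm1.o
cuobjdump --dump-sass parbody_card-psm1.o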

For what it’s worth, I added enough scaffolding code to your snippet to make it compile, and looked at the generated code when compiling with CUDA 7.5 for sm_50.

Both PTX and SASS look like I would expect them to. Use of -use_fast_math gives additional speedup over #define USEFASTMATH 1 because it replaces the IEEE 754-compliant, properly-rounded standard math function sqrtf() with the approximate device intrinsic.
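
For reference, a minimal sketch of the kind of scaffolding needed to make the snippet compile (kernel layout, names, and the softening term are placeholders, not the exact test code; the shared-memory staging from the real code is omitted):

typedef float real_t;
#define LAMBDA ((real_t)6.6730e-11)   /* gravitational constant */

/* placeholder wrapper: just enough context to compile and inspect the inner loop */
__global__ void forceSnippet(const real_t * __restrict__ massm,
                             const float2 * __restrict__ pos,
                             float2 * __restrict__ acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 at = make_float2(0.0f, 0.0f);
    for (int c = 0; c < n; c++) {
        float2 d  = make_float2(pos[c].x - pos[i].x, pos[c].y - pos[i].y);
        real_t r2 = d.x * d.x + d.y * d.y + (real_t)1.0e-9;  /* softening to avoid r2 == 0 */
        real_t r  = sqrtf(r2);
#ifdef USEFASTMATH
        real_t ac = __fdividef(LAMBDA * massm[c], r2);
        at.x += __fdividef(ac * d.x, r);
        at.y += __fdividef(ac * d.y, r);
#else
        real_t ac = LAMBDA * massm[c] / r2;
        at.x += ac * d.x / r;
        at.y += ac * d.y / r;
#endif
    }
    acc[i] = at;
}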

It is not a good idea to use debug builds for any kind of performance measurements, as the results will generally be meaningless. I notice belatedly that the nvcc invocation shown above indicates a debug build, because it contains

-D_DEBUG -G

Your remarks about debugging, sm_20 and fat binaries are quite correct, but they are not very relevant for this particular problem. For 5 years the code was 3 times faster with use_fast_math than without, and now on arch 5 it isn’t. That has nothing to do with the debugging and/or sm_20, sm_35 or sm_50 (FYI I tried them all: same behaviour).

And the reason that I compile for sm_20 is that the lab room isn’t homogeneous and the oldest cards only support sm_20. I do use fat binaries for other examples, but didn’t consider it useful here as I do not use any new tricks like dynamic parallelism or concurrent kernels.

To put it simply: initially I didn’t change the code whatsoever and had a 3-fold speedup, and now it is gone.

Another mystery: manually introducing __fdividef now makes it slightly faster (20%) than before, but that SHOULDN’T make any difference, as --use_fast_math would introduce these optimisations anyway! It is almost as if --use_fast_math is no longer fully applied. I’ll try to look at the PTX code, and maybe I’ll also make a simplified example to see if the problem persists there as well.

BTW: I now compile relocatable code using “-dc”, as this is required for the later dynamic parallelism examples. Could this have any impact on the fast math optimisations?

The flags you showed above indicate a debug build. Performance measurements for debug builds can be all over the map; such comparisons are simply meaningless. For example, debug builds could easily show lower performance when moving to newer CUDA versions, as the generated code may contain additional modifications to improve “debuggability”.

I would suggest the following: build the code in release mode (without -D_DEBUG -G), for sm_35 or sm_50 as appropriate for the target GPU, and then compare the execution times of the three builds below (example command lines are sketched right after the list). Unless your actual code is memory bound, you should see the following (which is what I see on my sm_50 device with my scaffolding):

(1) Build with #undef USEFASTMATH (slowest)
(2) Build with #define USEFASTMATH (faster)
(3) Build with #define USEFASTMATH and nvcc command line flag -use_fast_math (fastest)
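
For instance, the three builds could look roughly like this (output names are placeholders, and the ‘…’ stands for the remaining include/link options from your Makefile):

nvcc -arch=sm_50 ... -o parbody-case1 parbody_card-psm1.cu ...                                # case (1)
nvcc -arch=sm_50 -DUSEFASTMATH=1 ... -o parbody-case2 parbody_card-psm1.cu ...                # case (2)
nvcc -arch=sm_50 -DUSEFASTMATH=1 -use_fast_math ... -o parbody-case3 parbody_card-psm1.cu ... # case (3)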

Use of separate compilation / relocatable device code usually leads to lower performance, as it necessarily disables some optimizations applied in whole-program compilation, such as function inlining (as of yet, there is no optimizing linker in the CUDA toolchain). This may in turn have a negative impact on other optimizations, such as constant propagation through function calls, early load scheduling, etc. Other than constant propagation, I can’t think of any such optimization that directly impacts math operations or functions.

Thanks for your reply : I’ll build a code without debugging and see if it changes.

But the question about your point (3) still remains: why would I need (my) USEFASTMATH macro if I already compile with --use_fast_math? It shouldn’t make any difference IMO, as all divisions should already be replaced with __fdividef anyway?

I didn’t look at that case:

(4) Build with #undef USEFASTMATH and nvcc command line flag -use_fast_math

I would think that when you try that in release builds, there will be no significant (> 2% noise level) performance difference between case (3) and case (4). If you find otherwise, a look at the SASS (which is what actually runs on the machine) should tell us what the difference is in terms of code generation.

[Later:] Come to think of it, there is a relevant difference between #define USEFASTMATH and use of -use_fast_math with respect to the divisions: The latter turns on flush-to-zero mode, the former does not. FTZ mode allows further simplification of the division code, so -use_fast_math should give a small performance benefit relative to the use of #define USEFASTMATH.
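
In the PTX dump, I would expect that difference to show up roughly like this (register numbers are just placeholders):

div.rn.f32          %f3, %f1, %f2;   // plain '/' with -prec-div=true: IEEE-rounded, expands to a longer SASS sequence
div.approx.f32      %f3, %f1, %f2;   // __fdividef() without FTZ (the USEFASTMATH macro alone)
div.approx.ftz.f32  %f3, %f1, %f2;   // __fdividef() with FTZ, as implied by -use_fast_math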

Orthogonal to the discussion of -use_fast_math and sm-level compile flags, the code itself can likely be made more efficient by eliminating the three expensive divisions and folding the normalization into the ac multiplier, assuming ac is not used later in the code.

{
  float oneOverR = rsqrtf(r2);                                       /* 1/r */
  float norm = LAMBDA * oneOverR * oneOverR * oneOverR * massm[c];   /* LAMBDA * m / r^3 = ac/r */
  at.x += norm * d.x;
  at.y += norm * d.y;
}

I also like putting array accesses like massm[c] at the end of multiplication or FMA chains, just as a precaution in case the computation has to wait on the memory load. C++ compilers (including CUDA’s) are not allowed (by default) to reorder floating-point evaluations, since rounding can make the result dependent on evaluation order. Putting constant and register arguments at the beginning of the chain lets the compiler potentially evaluate those multiplies while the wait for the massm[c] memory fetch is pending. This won’t be a saving in many GPU cases, where you have other warps to execute during the stall anyway, but it will never hurt. It’s just a good coding reflex, especially for CPU code, to think about which parts of an evaluation might stall and to arrange work for the GPU/CPU to do while waiting.
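
A minimal illustration of that ordering point, reusing the names from the snippet above (purely a sketch of the dependency chains, not a measured difference):

/* array load last: the multiplies on constants/registers can issue
   while the massm[c] load is still in flight */
float norm_late  = LAMBDA * oneOverR * oneOverR * oneOverR * massm[c];

/* array load first: the very first multiply already depends on the load,
   so the whole chain waits for it */
float norm_early = massm[c] * LAMBDA * oneOverR * oneOverR * oneOverR;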

Not sure whether this applies here, but for vector normalization I would suggest looking at the following two functions: rhypot{f} for 2D vectors and rnorm3d{f} for 3D vectors. They were specifically designed for this task.

If one needs neither overflow/underflow protection nor good accuracy, these may not always be the fastest choice for vector normalization, but they are definitely worth considering for a baseline implementation.
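
For the 2D case in this thread, that would look something like the sketch below (assuming d.x, d.y and massm[c] have the same meaning as in the earlier snippets):

{
  float oneOverR = rhypotf(d.x, d.y);   /* 1/sqrt(d.x*d.x + d.y*d.y), robust against overflow/underflow */
  float norm = LAMBDA * oneOverR * oneOverR * oneOverR * massm[c];   /* LAMBDA * m / r^3 */
  at.x += norm * d.x;
  at.y += norm * d.y;
}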

As for the scheduling optimizations mentioned by SPWorley, I would suggest using ‘const restrict’ pointers as aggressively as possible to increase the mobility of loads relative to other code, as loads from memory have much higher latency than arithmetic. See also the Best Practices Guide.
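
For example, a kernel signature along these lines (parameter names are made up for illustration, not taken from the actual code):

__global__ void bodyForcesKernel(const float  * __restrict__ massm,
                                 const float2 * __restrict__ pos,
                                 float2       * __restrict__ acc,
                                 int n);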

Norbert, are those normalization functions actually more efficient? I always thought that their real use was the extra logic to avoid overflow (or underflow) when forming the intermediate r^2 value.

Bonus trivia from a 25-year-old graphics gems book: a crude first-order approximation to the hypotenuse of |X| and |Y| is H ~= max(|X|,|Y|) + 0.5*min(|X|,|Y|).
In 3D, H ~= max(|X|,|Y|,|Z|) + 0.34*median(|X|,|Y|,|Z|) + 0.25*min(|X|,|Y|,|Z|).
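
In code, the 2D rule is something like this (a quick sketch, not copied from the book):

__host__ __device__ float crude_hypot(float x, float y)
{
    float ax = fabsf(x), ay = fabsf(y);
    return fmaxf(ax, ay) + 0.5f * fminf(ax, ay);   /* rough overestimate of sqrt(x*x + y*y) */
}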

I couldn’t say exactly how efficient rhypot() and rnorm3d() are these days. I know multiple rounds of optimizations were applied to those functions, I was responsible for one of those rounds. They should beat approaches based on division :-), and should be fairly close in performance to the quick & dirty version (but slightly slower), since NVIDIA understands that performance is important for these two functions.

The original idea behind rhypot() was to provide for fast but accurate and robust Givens rotations in QR decomposition. No sooner did that functionality ship in CUDA than there were requests for a 3D version, which became rnorm3d().

I understand that some CUDA programmers seek performance at all costs, I am more in the camp of trying to get robust and accurate solutions at the highest performance possible.

Hi,

The problem seems to be solved: I used sqrt(r2) instead of sqrtf(r2) in
the lab examples, while my variables were actually single precision.
Fast math will (afaik) always use the single-precision code, so it gave
an enormous improvement of a factor of 3.

I guess that arch 3.5 and 5.0 cards are more clever about using the
single-precision sqrtf when the argument is SP, so my enormous
improvement got lost.

The code on my own system was already fixed long ago, resulting in
different results if not using fast math.

After applying the improvement suggested by SPWorley I now have a
consistent 10-15% improvement when using fast math. Not as impressive
as before, but probably more realistic :-)

oneovrr = myrsqrt(r2);                                       /* = 1/r */
acdivr  = LAMBDA * massm[c] * oneovrr * oneovrr * oneovrr;   /* = ac * r^2 * 1/r^3 = ac/r */
at.x += acdivr * d.x;
at.y += acdivr * d.y;

One more remark :

According to the Cuda manuals, using fast math may slightly increase
the number of registers being used. This may push you over a threshold
if you already have many threads per block and many registers per
thread, with the result that some thread variables have to be stored
in local memory instead of registers, which can give you a huge
performance penalty.

Thanks for all the help !

Best,
Kees

Could you provide the exact reference from current documentation? I am not saying this is incorrect, but off the top of my head I can’t think of cases where this might apply and why. Normally I would expect --use_fast_math to lower register use, due to fewer temporary variables being required to evaluate device function intrinsics vs the standard math functions.

From e.g. : http://sailfish.us.edu.pl/_sources/performance.txt

Using fast math functions
^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA code can use a faster, but less precise version of several common mathematical
functions (e.g. transcendental functions such as sine, cosine, square root or the exponential function).
These so-called intrinsic functions will be used if the fast math mode is turned on, which can be done
using the --cuda-nvcc-opts=--use_fast_math command line option. This might slightly
increase the speed of some of the more complex LB models. If you decide to apply this
optimization, watch out for degraded precision (always run regression tests of our simulation)
and increased register usage.

But I do not know if it still applies ?

The document you linked is not part of the CUDA documentation, but looks like release notes for a GPU port of third-party software? It doesn’t even specifically say that register usage may increase with --use_fast_math; it just gives generic advice to watch out for it.

True, it sounds unlikely to me as well but I stumbled upon it while searching for clues. Didn’t check for more references, as I should have done of course :-)

On the other hand: the people who made Sailfish CFD are not just a bunch of amateurs, and they did explicitly mention it in their General tuning strategies, section Performance Tuning for Fast Math. I wonder where they got the idea?

From my own experience I would say that just because someone is a professional, or even an expert, does not mean they are always right :-) That is the reason there is some risk in “deferring to authority” in science and engineering.

I don’t know how old this write-up is, but my hypothesis is that the background of this comment is the following: even small changes to CUDA source code can be sufficient to change the generated machine code, and this in turn can easily cause register usage to fluctuate by +/- 2 registers. With older GPU architectures, registers were a very tight resource (and register allocation and spilling were not as optimized as today), so these natural fluctuations could show up as application-level performance differences.

With modern GPU architectures, registers are in relatively abundant supply, and the compilers are much more advanced (e.g. the change to the LLVM framework) and generally more mature. So small source code changes and small fluctuations in register usage are unlikely to have a noticeable effect on performance. In general, all three constituent parts of --use_fast_math (turning on FTZ mode, using approximate square root and division, use of intrinsics for certain math functions) should lead to a reduction in register use, making any negative performance effects from the use of --use_fast_math extremely unlikely.

But nothing is impossible, so if anyone has a worked example that shows performance degradation from use of --use_fast_math with CUDA version >= 7.5 and compute capability >= 3.0 I would be interested to hear about details of that software.