Weirdness when toggling the --use_fast_math compilation flag in CUDA 7.5

Usually I get a good boost from the --use_fast_math compilation flag, but oddly, for a rather large application, setting this flag results in a slower run time than without it, unless I also play with the maxrregcount value.

For the main workhorse kernel, the verbose compilation output reports 40 registers per thread without the --use_fast_math flag and 32 registers per thread with it. This is without any attempt to cap register usage at compile time.
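
For reference, those counts come from the ptxas verbose output; build lines along these lines (file name and architecture illustrative) report per-kernel register usage:

nvcc -O3 -Xptxas -v app.cu -o app
nvcc -O3 -Xptxas -v --use_fast_math app.cu -o app

ptxas then prints something like:

ptxas info    : Compiling entry function '_Z9workhorse...' for 'sm_35'
ptxas info    : Used 40 registers, 344 bytes cmem[0]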

The overall running time is about 2.3% faster without the --use_fast_math flag.
If I use the --use_fast_math compilation flag and set -maxrregcount=40, it reports 39 registers used, and the running time is about 5% faster than the original version.
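
That faster configuration corresponds to a build line like (file name again illustrative):

nvcc -O3 -Xptxas -v --use_fast_math -maxrregcount=40 app.cu -o app

Note that -maxrregcount applies to every kernel in the compilation unit; a per-kernel alternative is the __launch_bounds__ qualifier, e.g.

__global__ void __launch_bounds__(256) workhorse(float *out, const float *in, int n);

which lets ptxas budget registers for just that kernel based on the stated maximum block size.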

There is a slight loss of accuracy with the --use_fast_math flag relative to a reference volume, which is not a surprise.

I would guess that the implementations of the floating-point math functions use fewer registers with the --use_fast_math compilation flag, but what is going on when I raise the maxrregcount value for compilation?

Is that affecting some other portion of the code (each thread has a fairly large doubly nested loop), or can it also affect the implementations of the basic 32-bit floating-point operations (multiply, divide, addition/subtraction, sqrtf())?
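
For concreteness, the per-thread work is dominated by arithmetic of roughly this shape (a stripped-down sketch, not the actual kernel):

__global__ void workhorse(float *out, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // multiply + add contract to FFMA either way (-fmad=true is the default);
        // --use_fast_math additionally flushes denormals (-ftz=true)
        float s = a[i] * b[i] + a[i];
        // division: IEEE-rounded by default, approximate under -prec-div=false
        float q = s / (b[i] + 1.0f);
        // sqrtf(): IEEE-rounded by default, approximate under -prec-sqrt=false
        out[i] = sqrtf(q);
    }
}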

As far as performance differences go, based on extensive experience, I would consider a +/- 2% difference to be within the noise level, meaning a difference of 2.3% is barely outside that cut-off.

The difference may well come down to scheduling artifacts introduced by smallish changes in the generated machine code caused by use of --use_fast_math. Scheduling of load instructions in particular may be responsible. A reduction in register use with --use_fast_math is not unusual, as various operations, in particular various math functions, use simpler code requiring fewer temporary variables.

Lastly, --use_fast_math turns on -ftz=true, which speeds up some operations but may slow down others (overall it is typically a win). If memory serves, certain floating-point conversions are slower when FTZ is turned on. The reason for this is that the hardware supports both FTZ and denormal modes for some operations (e.g. FFMA, FMUL, FADD), FTZ mode only for others (e.g. MUFU.EX2, MUFU.LG2), and denormal mode only for floating-point conversions. This non-orthogonality of the hardware is fixed up at the PTX level, requiring short inlined emulation sequences for the missing flavor.
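
For anyone who wants to separate these effects: the nvcc documentation states that --use_fast_math implies the following individual switches, each of which can also be set on its own:

nvcc -O3 -ftz=true -prec-div=false -prec-sqrt=false -fmad=true app.cu -o app

Compiling with just -ftz=true isolates the denormal-flushing behavior from the approximate division and square root. The one thing the individual switches do not cover is the substitution of math library calls (e.g. sinf()) with their faster intrinsic counterparts (e.g. __sinf()), which only happens under --use_fast_math itself, though individual call sites can opt in by calling the intrinsics directly.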

Do you see the same slow-down effect for multiple architectures (e.g. Kepler and Maxwell), or just a particular architecture?

I am afraid that to really get to the bottom of these rather small discrepancies, one would have to look at the generated SASS in detail, with/without --use_fast_math and with/without -maxrregcount.
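
For reference, the toolkit's binary utilities make that comparison straightforward (file names and architecture illustrative):

nvcc -O3 -cubin -arch=sm_35 app.cu -o app_base.cubin
nvcc -O3 -cubin -arch=sm_35 --use_fast_math app.cu -o app_fast.cubin
cuobjdump -sass app_base.cubin > base.sass
cuobjdump -sass app_fast.cubin > fast.sass
diff base.sass fast.sass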

I have a (single precision) Nbody code which is about 3 times faster if I run on arch 2.0 or 3.0 and compile with --use_fast_math. However, on arch 3.5 and 5.0 it is actually SLOWER than without --use_fast_math!

I really don’t understand the reason for this. It seems to be slower on all the 3.5 and 5.0 cards I tried: a small K620 and a very big Titan X.
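
For what it’s worth, I build a single binary covering all the architectures I tested roughly like this (source file name illustrative):

nvcc -O3 --use_fast_math \
    -gencode arch=compute_20,code=sm_20 \
    -gencode arch=compute_30,code=sm_30 \
    -gencode arch=compute_35,code=sm_35 \
    -gencode arch=compute_50,code=sm_50 \
    nbody.cu -o nbody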

Does anybody have any suggestions? What changes in arch 3.5 or 5.0 could be responsible for this, and how do I get the same performance improvement on arch 3.5 and up?

Best,
Kees Lemmens, Delft.