A more accurate, performance-competitive implementation of expf()

I’ve run a few tests, and unfortunately ptxas doesn’t seem to make much use of the new scheduling freedom (it tends to emit a chain of dependent FMA instructions even though the FCMP, F2I, etc. could be interleaved with it), but nevertheless the FCMP path with -fmad=false closely matches the ICMP path in performance.
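To spell out what I mean by the two paths (a sketch; apart from ‘j’, the function names and the scale-selection constant are merely illustrative):

/* ICMP path: the predicate depends on the result of the F2I conversion */
__device__ int select_icmp (float j) { return ((int)j > 0) ? 0 : 0x83000000; }

/* FCMP path: the predicate issues independently of the F2I */
__device__ int select_fcmp (float j) { return (j > 0.0f) ? 0 : 0x83000000; }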

Also, you can use __fadd_rn() to specify explicitly that you want a correctly rounded addition, and thus suppress the undesirable optimization of ‘j > 0’:

j = __fadd_rn (fmaf (1.442695f, a, 12582912.f), -12582912.f);

(i.e. this way you don’t need to add the command-line option -fmad=false to get good codegen, at the cost of somewhat reduced code readability)
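For readers wondering what that line computes: adding 1.5 * 2**23 = 12582912.0f shifts a * log2(e) into a binade where the significand has no fraction bits, so the hardware’s round-to-nearest mode rounds it to an integer for free, and subtracting the constant again recovers that integer as a float. A self-contained sketch (the function name is just for illustration):

/* Compute rint (a * log2(e)) without an explicit rintf(), valid while
   a * log2(e) stays well inside (-2**22, +2**22). __fadd_rn() pins the
   subtraction to a plain correctly rounded add that the compiler may
   neither contract into an FMA nor fold away. */
__device__ float rint_log2e (float a)
{
    return __fadd_rn (fmaf (1.442695f, a, 12582912.f), -12582912.f);
}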

Default compilation uses -fmad=true (FMA contraction allowed) for performance reasons, so -fmad=false is not very relevant to performance-sensitive code. The switch -fmad=false was introduced fairly late (CUDA 4.0?), with the goal of easily addressing customer complaints that they were not getting results identical to those of their CPU code. So it is mostly a “debug” feature. Most of the time, the use of FMA actually gave customers more accurate results, but they did not want those :-)

I am aware of the __fadd_rn() trick in general (years ago, I suggested it to the NVIDIA compiler folks as a way to give programmers some control over unwanted contractions, by explicitly specifying a rounding mode on adds, subtracts, and multiplies), but I had not looked into it in this particular context, as I do not recall seeing the specific code you described above.
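As a toy illustration of both points (this little kernel is mine, purely for demonstration): the fused operation keeps the low-order bits of the product that the unfused sequence throws away, and the _rn intrinsics force the unfused behavior no matter what -fmad is set to.

#include <cstdio>

__global__ void fma_demo (float a)
{
    float c = -a * a;                                // product rounded to float, then negated
    float fused   = fmaf (a, a, c);                  // single rounding: residual of a*a survives
    float unfused = __fadd_rn (__fmul_rn (a, a), c); // two roundings: residual is lost
    printf ("fused = %.9e  unfused = %.9e\n", fused, unfused);
}

int main (void)
{
    fma_demo<<<1,1>>>(1.0f + 1.0f / 4096.0f);        // 1 + 2**-12
    cudaDeviceSynchronize ();
    return 0;
}

With FMA contraction, ‘a * a + c’ written in source compiles to the fused variant; a CPU without FMA produces the unfused answer (here: 2**-24 versus 0), which is exactly the kind of mismatch those complaints were about.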

I work on this code (and various other ones) on and off as it strikes my fancy. I want to have fun, and not make this into work (been there, done that, got a rack full of t-shirts). I may revisit this code at a future point to try other coding alternatives that may be more portable, while retaining performance.

I tweaked the core approximation to further increase the percentage of correctly-rounded results.

Applied yet another tweak to the core approximation to further increase the percentage of correctly-rounded results.

Minor tweak to the core approximation to improve overall accuracy. Changed the special-case handling so that performance is the same whether the code is compiled with -ftz=true or -ftz=false. Using nvcc 8.0.60 on Win64 with a Quadro K2200, throughput is now 35.1e9 function calls per second in either mode.
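For reference, a sketch of the overall structure these changelog entries refer to: the argument reduction from above, a core polynomial approximation of exp(f) on the reduced interval, and scaling by 2**i applied as two factors so that results near the underflow and overflow thresholds need no slow path. The function name, the coefficients, and the threshold constants below are illustrative (the coefficients approximate the Taylor expansion of exp), not the current tweaked set:

/* exp(a), sketch of the structure under discussion; constants are
   illustrative, not the latest tweaked set */
__device__ float my_expf (float a)
{
    float f, j, r, s, t;
    int i, ia;

    /* exp(a) = 2**i * exp(f); i = rint (a / log(2)) */
    j = __fadd_rn (fmaf (1.442695f, a, 12582912.f), -12582912.f); /* log2(e); 1.5 * 2**23 */
    f = fmaf (j, -6.93145752e-1f, a);  /* -log(2)_hi */
    f = fmaf (j, -1.42860677e-6f, f);  /* -log(2)_lo */
    i = (int)j;
    /* core approximation: r = exp(f) on [-log(2)/2, +log(2)/2] */
    r =             1.37805939e-3f;    /* ~1/6! */
    r = fmaf (r, f, 8.37312452e-3f);   /* ~1/5! */
    r = fmaf (r, f, 4.16695364e-2f);   /* ~1/4! */
    r = fmaf (r, f, 1.66664720e-1f);   /* ~1/3! */
    r = fmaf (r, f, 4.99999851e-1f);   /* ~1/2! */
    r = fmaf (r, f, 1.00000000e+0f);
    r = fmaf (r, f, 1.00000000e+0f);
    /* scale by 2**i in two steps, so that results just above the
       underflow threshold and just below the overflow threshold are
       produced without a special-case branch */
    ia = (i > 0) ? 0 : 0x83000000;
    s = __int_as_float (0x7f000000 + ia);
    t = __int_as_float ((i << 23) - ia);
    r = r * s;
    r = r * t;
    /* severe overflow / underflow: s*s yields +INF or 0 as needed */
    if (fabsf (a) >= 104.0f) r = s * s;
    return r;
}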