Among the numerous transcendental functions provided by standard math libraries, the exponential function is often the most performance sensitive. In other words, this function needs to be *fast*. For this reason, CUDA’s implementation of the single-precision exponential function, expf(), makes use of the fast hardware approximation to exp2(x) provided by the multi-function unit of the GPU via the MUFU.EX2 instruction. Unfortunately, this instruction isn’t very accurate, so that CUDA’s expf() winds up with a maximum error of 2.31917 ulp.

Typically, for functions like the exponential function that serve as the building block for many higher transcendental functions, a faithfully-rounded implementation is desired, meaning the error should be less than 1 ulp. There are in fact real-life use cases where such accuracy is more or less required.

The challenge in the context of CUDA is to provide such an implementation without a negative impact on performance. I have made several attempts to create such an implementation over the years, however due to the lack of a hardware support for ldexpf() functionality (used to scale floating-point numbers) in NVIDIA GPUs it is extremely challenging (from what I know, AMD GPUs do have such an instruction).

I have finally managed to come up with code that is performance competitive with CUDA 8’s implementation of expf(), *in default compilation mode*. On my Quadro K2200 (basically like a GTX 750 Ti, but with larger memory), CUDA 8’s built-in expf() has a throughput of 33.4e9 function evaluations per second, while my replacement provides throughput of 35.1e9 function evaluations per second. It should be noted that CUDA’s built-in implementation sees significant speed-up, to 39.3e9 function evaluations per second, when -ftz=true is specified. And it will be faster by factors if -use_fast_math is specified, as this causes expf() to be replaced by the device intrinsic __expf(), with much larger error.

This means that the code below is like only of interest for those who do not use -ftz=true or -use_fast_math, and desire improved accuracy. I figured it might still be useful to *some* people, so why not post it? The error bound for my_expf() is 0.86565 ulp, meaning the function is faithfully rounded.

*Code below updated 2/8/2017, 2/17/2017, 8/3/2017, 11/19/2017, 3/11/2017, 8/2/2021*

```
/*
Copyright (c) 2015-2021, Norbert Juffa
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
__device__ float my_expf (float a)
{
float f, r, j, s, t;
int i, ia;
// exp(a) = 2**i * exp(f); i = rintf (a / log(2))
j = fmaf (1.442695f, a, 12582912.f) - 12582912.f; // 0x1.715476p0, 0x1.8p23
f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1 // log_2_hi
f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo
i = (int)j;
// approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
r = 1.37805939e-3f; // 0x1.694000p-10
r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
// exp(a) = 2**i * r
ia = (i > 0) ? 0 : 0x83000000;
s = __int_as_float (0x7f000000 + ia);
t = __int_as_float ((i << 23) - ia);
r = r * s;
r = r * t;
// handle special cases: severe overflow / underflow
if (fabsf (a) >= 104.0f) r = s * s;
return r;
}
```