A faster and more accurate implementation of expm1f()

njuffa · February 23, 2017, 8:39pm

expm1(x) = exp(x) - 1, but accurate for |x| near zero. The function is most frequently used as a building block in financial computations and hyperbolic functions such as sinh().

When compared to CUDA 8’s built-in implementation, the implementation below achieves a throughput of 20.54e9 function evaluations per second on a Quadro K2200 (basically a GTX 750 Ti with larger memory) vs 18.85e9 for the implementation of CUDA 8, for a performance advantage of 9%. The maximum error is reduced from 1.44464 ulp in CUDA 8 to 1.00978 ulp. It would be nice if the function were faithfully rounded, i.e. maximum error < 1 ulp, but so far I have only managed to come close, with about a dozen inputs triggering error in excess of 1 ulp.

Given that expm1() isn’t all that frequently use, this is probably of limited utility, but after almost two years of tinkering with this every now and then I figured I might as well post the code. The code has been tested exhaustively for all possible inputs on a GPU with compute capability 5.0 using default compilation mode. Adjustments may be necessary when using other compilations modes or when porting to non-GPU platforms.

Code below superseded by the version in post #2 below.

/* [s] Copyright (c) 2015-2017, Norbert Juffa All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* Compute exponential base e minus 1. Maximum ulp error = 1.00978 i = rint(a/log(2)), f = a-i*log(2). Then expm1(a) = 2**i * (expm1(f)+1) - 1. Compute r = expm1(f). Then expm1(a)= 2 * (0.5 * 2**i * r + 0.5 * 2**i - 0.5). t = 0.5*2**i. Thus expm1(a) = 2*(t + fma(r, t, -0.5)) or 2*fma(r, t, t-0.5). When i == 1, expm1(a) = 2*(r + 0.5). NOTE: Scale factor b is only applied if i < 0 or i > 1 (should be power of 2) */ __device__ float my_expm1f_scaled_unchecked (float a, float b) { float f, j, r, t, u; int i; // exp(a) = 2**i * exp(f); i = rintf (a / log(2)) j = fmaf (1.442695f, a, 12582912.f); // 0x1.715476p0, 0x1.8p23 i = __float_as_int (j); j = j - 12582912.f; // 0x1.8p23 f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1 // log_2_hi f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo // approximate r = exp(f)-1 on interval [-log(2)/2, +log(2)/2] r = 1.98543072e-4f; r = fmaf (r, f, 1.39400735e-3f); r = fmaf (r, f, 8.33349116e-3f); r = fmaf (r, f, 4.16664928e-2f); r = fmaf (r, f, 1.66666716e-1f); r = fmaf (r, f, 5.00000000e-1f); t = (j == 1) ? (f + 0.5f) : f; r = fmaf (r, f, 0.0f); r = fmaf (r, f, t); // Scale result from primary approximation range to full range u = 0.5f * b; t = __int_as_float ((i << 23) + __float_as_int (u)); // b*2**(i-1) t = (j > 17) ? (fmaf (r, t, -u) + t) : fmaf (r, t, t - u); if (j == 1) t = r; if (j != 0) r = t + t; return r; } /* Compute exponential base e minus 1. max ulp err = 1.00978 */ __device__ float my_expm1f (float a) { float r; r = my_expm1f_scaled_unchecked (a, 1.0f); /* handle severe overflow and underflow */ if (fabsf (a - 1.0f) > 88.0f) { asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a)); r = fmaf (r, r, -1.0f); } return r; }[/s]

njuffa · December 18, 2019, 7:24am

I revisited the expm1f() implementation after almost three years and was able to create a faithfully-rounded version this time around, i.e. maximum error is < 1 ulp. It is even a bit faster than the code I had posted previously. On a Pascal-architecture GPU, this executes with about 14% higher throughput than the CUDA 8 built-in function.

Code below updated 7/29/2020, 8/16/2020, 1/28/2023

/*
  Copyright (c) 2015-2023 Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute exponential base e minus 1. Maximum ulp error = 0.997458

   i = rint(a/log(2)), f = a-i*log(2). Then expm1(a) = 2**i * (expm1(f)+1) - 1.
   Compute r = expm1(f). Then expm1(a)= 2 * (0.5 * 2**i * r + 0.5 * 2**i - 0.5).
   With t = 0.5*2**i, expm1(a) = 2*(r * t + t-0.5). However, for best accuracy,
   when i == 1, expm1(a)= 2*(r + 0.5), and when i == 0, expm1(a) = r.

   NOTE: Scale factor b is only applied if i < 0 or i > 1 (should be power of 2)
*/
__device__ float my_expm1f_scaled_unchecked (float a, float b)
{
    float f, j, r, s, t, u, v, x, y;
    int i;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f); // 0x1.715476p0, 0x1.8p23
#if PORTABLE
    j = j - 12582912.0f; // 0x1.8p23
    i = (int)j;
#else // PORTABLE    
    i = __float_as_int (j); // trailing bits contain integer
    j = j - 12582912.f; // 0x1.8p23
#endif // PORTABLE
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo

    // approximate r = exp(f)-1 on interval [-log(2)/2, +log(2)/2]
    s = f * f;
    if (a == 0.0f) s = a; // ensure -0 is passed through
#if MORE_ILP
    // err = 0.997458  ulp1 = 11094007
    r = fmaf (1.98423862e-4f, f, 1.39347673e-3f);//0x1.a02000p-13,0x1.6d4aa4p-10
    t = fmaf (8.33342969e-3f, f, 4.16667424e-2f);//0x1.1111e0p-7, 0x1.55557ep-5
    r = fmaf (r, s, t);
    r = fmaf (r, f, 1.66666701e-1f); // 0x1.55555ap-3
    r = fmaf (r, f, 4.99999970e-1f); // 0x1.fffffep-2
#else // MORE_ILP
    // err = 0.997458  ulp1 = 11081805
    r =             1.97350979e-4f;  // 0x1.9de000p-13
    r = fmaf (r, f, 1.39309070e-3f); // 0x1.6d30bcp-10
    r = fmaf (r, f, 8.33343994e-3f); // 0x1.1111f6p-7
    r = fmaf (r, f, 4.16668020e-2f); // 0x1.55559ep-5
    r = fmaf (r, f, 1.66666716e-1f); // 0x1.55555cp-3
    r = fmaf (r, f, 4.99999970e-1f); // 0x1.fffffep-2
#endif // MORE_ILP
    // if i == 0, expm1(a) = z
    // if i == 1, expm1(a) = 2*(r*(f*f)+f+0.5)
    // if (i < 0) || (i > 1) expm1(a) = 2*((r*(f*f)+f)*t-0.5+t)
    u = (j == 1) ? (f + 0.5f) : f;
    v = fmaf (r, s, u);
    s = 0.5f * b;
#if PORTABLE
    t = ldexpf (s, i);
#else // PORTABLE
    t = __int_as_float ((i << 23) + __float_as_int (s));
#endif // PORTABLE
    y = t - s;
    x = (t - y) - s; // double-float canonicalization of difference
    r = fmaf (v, t, x) + y;
    r = r + r;
    if (j == 0) r = v;
    if (j == 1) r = v + v;
    return r;
}

/* Compute exponential base e minus 1. max ulp err = 0.99746 */
__device__ float my_expm1f (float a)
{
    float r;

    r = my_expm1f_scaled_unchecked (a, 1.0f);
    /* handle severe overflow and underflow */
    if (fabsf (a - 1.0f) > 88.0f) {
        asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
        r = fmaf (r, r, -1.0f);
    }
    return r;
}

Topic		Replies	Views
Accuracy-optimized implementation of expm1f() without performance penalty CUDA Programming and Performance	6	261	July 10, 2025
A more accurate, performance-competitive implementation of expf() CUDA Programming and Performance	41	9283	July 24, 2025
More accurate version of exp2f() with no change in performance CUDA Programming and Performance	0	840	February 17, 2023
Anticipating future standards: Implementation of exp2m1f() CUDA Programming and Performance	1	607	August 11, 2021
Accuracy-optimized performance-neutral implementation of erfcxf() CUDA Programming and Performance	2	486	September 4, 2022
Faster and more accurate implementation of log1pf() CUDA Programming and Performance	19	3549	July 27, 2025
A more accurate and faster implementation of powf() CUDA Programming and Performance	6	4520	February 13, 2025
Optimized version of single-precision error function, erff() CUDA Programming and Performance	21	4738	December 25, 2017
exp() gives too much impresision with float CUDA Programming and Performance	11	9639	June 8, 2009
Expm1() problems - precision Linux	0	763	February 14, 2014

A faster and more accurate implementation of expm1f()

Related topics