Accuracy-optimized implementation of expm1f() without performance penalty

The function expm1(x) := exp(x)-1, often used in conjunction with log1p(), was introduced some forty years ago to avoid numerical issues with subtractive cancellation when using plain exp().

A typical example can be seen in this recent question on Math Stackexchange, which arose in the context of neural networks.

For a standard math library it is desirable for the maximum error of simple math functions to be less than 1 ulp, making such functions faithfully rounded. An independent report shows the maximum error in CUDA’s current implementation of the single-precision variant expm1f() as < 1.45 ulp:

Brian Gladman, Vincenzo Innocente, John Mather, Paul Zimmermann, Accuracy of Mathematical Functions in Single, Double, Double Extended, and Quadruple Precision. HAL preprint hal-03141101, Feb. 2024 (online).

I reproduced this particular finding of the report: the maximum error of expm1f(x) in CUDA 12.3 is 1.44463616 ulp at x = 17.0407963f. The following code shows a faithfully rounded implementation of expm1f() that does not incur a performance penalty compared with the built-in function from CUDA 12.3; on some GPU architectures it is likely even a bit faster.

[comment fixed in code below on 7/9/2025]

/*
  Copyright (c) 2015-2024, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute exponential base e minus 1. Maximum ulp error = 0.997458

   i = rint(a/log(2)), f = a-i*log(2). Then expm1(a) = 2**i * (expm1(f)+1) - 1.
   Compute r = expm1(f). Then expm1(a)= 2 * (0.5 * 2**i * r + 0.5 * 2**i - 0.5).
   With t = 0.5*2**i, expm1(a) = 2*(r * t + t-0.5). However, for best accuracy,
   when i == 1, expm1(a)= 2*(r + 0.5), and when i == 0, expm1(a) = r.

   NOTE: Scale factor b is only applied if i < 0 or i > 1 (b should be a power of 2)
*/
__device__ float my_expm1f_scaled_unchecked (float a, float b)
{
    float f, j, r, s, t, u, v, x, y;
    int i;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f); // 0x1.715476p0, 0x1.8p23
    i = __float_as_int (j); // trailing bits contain integer
    j = j - 12582912.f; // 0x1.8p23
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo

    // approximate r = exp(f)-1 on interval [-log(2)/2, +log(2)/2]
    s = f * f;
    if (a == 0.0f) s = a; // ensure -0 is passed through
    r = fmaf (1.98662281e-4f, f, 1.39354519e-3f);//0x1.a0a000p-13,0x1.6d4f3cp-10
    t = fmaf (8.33332818e-3f, f, 4.16667648e-2f);//0x1.111106p-7, 0x1.55558ap-5
    r = fmaf (r, s, t);
    r = fmaf (r, f, 1.66666716e-1f); // 0x1.55555cp-3
    r = fmaf (r, f, 4.99999970e-1f); // 0x1.fffffep-2

    // if i == 0, expm1(a) = r*(f*f)+f
    // if i == 1, expm1(a) = 2*(r*(f*f)+f+0.5)
    // if (i < 0) || (i > 1) expm1(a) = 2*((r*(f*f)+f)*t-0.5+t)
    u = (j == 1) ? (f + 0.5f) : f;
    v = fmaf (r, s, u);
    s = 0.5f * b;
    t = __int_as_float ((i << 23) + __float_as_int (s));
    y = t - s;
    x = t - y;
    x = x - s; // double-float canonicalization of difference
    r = fmaf (v, t, x);
    r = r + y;
    r = r + r;
    if (j == 0) r = v;
    if (j == 1) r = v + v;
    return r;
}

/* Compute exponential base e minus 1. max ulp err = 0.99746 */
__device__ float my_expm1f (float a)
{
    float r;

    r = my_expm1f_scaled_unchecked (a, 1.0f);
    /* handle severe overflow and underflow */
    if (fabsf (a - 1.0f) > 88.0f) {
        asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
        r = fmaf (r, r, -1.0f);
    }
    return r;
}

Really impressive work.

Is the polynomial you use a Remez polynomial?

Also, is there any documentation on the detail beyond the comments in the routine?

Thanks.

Yes, it’s a polynomial minimax approximation generated with the Remez algorithm, and then refined further using a proprietary heuristic lattice search. The hard part here is actually not puzzling out the coefficients, it is getting from the value returned by the core approximation to the final result with as little code, as few operations, as little branching, and as much accuracy as possible. I tackled that part in stages over multiple years. My recollection is a bit vague, but I think I revisited that portion of the code three times until finally achieving a faithfully rounded expm1f() implementation.


The primary line I do not understand is:

What is ‘z’? It is not even a variable or are my eyes deceiving me?

Or do you mean s which is the key just above z?

Thanks - Damian


Well spotted. Classical case of comment rot. The code organization and the variables involved changed significantly, yet the comment remained, making it highly misleading. I think (before first coffee, so take this with a grain of salt) the comment should read

// if i == 0, expm1(a) = r*(f*f)+f

Basically, just finish evaluating the polynomial in that case.

FWIW, expm1f from CUDA 12.8 is still listed with an error bound of 1.45 ulps in the latest version of the report I referenced in my original post:

Brian Gladman, Vincenzo Innocente, John Mather, Paul Zimmermann, Accuracy of Mathematical Functions in Single, Double, Double Extended, and Quadruple Precision. HAL preprint hal-03141101v8, Feb. 2025 (online)

[Later:]

I verified that the basic control flow that concludes the expm1f computation is as follows:

    if (j == 0) return r * s + f;
    if (j == 1) return 2 * (r * s + (f + 0.5f));
    t = ldexpf (0.5f * b, i);
    return 2 * ((r * s + f) * t + (t - 0.5f));

The actual code is considerably more convoluted to improve accuracy and eliminate branches.

Thanks for pointing out the issue with the incorrect comment; I will edit in the correction so as not to confuse future readers.

How did you know to round up in the case of j == 1 and then round down in the case of j < 0 || j > 1? Or was that just experimentation?

Sorry, I don’t understand what “rounding” you are referring to. Mathematically, the different formulas all compute exp(a)-1 from exp(f)-1 and are equivalent. But numerical behavior in finite-precision floating-point arithmetic differs from mathematical behavior.

If you compare the approaches chosen here with other open-source implementations, you will likely find similar but also different arrangements, likewise chosen for their numerical properties; I chose the ones used here because they seemed most suitable for a GPU implementation. One of my goals is to make maximum use of the beneficial numerical properties of fused multiply-add (FMA).

I did run experiments to explore which arrangements might be the most suitable ones. I am not a mathematician, and without experimentation I cannot tell whether a particular mathematically equivalent arrangement will achieve a maximum error of, say, 1.05 ulp vs 0.95 ulp.

Most algorithms used in math libraries are designed by mathematicians with extensive experience in numerical analysis and they may be able to do that, and might even do that intuitively, but I cannot. I am a typical engineer (computer science degree with a minor in mechanical engineering; some math background) who constantly cooks up new ideas (or extracts them from the literature), tries them out through exploratory implementations (experiments, if you want), and in the end retains what I found to work the best under a specific set of constraints.