A more accurate, performance-competitive implementation of expf()

Among the numerous transcendental functions provided by standard math libraries, the exponential function is often the most performance sensitive. In other words, this function needs to be fast. For this reason, CUDA’s implementation of the single-precision exponential function, expf(), makes use of the fast hardware approximation to exp2(x) provided by the multi-function unit of the GPU via the MUFU.EX2 instruction. Unfortunately, this instruction isn’t very accurate, so that CUDA’s expf() winds up with a maximum error of 2.31917 ulp.

Typically, for functions like the exponential function that serve as the building block for many higher transcendental functions, a faithfully-rounded implementation is desired, meaning the error should be less than 1 ulp. There are in fact real-life use cases where such accuracy is more or less required.

The challenge in the context of CUDA is to provide such an implementation without a negative impact on performance. I have made several attempts to create such an implementation over the years, however due to the lack of a hardware support for ldexpf() functionality (used to scale floating-point numbers) in NVIDIA GPUs it is extremely challenging (from what I know, AMD GPUs do have such an instruction).

I have finally managed to come up with code that is performance competitive with CUDA 8’s implementation of expf(), in default compilation mode. On my Quadro K2200 (basically like a GTX 750 Ti, but with larger memory), CUDA 8’s built-in expf() has a throughput of 33.4e9 function evaluations per second, while my replacement provides throughput of 35.1e9 function evaluations per second. It should be noted that CUDA’s built-in implementation sees significant speed-up, to 39.3e9 function evaluations per second, when -ftz=true is specified. And it will be faster by factors if -use_fast_math is specified, as this causes expf() to be replaced by the device intrinsic __expf(), with much larger error.

This means that the code below is like only of interest for those who do not use -ftz=true or -use_fast_math, and desire improved accuracy. I figured it might still be useful to some people, so why not post it? The error bound for my_expf() is 0.86565 ulp, meaning the function is faithfully rounded.

Code below updated 2/8/2017, 2/17/2017, 8/3/2017, 11/19/2017, 3/11/2017, 8/2/2021

/*
  Copyright (c) 2015-2021, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

__device__ float my_expf (float a)
{
    float f, r, j, s, t;
    int i, ia;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f) - 12582912.f; // 0x1.715476p0, 0x1.8p23
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo 
    i = (int)j;
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r =             1.37805939e-3f;  // 0x1.694000p-10
    r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
    r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
    r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**i * r
    ia = (i > 0) ?  0 : 0x83000000;
    s = __int_as_float (0x7f000000 + ia);
    t = __int_as_float ((i << 23) - ia);
    r = r * s;
    r = r * t;
    // handle special cases: severe overflow / underflow
    if (fabsf (a) >= 104.0f) r = s * s;
    return r;
}

I updated the code in my original post #1.

I boosted the performance of my_expf() by simplifying the scaling by 2**i and the handling of special cases. On a Quadro K2200 (sm_50), using CUDA 8.0 in default compilation mode, this increased the throughput to 35.8e9 function calls / second, versus 33.4e9 function calls / second for CUDA’s built-in function.

On the Quadro K2200, with -ftz=true, performance drops slightly to 33.4e9 function calls / second due to what looks like a flaw of some sort in the sm_50 code generator of PTXAS: One would expect the same machine code sequence for -ftz=false and -ftz=true, except for .FTZ suffixes on the FP32 instructions, and this is indeed the case when building for sm_2x and sm_3x targets. With sm_50, however, a slightly longer code sequence is produced with -ftz=true, in particular a predicated multiply is replaced by an unconditional multiply followed by a select (SEL) instruction.

I improved the accuracy slightly (higher number of correctly rounded results, but maximum ulp error unchanged).

Ha, you know about godbolt too!

You know, njuffa, there’s a C++ Slack team, cpplang. The creator of Godbolt posts there on occasion which is how I know about the tool. You might wanna consider joining! I even made a cuda channel but it doesn’t get much action there :P

Good job at beating Nvidia’s implementation too!

I learned of godbolt.org from posts on Stackoverflow, about a year ago. At first I thought, “What a cool name for a site”, until I noticed it is simply named after its creator, Matt Godbolt. I use it occasionally to evaluate coding alternatives across a variety of compilers; it would be too cumbersome to maintain my own collection.

My goal in “playing” with the standard mathematical functions is to explore how simple high-performance implementations with good accuracy can be made. It’s an interesting puzzle, but with practical benefits. Was it Einstein who said: “Make everything as simple as possible, but no simpler than that”?

Another such puzzle I am currently working on is how to emulate Kepler’s SIMD instructions on later GPU architectures, for example by making the best use of LOP3. Some bioinformatics codes used the instructions so this has some practical relevance as well. The one thing I can say is that the CUDA compiler has improved a lot since I last tackled the emulation four years ago. At that time I had to use quite a bit of inline PTX to get good code, now it is sufficient to code at C++ level. The LOP3 generation is a bit iffy though: not every logical expression of three variables is handled equally well, even if the only difference is swapping two terms (A logic-op B, vs B logic-op A).

I am trying to understand some of the genius that went into this routine. I am struggling with these lines though:

f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo

Why are you multiplying with the hi part first, when multiplying with the low part first would get you closer to the correctly rounded result?

This is a standard technique known as a “Cody-Waite argument reduction”. I am not entirely sure that these two gentlemen invented it, but at minimum they popularized it in their book “Software Manual for the Elementary Functions” (1970s).

It is basically a way of preserving full precision in an argument reduction affected by subtractive cancellation, by representing the necessary constant as a sum of values of decreasing magnitude, such that each partial product, except for the least significant one, can be represented exactly. This does not require the availability of FMA. Here, the most significant part has >= 8 trailing zero bits, because the useful range of ‘i’ requires eight bits.

When using FMA, it is not necessary to use that many trailing zeros in each constant in the chain, as FMA uses the full product internally. See for example my_sincosf() that I posted a while ago, which uses a three-stage Cody-Waite reduction. A useful reference is: Sylvie Boldo, Marc Daumas, and Ren Cang Li, “Formally Verified Argument Reduction with a Fused-Multiply-Add” (ArXiv draft: https://arxiv.org/abs/0708.3722).

I could have changed the constants here since FMA is definitely available, but I was too lazy to change the constants I have used for the past twenty-five years :-) The difference in accuracy would be negligible, as the current computation already provides a sufficient number of bits.

There isn’t actually any genius in this code, just a lot of fiddly work to reduce each stage to the minimum number of operations possible. It starts off with a classical “round to integer by magic number addition” (I first saw this in the 1990s in some IBM publication), followed by a Cody-Waite reduction (1970s), followed by a polynomial minimax approximation (going back to the Remez algorithm of 1927), followed by a custom implementation of ldexpf(), which took numerous hours and about ten different design approaches to reduce to the shortest possible code. In terms of special cases, NaN are handled correctly by the fastpath, leaving only cases of severe overflow and underflow to be handled by a conditional multiply.

Kepler cards supported SIMD? O_o

That’d be amazing! I love CUDA because of SIMT but if I could SIMD while I SIMT, that’d be even better. I actually have genuine usages for SIMD in some of my kernels as well.

take a look at the simd instructions in PTX:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#simd-video-instructions

or intrinsics:

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html#group__CUDA__MATH__INTRINSIC__SIMD

whether or not these actually map to a single SASS instruction varies by hardware.

A terrific example of Kepler’s bytewise SIMD used for compute. It’s about the only example, but it was a great one.

Not fully generalized SIMD, but a few key in-register operations. See the SIMD intrinsics in the documentation:

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html#group__CUDA__MATH__INTRINSIC__SIMD

Useful enough for some genetics codes, for example:

Hanyu Jiang and Narayan Ganesan, “CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU”. BMC Bioinformatics (2016) 17:106 (online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4769571/)

which you can find online. I now see that SPWorley already pointed to this older publication:

Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt, “CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions”. BMC Bioinformatics (2013) 14:117 (online: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-117)

The only SIMD instructions retained for Maxwell and beyond seem to be the absolute differences and sum of absolute differences (SAD) instructions. Everything else needs to be emulated. Here is a fully worked example for the vcmpltu4() operation [at the end, everything else are building blocks to compose this operation in modular fashion]:

#define UINT32_H4  0x80808080U   // byte-wise sign bits (MSBs)
#define UINT32_L4  0x01010101U   // byte-wise LSBs 

/* extract sign bits and convert them into a boolean, byte-wise */
static __host__ __device__ uint32_t sign_to_bool4 (uint32_t a)
{
    return (a & UINT32_H4) >> 7;
}

/* extend sign bits into mask, byte-wise */
static __host__ __device__ uint32_t sign_to_mask4 (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
    asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
    a = a & UINT32_H4;    // isolate sign bits
    a = a + a - (a >> 7); // extend them to full byte to create mask
#endif
    return a;
}

/* extend boolean into mask, byte-wise */
static __host__ __device__ uint32_t bool_to_mask4 (uint32_t a)
{
    return (a << 8) - a;
}

#if (__CUDA_ARCH__ >= 500) // take advantage of LOP3
static __host__ __device__ uint32_t ltu4_core (uint32_t a, uint32_t b)
{
    /* Sebastiano Vigna, "Broadword implementation of rank/select queries." 
       In: International Workshop on Experimental and Efficient Algorithms, 
       pp. 154-168, Springer Berlin Heidelberg, 2008.
    */
    return (((a | UINT32_H4) - (b & ~UINT32_H4)) | (a ^ b)) ^ (a | ~b);
}
#endif // (__CUDA_ARCH__ >= 500)

static __host__ __device__ uint32_t vhaddu4 (uint32_t a, uint32_t b)
{
    /* Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
       https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
       (A+B)/2 = (A AND B) + (A XOR B)/2.
    */
    return (a & b) + (((a ^ b) >> 1) & ~UINT32_H4);
}

static __host__ __device__ uint32_t vsetltu4 (uint32_t a, uint32_t b)
{
#if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
    uint32_t t = 0;
    asm ("vset4.u32.u32.lt %0,%1,%2,%0;" : "+r"(t) : "r"(a), "r"(b));
    return t;
#elif (__CUDA_ARCH__ >= 500)
    return sign_to_bool4 (ltu4_core (a, b));
#else // generic path
    return sign_to_bool4 (vhaddu4 (~a, b));
#endif
}

static __host__ __device__ uint32_t vcmpltu4 (uint32_t a, uint32_t b)
{
#if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
    return bool_to_mask4 (vsetltu4 (a, b));
#elif (__CUDA_ARCH__ >= 500)
    return sign_to_mask4 (ltu4_core (a, b));
#else // generic path
    return sign_to_mask4 (vhaddu4 (~a, b));
#endif
}

this code is buggy:

i = (int)j;

i is unspecified for large j (language standard), then i is used to compute the result for large inputs, so the sign check should be done differently:

ia = (i > 0) ?  0 : 0x83000000;

This is optimized code written for a specific platform (NVIDIA GPUs with CUDA), where the overflow behavior provided by the ‘float’ to ‘int’ conversion in the hardware is known. It clamps to the end of the range, so the sign of the conversion result is correct.

This CUDA implementation has been tested exhaustively for all 2**32 possible inputs. Re-using the code for other platforms may require adjustments, in particular the safety check at the end may need to be made independent of the computation of ‘i’, e.g. like so:

if (fabsf (a) >= 104.0f) {
    r = (a > 0.0f) ? (1e38f * 1e38f) : 0.0f;
}

However, in my experiments this reduced function throughput by 2%, which is why I opted for the re-use of the scale factor ‘s’ (which depends on ‘i’, as you noted). For arguments |a| < 104 the conversion from ‘j’ to ‘i’ is perfectly safe even in the abstract sense of the C++ standard. The basic design here is common for optimized CUDA implementations, where results from a common path, which may well return gibberish for certain inputs (e.g. here the computation of ‘j’ for large inputs), are conditionally overwritten with well-defined special-case results at the end.

The reason the computation of ‘ia’ (the “i-adjustment”) is written the way it is, is to take advantage of the ‘ICMP’ machine instructions (a select-type instruction) which implements the present computation in a single instruction. Writing optimized math functions with CUDA is really an exercise in writing the desired assembly code indirectly, through the compiler. This approach is brittle, but not as brittle as it used to be.

this code can be found by a web search and then misused if there is no indication that it is a non-portable implementation.

note that this is valid reasoning if the code is for “iso c with annex f”, but if c++ semantics is assumed then the behaviour is undefined, so instead of producing gibberish temporarily, the compiler is free to drop the affected code paths (or break the generated code in other ways), iow this is dangerous in the hands of users who don’t know the implications.

I am familiar with the concepts of portable C and C++, own copies of the relevant ISO standards, and have read them in part.

This is besides the point, however, in that I wrote the code specifically for CUDA with full knowledge of platform-specific properties. For example, I know that for single-precision computation, only canonical NaNs are use, and can take advantage of that if I choose to (no need to preserve payload). Similarly, I can take advantage of float->int conversion being well-defined in cases of overflow. I could right shift signed negative integers knowing this will result in an arithmetic right shift. Etc, etc.

Just because I made this code available does not mean there is a requirement for me to point out any and all potential caveats, including limited portability to platforms other than CUDA, nor is there an implied guarantee that the code even works (on any platform), or that it would never break with future CUDA toolchains. The code is provided “as is”, as noted in the accompanying two-clause BSD license.

I do not see how this code is different from the myriad other pieces of open-source code floating around the internet that come with no detailed design discussion whatsoever, and are in fact not strictly portable and/or buggy. This includes (or at least, included) code in well known projects like glibc. In all these cases, people are free to use the code offered as they see fit (as long as they are compliant with the license requirements), make adjustments needed for other platforms or full standard portability, and test for compliance with their specific requirements. Caveat emptor.

That said, I try not to turn out crud that does not work, by applying a reasonable amount of testing. Not nearly as much as would have to go into a CUDA production release of such code (which I used to do for a living). That would kill all of my fun, which is in tinkering with algorithms and optimizations, then sharing some of the results / ideas.

The code, as posted, includes

__device__

This is indication of non-portability and only works with the nvcc compiler, and even then can only target a CUDA capable GPU.

If your concern is around those who make modifications to code that they find, then all such code that can be found by a web search is susceptible to such “misuse”.

@njuffa, I’m sure nsz meant no offence. Needless to say, sharing optimized implementations like this is extremely appreciated. But let me point out that apart from CUDA-specific tuning, there’s a more general aspect of this implementation being well-suited to any fma-capable platform, and in that sense I hope a discussion of potential improvements for those can take place :)

I think the code is easily fixed for platforms that don’t guarantee to preserve sign on overflowing float-to-int conversions by computing ia with ‘ia = (j > 0) ? 0 : 0x83000000;’ (that is, using j instead of i), since for j near zero it doesn’t matter with ia you choose. Moreover, I think it also should have same-or-better performance since it gives slightly higher scheduling freedom for that region of code. In fact, nsz measured a 5% improvement (on one benchmark, on one non-gpu platform).

Does that look correct?

No offense taken. I was trying to correct an apparent misconception that this code comes with guarantees of any sort, or that posting the code incurs additional responsibilities on my part.

You do not state the non GPU-platform used, but I am guessing it is x64, as that definitely would fail due to overflow in conversion of ‘j’ to ‘i’: this always produces 0x80000000, regardless of the sign of ‘j’.

As I said, the arrangement of this code is strictly tuned to specific instructions available on NVIDIA GPUs, and under the consideration of the relative throughput for various instruction classes (e.g. ‘float’ computation is often faster than ‘int’ computation). In the specific case of ‘ia’, making the choice dependent on ‘j’ emits a two-instruction sequence, whereas using ‘i’ results in a single instruction. The use of ‘j’ therefore makes the code slower when I measure it on my sm_50 GPU.

The situation may well be different on other hardware platforms and with other toolchains. Making the decision dependent on ‘a’ may be safest from a strict portability standpoint and may help performance by allowing the comparison to be scheduled early. The “rint() by magic addition” computation that produces ‘j’ will do strange things for large inputs, although it is at least deterministic under abstract HLL semantics (assuming fma() works properly with IEEE-754 (2008) semantics, which is not the case for some versions of gcc).

FWIW, I am still tinkering with the core approximation used here, in the hopes of increasing accuracy further.

In addition to portability of the ‘float’ -> ‘int’ conversion already discussed, when porting the posted code to other platforms (where it may be intended as a replacement for a platform specific math library), the following items may have to be addressed:

(1) The code pays no attention to the raising of floating-point exceptions, whether for the final result or spurious ones caused by intermediate computation [NVIDIA GPUs do not implement floating-point flags]

(2) There is no handling of ‘errno’, e.g. for ERANGE or EDOM exceptional conditions [the CUDA standard math library does not implement this piece of global state; impractical in an environment with 10Ks of threads]

(3) The code pays no attention to the handling of NaN “payloads” [NVIDIA GPUs use only one canonical NaN encoding in single-precision computation, which is compliant with IEEE-754 (2008); “payloads” are only a “should” requirement]

There may be additional issues to consider, this is what came to mind momentarily.

@njuffa, are you passing ‘–fmad false’ to nvcc? I get the bad behavior you describe (1 extra insn) without that option, due to ptxas being too smart and emitting ‘j > 0’ as ‘fmaf (1.442695f, a, 12582912.f) > 12582912.f’. However, ‘–fmad false’ forces ptxas to suppress that ‘optimization’, and as as result you get the same instruction count, with your ICMP changed to FCMP, plus a few scheduling differences.

Funny (peculiar). I remember trying to get the CUDA compiler to produce FCMP, but it would not. For what it is worth the compiler would only give me ICMP with certain conditions (such as “> 0”), otherwise it would produce a two-instruction sequence applying a mask created by arithmetic right shift by 31 (sloooow!). As I said, trying to get a specific desired code sequence in SASS from HLL is brittle; when I was the maintainer of the CUDA standard math library at NVIDIA I had to re-write portions of it for every new compiler version to account for changes in code-generation artifacts. At that time the CUDA standard math library was delivered as header file only, due to technical limitations.

I did not revisit the FCMP / ICMP issue after switching to CUDA 8.0 recently, maybe I should have. My recent changes were strictly focused on cleaning up the scaling and special-case handling, without revisiting the upstream code.

In the context of CUDA, code must work properly with -ftz=true and -fmad=false, and I observe that this can change the generated code (at least for sm_50), which is unexpected for code like this. These flags should be orthogonal to this code. For performance measurements I usually limit myself to default compilation mode (so, -ftz=false, -fmad=true), I do not want to create a full-blown test framework that cycles the code through various compiler switches. Similarly I only test on one architecture (sm_50 at the moment).