Optimized version of single-precision error function, erff()

The single-precision error function, erff(), in CUDA 7.5 is already very fast. For various reasons I spent some time over the holidays trying to build a replacement that is even faster. As it turned out, that is quite hard to do. I managed a mere 4% performance increase as measured on my K2200 (sm_50): function throughput increased from 28.17 billion function calls per second with CUDA 7.5 to 29.32 billion function calls per second with my custom implementation.

I did manage a significant improvement in accuracy, however, with the maximum error of my implementation coming in at 1.29532 ulp, versus the maximum error of 2.33520 ulp of erff() in CUDA 7.5. As you can see from the code below, the two core approximations used are not evenly matched in terms of their maximum error. Therefore, overall accuracy could be improved even further by moving the switchover point between the two approximations down to about 0.85 instead of 1.0. However, moving down the switchover point even slightly resulted in a performance decrease relative to CUDA 7.5 in my tests.

[code below updated 10/13/2016, 7/24/2017, 11/12/2021]

/*
  Copyright (c) 2016-2021, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* compute error function. max ulp error = 1.29532 */
__device__ float my_erff (float a)
{
    float r, s, t, u;
 
    t = fabsf (a);
    s = a * a;
    if (t > 1.0f) {
        // ulp error = 1.03872
        r =              2.58907676e-5f;  //  0x1.b26000p-16
        r = fmaf (r, t, -5.64874616e-4f); // -0x1.282830p-11
        u =              5.66795561e-3f;  //  0x1.737484p-8
        u = fmaf (u, t, -3.51759829e-2f); // -0x1.202962p-5
        r = fmaf (r, s, u);
        r = fmaf (r, t,  1.54329389e-1f); //  0x1.3c110cp-3
        r = fmaf (r, t,  9.15674746e-1f); //  0x1.d4d352p-1
        r = fmaf (r, t,  6.28459513e-1f); //  0x1.41c572p-1
        r = fmaf (r, t, t);
        r = -r;
        asm ("ex2.approx.ftz.f32 %0,%0;" : "+f"(r));
        r = 1.0f - r;
        r = __int_as_float ((__float_as_int (a) & 0x80000000) | __float_as_int (r));
    } else {
        // ulp error = 1.29532
        r =             -5.58853149e-4f;  // -0x1.250000p-11
        r = fmaf (r, s,  4.90735564e-3f); //  0x1.419bc4p-8
        r = fmaf (r, s, -2.67030653e-2f); // -0x1.b580c6p-6
        r = fmaf (r, s,  1.12799220e-1f); //  0x1.ce068ep-4
        r = fmaf (r, s, -3.76123011e-1f); // -0x1.812664p-2
        r = fmaf (r, s,  1.28379121e-1f); //  0x1.06eba2p-3
        r = fmaf (r, a, a);
    } 
    return r;
}

I have updated the code above as follows:

  1. Improved accuracy
  2. Improved energy efficiency by reducing least-significant coefficients to “half precision” (12 trailing zero bits), allowing them to become floating-point immediates that are part of the floating-point instruction

njuffa, you’re always posting such awesome stuff!

I think it would be great to collect these gems on e.g. GitHub, because in case the NVIDIA forum software is hacked or the server crashes catastrophically, all of this goodness might be lost. ;-/

There is still hope I might get more organized in the future :-) But I also hope that my code will become superfluous in due course because CUDA will offer the same or better accuracy and performance.

My “goals” in posting some sample implementations of math functions here and at Stackoverflow:

(1) Promote the aggressive use of FMA for performance and accuracy.
(2) Promote use of machine-optimized polynomial approximations for math functions
(3) Promote (partial) use of Estrin’s scheme for polynomial evaluation to improve ILP
(4) Lately [still embryonic]: Promote energy awareness in computation

If I were an academic, I’d publish papers about this, and if I were more modern, I’d blog about it. Both would arguably be a more impactful way to “get the message out”, but somehow I don’t feel the urge to write these days (it was different in my student days).

Instead I simply tinker with things whenever I come across an interesting application, and if the results seem useful, why not share them?

I always get a grin on my face when I see a new or updated Juffarific optimization post.

I really loved learning about the immediate 16-bit float power optimization! I’m not a SASS ninja and hadn’t even realized there was such a SASS instruction mode, but it makes a lot of sense. And with Maxwell and Pascal GPUs, such power savings really are also effectively speed enhancements, since those GPUs run with dynamic boost clocks.

I like how your optimization is tuned using evaluations on the GPU itself, allowing you to use CUDA’s fast approximate exp2 builtin function but still empirically optimize the worst-case ULP error. Hopefully NVIDIA doesn’t modify the builtin’s implementation, since any change at all would likely affect my_erff()’s accuracy.

Two Norbert questions.
First quick one: Is your sign-setting float_as_int chain better than using the builtin C copysignf()?

With the long-tail erf() you obviously had to break up the evaluation into the smooth central lobe and the long exponential-like tail. Your comments talk about choosing the transition at t=1.0 to optimize error. But surely the transition choice has a dramatic impact on warp divergence and therefore speed. If you make the assumption that the input argument is a random sample distributed as a unit Gaussian (a very arguable assumption, but reasonable), then the cutoff of t=1.0 means 16% of the evaluations will take the “tail” branch. In a warp of 32 independent erf() evaluations (another assumption), the GPU will have to evaluate both code branches 99.6% of the time. If the cutoff were at t=2, the GPU would diverge only 14% of the time. t=2.5 diverges 1.3%, and t=3.0 would effectively never diverge. And for t>3.75, erf() is constant at 1.0 in floating-point representation.

Given this assumption, does the fit quality really suffer when t cutoff is increased to 2.0 or more? Even if you needed an extra two or three FMA evaluations to keep the center lobe accuracy at ~1 ULP, it may be a speed win since the GPU won’t have to evaluate both branches nearly as often.

Of course, then it becomes a question of how to define the speed measurement of such a function… you must assume some input distribution, be it Gaussian, a wide uniform range like [-10,10], or “all 2^32 representable floats”. And then you must also decide whether each thread in a warp gets an identical argument, independently sampled arguments, or some other correlation.

For accuracy, clearly testing all 2^32 representable values is correct.

This is interesting. How do you mean? How does one become aware of their energy usage? If I want high throughput, isn’t this mutually exclusive?

Granted, I do undervolt my CPU :P

Re energy efficiency: in addition to the general principle of favoring computation over memory access, I currently pay attention to just a few specific items:

(1) A floating-point add takes about 1/2 to 1/4 the energy of a floating-point multiply. This is based on data from Mark Horowitz’s team at Stanford. So one would want to prefer additions/subtractions where possible (= performance neutral). This is pretty much platform independent.

Compilers may take advantage of that automatically, but other than for FMA contractions, the CUDA compiler does not re-associate floating-point expressions, which is a Good Thing™ for programmer sanity, but may cause it to miss opportunities to use add instead of multiply.
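As a toy illustration of the add-for-multiply substitution (a hypothetical helper of my own, not from the code above), doubling can be written as an addition, which is exact for every float:

```c
/* Hypothetical example: a multiply by 2.0f expressed as an addition.
   x + x produces bit-identical results to 2.0f * x for every float
   (including infinities and NaNs), so the rewrite is performance
   neutral while using the cheaper operation. */
static float twice (float x)
{
    return x + x;
}
```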

(2) On the GPUs specifically, multiplications with ‘float’ constants that have the 12 least significant bits zero allow the compiler to incorporate the constant into the instruction (FMUL or FFMA) as an immediate, which saves the energy of retrieving the value from constant memory (cache access has energy use of >= 10x an arithmetic operation, if I recall Horowitz’s data correctly).
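A host-side sketch of what “reducing” a coefficient means here (my own illustration, not NVIDIA’s actual encoding rules): round the constant so its 12 least significant mantissa bits are zero, leaving 20 bits that fit an immediate field:

```c
#include <stdint.h>
#include <string.h>

/* Round a float coefficient to the nearest value whose 12 least
   significant mantissa bits are zero, so the compiler can encode it as
   an immediate in FMUL/FFMA instead of loading it from constant memory.
   Sketch only; the compiler applies its own rules. */
static float to_immediate (float x)
{
    uint32_t i;
    memcpy (&i, &x, sizeof i);   /* type-pun to raw bits */
    i = (i + 0x800u) & ~0xfffu;  /* round to nearest multiple of 2^12 ulps */
    memcpy (&x, &i, sizeof x);
    return x;
}
```

The leading coefficients in my_erff() above, such as -0x1.250000p-11, already have this form.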

(3) Shifts and shift-adds require less energy than an integer multiply. The CUDA compiler mostly performs this conversion automagically, but it also uses some clever (yet potentially energetically expensive) idioms based on multiply-high, e.g. for sign extension, if I recall correctly.
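For example, this is the kind of strength reduction the compiler typically applies on its own:

```c
#include <stdint.h>

/* Integer multiply by 10 via shift-add: 10*x == 8*x + 2*x. The CUDA
   compiler performs this strength reduction automatically for suitable
   constants; shown here only to illustrate the transformation. */
static uint32_t mul10 (uint32_t x)
{
    return (x << 3) + (x << 1);
}
```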

From what I can tell, the field of coding for energy efficiency is still in its infancy, and it is not clear how much difference software can make. One could also argue that classic optimization strategies such as minimizing instruction count, both dynamic (energy for instruction fetch & execution) and static (energy for ICache fills) lead to a reduction in energy use.

[Later:] Relevant presentation by Horowitz: https://www.futurearchs.org/sites/default/files/horowitz-ComputingEnergyISSCC.pdf

Re copysignf(): When it is known that the first operand is positive, its sign bit does not need to be cleared before inserting the sign bit of the second operand. One instruction saved. [Later:] I suspect that on architectures with support for LOP3 we do not actually save an instruction this way, I should take a closer look to update my knowledge base.
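A host-side sketch of that saving (my own illustration; the device code above uses __float_as_int / __int_as_float instead of memcpy): when the magnitude operand is known nonnegative, its sign bit is already clear, so a single OR transfers the sign, whereas a generic copysignf() also needs an AND to clear the old sign bit first.

```c
#include <stdint.h>
#include <string.h>

/* Transfer the sign of 'sign_src' onto 'mag', which is assumed to be
   nonnegative (sign bit already clear), so no AND is needed. */
static float copysign_nonneg (float mag, float sign_src)
{
    uint32_t m, s;
    memcpy (&m, &mag, sizeof m);
    memcpy (&s, &sign_src, sizeof s);
    m |= s & 0x80000000u;  /* OR in the sign bit of the second operand */
    memcpy (&mag, &m, sizeof mag);
    return mag;
}
```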

Re switchover between core approximations: There is no one optimal strategy, but generally speaking polynomial approximations tend to become very iffy (large errors) if the magnitude of the terms x^2, x^3, etc. increases instead of decreasing, even with FMA-based evaluation. So that gives a practical bound of 1.0 in the case of erff(). From an accuracy standpoint, it would be better to move the switchover point slightly below 1.0; see my Q&A on erff() at Stackoverflow (http://stackoverflow.com/questions/35148198/efficient-faithfully-rounded-implementation-of-error-function-erff)

However, when comparing to existing implementations, my goal is to either (1) improve accuracy while maintaining performance, or (2) improve performance while maintaining accuracy, or (3) improve both accuracy and performance. In this way, potential users will not face new trade-offs. Since CUDA uses 1.0 as the switchover point (easy to see from disassembled code), and moving that down decreased performance slightly in my tests, I kept the switchover at 1.0.

Divergence can be avoided by computing both code branches and then applying a late selection. But we don’t know the argument distribution pattern up-front (different for each use case), so this approach could help or may hurt (increase in dynamic instruction count). Again to avoid surprises, I kept the existing branchy structure.
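A sketch of that late-selection idea, with hypothetical poly_core()/poly_tail() stand-ins for the two approximations (not the real polynomials):

```c
#include <math.h>

/* Hypothetical stand-ins for the two core approximations. */
static float poly_core (float a) { return a; }                   /* |a| <= 1 path */
static float poly_tail (float a) { return copysignf (1.0f, a); } /* |a| >  1 path */

/* Evaluate both paths unconditionally, then select. On the GPU the
   ternary compiles to a select instruction, so there is no divergence,
   but every thread pays for both polynomials. */
static float erf_select (float a)
{
    float lo = poly_core (a);
    float hi = poly_tail (a);
    return (fabsf (a) > 1.0f) ? hi : lo;
}
```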

Ah, good points about copysignf, both about knowing the destination is already nonnegative, and also that LOP3LUT may make it one operation anyway.

That t=1 crossover is indeed a major tuning point in terms of performance and accuracy. Your benchmark shows that a value of 1.0 is a local performance optimum, but under what test condition? All threads in a warp computing identical, or independent values? Selecting arguments with uniform density over a range, a Gaussian, or exhaustive all 2^32 float representations?

I did not say that the switchover at 1.0 is a performance optimum, just that it keeps the overall performance characteristics the same as CUDA’s, avoiding any negative surprises, and it supports good accuracy.

In my tests, the distribution of arguments to erff() was whatever the application delivered; I did not try to characterize it. Your point about distributions is valid and interesting, but I am not aware of any studies on the distribution of numeric arguments to math operations or functions, other than very old work from the 1960s dealing with floating-point addition (where they were trying to establish the number of (de-)normalization shifts typically required), and the general Benford’s Law, which seems orthogonal to the switchover-point issue.

This two-path design for erff() is unfortunate; I wish I had better ideas for how to avoid it while remaining both fast and accurate. On the other hand, performance is still quite good even when divergence is encountered. I know how to do a single-path erfc(), but that approach is itself expensive and doesn’t really help with erf().

How does CUDA implement its fast exp2() PTX instruction? Is it really a special hardware unit, or is it expanded into microcode that does the usual floating-point exponent + mantissa correction term, similar to the way you implemented DP exp()?

If it’s just expanded microcode, perhaps the ex2.approx could be replaced with your own routine (similar to your DP exp() code). That would make the function robust to CUDA toolkit changes, but even more importantly it might allow better tweaking of the final output quality, since the exp2() FMA coefficients could be retuned specifically for this erff() evaluation to improve error, or maybe even lower the needed polynomial order, giving higher speed.

I’m also curious whether the log2, rsqrt(), and sin/cos “approx” operations in CUDA PTX are hardware accelerated too, or whether they’re wrappers around Juffa-esque optimized bit manipulation + FMA. I suspect they must be hardware boosted, because otherwise 1) they’d probably be expanded in a header file, not microcode, and 2) you wouldn’t have used ex2.approx for this erff() function instead of a bespoke version.

The MUFU (multi-function unit) that provides the EX2 functionality is a hardware unit based on quadratic interpolation in tables. The computation is done in fixed point, which is why, other than for RCP and RSQ, the RRO (range-reduction operation) instruction must be used first to transform the argument into the internal fixed-point representation. You can see the MUFU operations when you disassemble SASS code: MUFU.{RCP, RSQ, SIN, COS, LG2, EX2}

Stuart F. Oberman and Michael Y. Siu, “A High-Performance Area-Efficient Multifunction Interpolator,” in 17th IEEE Symposium on Computer Arithmetic, 2005, pp. 272-279. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4275&rep=rep1&type=pdf

Thanks always for your elucidation! With references!

I had not ever looked at the SASS for the special approximation calls (because I’ve never explicitly used them.) Every time you post a new efficient function implementation I study it to learn more about the art and science of floating point. Thanks, Professor Juffa!

Code in #1 was updated with more accurate approximations.

You wouldn’t by chance have an implementation of the complementary error function erfc()?

I am worried about the loss of precision if I use 1.0f - my_erff(x).

Maybe going through double-precision arithmetic with (float)(1.0 - my_erff(x)) is a reasonable workaround to maintain precision?

Christian

I am still tinkering with erfcf(). I should have something suitable for public consumption in a few weeks, if you can wait that long. The accuracy will be pretty good, but how it compares to CUDA’s built-in in terms of performance I have not checked yet. I do not think it will be much faster.

I would suggest simply using the erfcf() built into CUDA for now, because going through erff() will be horribly inaccurate in the tails. Doing just the subtraction in double precision will not buy you anything, BTW.

It turns out my implementation of erfcf() was further along than I recalled, so I decided to post what I have available at the moment. While my_erfcf() is considerably more accurate than CUDA 8’s built-in erfcf(), at a maximum error of 2.660 ulps vs 4.487 ulps, it is pretty much a wash performance-wise. There is some scaling to avoid underflow in intermediate computation that I would love to get rid of, but I have not had any good ideas yet for how to avoid it.

Below is performance data from my Quadro K2200 (sm_50). With -ftz=true the compiler from CUDA 8 adds a couple of extra instructions to my_erfcf() for no reason: the code should compile to an identical instruction sequence regardless of -ftz value (except for .FTZ suffixes on relevant instructions).

function calls/sec    -ftz=false  -ftz=true
-------------------------------------------
CUDA 8 erfcf()        13.4e9      13.6e9
my_erfcf()            13.9e9      13.4e9

[code below updated 12/1/2017, 12/25/2017]

/*
  Copyright (c) 2017, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute exp(a) * 2**scale. Max ulp err = 0.86565 */
__device__ float expf_scale (float a, int scale)
{
    const float flt_int_cvt = 12582912.0f; // 0x1.8p23 
    float f, r, j, t;
    int i;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, flt_int_cvt); // 0x1.715476p0 = log2(e)
    t = j - flt_int_cvt; 
    f = fmaf (t, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (t, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo 
    i = __float_as_int (j);
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r =             1.37853622e-3f;  // 0x1.696000p-10
    r = fmaf (r, f, 8.37326515e-3f); // 0x1.12600ap-7
    r = fmaf (r, f, 4.16694842e-2f); // 0x1.555b3ep-5
    r = fmaf (r, f, 1.66664705e-1f); // 0x1.55544ep-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**(i+scale) * r;
    r = __int_as_float (((i + scale) << 23) + __float_as_int (r));
    return r;
}

/*  
 * Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of 
 * (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
 * No. 153, January 1981, pp. 249-253.
 *
 * max ulp error: 2.6600177
 */  
__device__ float my_erfcf (float x)
{
    float a, d, e, m, p, q, r, s, t;

    a = fminf (fabsf (x), 10.0546875f); /* map argument to [0, 10.0546875] */

    /* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
    m = a - 2.0f;
    p = a + 2.0f;
    asm ("rcp.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(p));
    q = m * r;
    t = fmaf (q + 1.0f, -2.0f, a); 
    e = fmaf (-a, q, t); 
    q = fmaf (r, e, q); 
    
    /* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
    p =             -4.01020050e-4f;  // -0x1.a48000p-12
    p = fmaf (p, q, -1.23063149e-3f); // -0x1.429a48p-10
    p = fmaf (p, q,  1.31360395e-3f); //  0x1.585a78p-10
    p = fmaf (p, q,  8.63233395e-3f); //  0x1.1add44p-7
    p = fmaf (p, q, -8.05991795e-3f); // -0x1.081b7ep-7
    p = fmaf (p, q, -5.42047322e-2f); // -0x1.bc0b90p-5
    p = fmaf (p, q,  1.64055333e-1f); //  0x1.4ffc3ep-3
    p = fmaf (p, q, -1.66031346e-1f); // -0x1.54083ep-3
    p = fmaf (p, q, -9.27639678e-2f); // -0x1.7bf612p-4
    p = fmaf (p, q,  2.76978403e-1f); //  0x1.1ba03ap-2

    /* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
    t = a + a;
    d = fabsf (x) + a + 1.0f;  /* use fabs(x), not 'a', in case NaN argument */
    asm ("rcp.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(d));
    q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
    e = (p - q) + fmaf (q, -t, 1.0f); // residual: (p+1)-q*(1+2*a)
    r = fmaf (e, r, q);

    /* Multiply by exp(-a*a) ==> erfc(a) */
    s = a * a; 
    e = expf_scale (-s, 24); /* scale by 2**24 to avoid intermediate underflow*/
    t = fmaf (a, -a, s);
    r = fmaf (r, e, r * e * t);
    r = r * 5.9604644775390625e-8f; /* scale down by 2**24 */

    /* Handle negative arguments to erfc() */
    r = (x < 0.0f) ? (2.0f - r) : r;

    return r;
}

I managed to tweak the code slightly by modifying the argument clamping and NaN handling. The code in #18 has been updated along with the performance data.

Did the updated code version lose this reference by chance or on purpose?

/*
 * Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of 
 * (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
 * No. 153, January 1981, pp. 249-253.
 */

Anyway, thanks for sharing this code - with reference or not.