Faster and more accurate implementation of log1pf()

Recently I looked at some code that was doing computations based on many small probabilities, and used logarithmic representation for these. As one would expect, using this representation changes multiplies into additions which is largely performance neutral, but it also changes additions into fairly expensive calls to a function usually called logadd(). A simple implementation that leaves out special case handling might look like this:

float logaddf (float a, float b)
{
    float t;
    if (b > a) {
        t = b;
        b = a;
        a = t;
    }
    t = b - a;
    return a + log1pf (expf (t));
}

As it turned out, the code I looked at was using logadd() fairly extensively, and thus these calls made a significant contribution to kernel runtime. The most expensive part of logadd() is the call to log1pf(), where log1pf()x) computes log(1+x). So I set out to construct a fast, yet accurate version of log1pf(). After some experimentation, I arrived at a useful result which I thought I would share. I am placing this under an OSI-approved two-clause BSD license which should be compatible with any project.

I am showing two variants of the code below. Both achieve a maximum ulp error of 0.88628 ulps compared to the mathematical result, which means they are faithfully rounded. This compares favorably with the CUDA 7.5 implementation which has a maximum ulp error of 1.72239.

The first variant, with FASTPATH == 0, produces extremely compact code that is branchless, except for the handling of exceptional cases. This variant is most suitable when the arguments to log1pf() span the entire input domain, with a mix of large and small inputs. In this case it can be up to 18% faster than the CUDA 7.5 implementation, as measured on an sm_50 GPU. However, if the vast majority of inputs are close to zero, this variant can be up to 17% slower than CUDA 7.5.

The second variant, with FASTPATH == 1, adds a simple fast path for arguments close to zero. This is the same general code structure used by CUDA’s current log1pf() implementation. Due to the different kind of approximation used, and the tighter accuracy requirements, the fast path in my code applies to a somewhat narrower interval, [0.3046875, 0.609375], compared to the CUDA implementation, whose fast path covers the interval [-0.394, 0.65]. According to my tests, this variant is about 10% faster than log1pf() from CUDA 7.5 for various mixes of inputs, including when all inputs are very close to zero. This was measured on an sm_50 GPU in default mode. In FTZ mode (no denormal support), the performance advantage shrinks, but was still above 4% during testing.

[code below updated 12/13/2015 and again updated 12/26/2015]
[code obsolete, see #14 below for the latest version dated 7/19/2016]

[s]/*
  Copyright (c) 2015, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#define FASTPATH 1

#if FASTPATH == 0

/* log1p(a) = log(a+1) = log(2**e * t) = log(2)*e + log(t). With t = m + 1,
   log1p(a) = log(2)*e + log1p(m). Choose e such that m is in [-0.25, 0.5], 
   with s = 2**(-e) we then have m = s*(a+1) - 1 = s*a + (s - 1). The scale
   factor s can be applied to a by exponent manipulation, for the computation
   of (s - 1) we use an intermediate scale factor s' = 4*s to ensure this is
   representable as a normalized floating-point number.

   max ulp err = 0.88383
*/
__device__ float my_log1pf (float a)
{
    float m, r, s, i; 
    int e;

    m = __fadd_rz (a, 1.0f);
    e = (__float_as_int (m) - 0x3f400000) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    s = __int_as_float (__float_as_int (4.0f) - e); // s' in [2**-126,2**26]
    m = m + fmaf (0.25f, s, -1.0f);
    i = (float)e * 1.19209290e-7f; // 2**-23
    // approximate log(1+m) on [-0.25, 0.5]
    r =             -4.53482792e-2f;  // -0x1.737e3cp-5
    r = fmaf (r, m,  1.05469480e-1f); //  0x1.b000c4p-4
    r = fmaf (r, m, -1.32296383e-1f); // -0x1.0ef168p-3
    r = fmaf (r, m,  1.44914404e-1f); //  0x1.28c8e2p-3
    r = fmaf (r, m, -1.66415766e-1f); // -0x1.54d1cap-3
    r = fmaf (r, m,  1.99888632e-1f); //  0x1.995f36p-3
    r = fmaf (r, m, -2.50002027e-1f); // -0x1.000088p-2
    r = fmaf (r, m,  3.33335131e-1f); //  0x1.5555cep-2
    r = fmaf (r, m, -5.00000000e-1f); // -0x1.000000p-1
    r = r * m;
    r = fmaf (r, m, m);
    r = fmaf (i, 6.93147182e-01f, r); // log(2)
    if (!((a != 0) && (a > -1.0f) && (a < __int_as_float (0x7f800000)))) {
        r = a + a;  // handle inputs of NaN, +Inf
        if (a <  -1.0f) r = __int_as_float (0x7fffffff); //  NaN
        if (a == -1.0f) r = __int_as_float (0xff800000); // -Inf
    }
    return r;
}

#elif FASTPATH == 1

/* compute log (1+a), with a special fast path for arguments close to zero.
   max ulp err = 0.88606
*/
__device__ float my_log1pf (float a)
{
    float m, r, s, i; 
    int e;

    if ((-0.3046875f <= a) && (a <= 0.609375f)) {
        r =              3.28586996e-2f;  //  0x1.0d2db0p-5
        r = fmaf (r, a, -9.10350382e-2f); // -0x1.74e128p-4
        r = fmaf (r, a,  1.22964852e-1f); //  0x1.f7a9fep-4
        r = fmaf (r, a, -1.30156621e-1f); // -0x1.0a8f8ep-3
        r = fmaf (r, a,  1.42396331e-1f); //  0x1.23a0b0p-3
        r = fmaf (r, a, -1.66184038e-1f); // -0x1.54584cp-3
        r = fmaf (r, a,  1.99986562e-1f); //  0x1.99928ep-3
        r = fmaf (r, a, -2.50015855e-1f); // -0x1.000428p-2
        r = fmaf (r, a,  3.33333999e-1f); //  0x1.555582p-2
        r = fmaf (r, a, -4.99999851e-1f); // -0x1.fffff6p-2
        r = r * a;
        r = fmaf (r, a, a);
    } else {
        m = a + 1.0f;
        e = (__float_as_int (m) - 0x3f400000) & 0xff800000;
        m = __int_as_float (__float_as_int (a) - e);
        s = __int_as_float (__float_as_int (4.0f) - e); // s' in [2**-126,2**26]
        m = m + fmaf (0.25f, s, -1.0f);
        i = (float)e * 1.19209290e-7f; // 2**-23
        // approximate log(1+m) on [-0.25, 0.5]
        r =             -4.53482792e-2f;  // -0x1.737e3cp-5
        r = fmaf (r, m,  1.05469480e-1f); //  0x1.b000c4p-4
        r = fmaf (r, m, -1.32296383e-1f); // -0x1.0ef168p-3
        r = fmaf (r, m,  1.44914404e-1f); //  0x1.28c8e2p-3
        r = fmaf (r, m, -1.66415766e-1f); // -0x1.54d1cap-3
        r = fmaf (r, m,  1.99888632e-1f); //  0x1.995f36p-3
        r = fmaf (r, m, -2.50002027e-1f); // -0x1.000088p-2
        r = fmaf (r, m,  3.33335131e-1f); //  0x1.5555cep-2
        r = fmaf (r, m, -5.00000000e-1f); // -0x1.000000p-1
        r = r * m;
        r = fmaf (r, m, m);
        r = fmaf (i, 6.93147182e-01f, r); // log(2)
        if (!((a > -1.0f) && (a < __int_as_float (0x7f800000)))) { // +INF
            r = a + a;  // handle inputs of NaN, +Inf
            if (a <  -1.0f) r = __int_as_float (0x7fffffff); //  NaN
            if (a == -1.0f) r = __int_as_float (0xff800000); // -Inf
        }
    }
    return r;
}[/s]
#endif // FASTPATH

Norbert, as always, your microfunction optimization reports are always appreciated and interesting!
I have some general curious questions about your fitting strategy (not specifically about this function, but in general).

I assume you’re taking the range and starting with Chebychev or maybe even a full minimax polynomial fit to reduce absolute error. Probably using Maple or Mathematica?

Do you find the ULP error by iterating over every possible float in the range? Exhaustive test evaluation is easy for floats, but impractical for doubles, so how do you find the ULP for doubles?

After that fit, do you ever tweak or optimize the coefficients to minimize ACTUAL error when evaluated with IEEE rounding and precision loss? I could imagine a greedy final process that literally iterates combinations the low bits of the polynomial coefficients hoping to find a slightly better absolute and/or RMS ULP error. This is getting down to nano-optimization, but for library functions, that’s not inappropriate.

Any worries about using a float like “1.99888676e-1f” in the source code? You could alternatively use a hexadecimal coded float to make sure the compiler interprets every bit as you desire. This is part of the C (but not C++) standard, but all major C++ compilers support it without warning.

The polynomial approximations are minimax approximation whose initial versions I generated with my own software that is based on the Remez algorithm. I refined the initial versions further for FMA-based evaluation in finite arithmetic using my own heuristics, which are admittedly not very good, so it takes a long time to refine, typically a couple of days. Tor Myklebust has a draft paper available on a much superior heuristic: http://arxiv.org/abs/1508.03211

Exhaustive testing for single-input single-precision functions is eminently feasible. For multi-input single-precision functions or double-precision functions, testing cannot be exhaustive at current processor speeds, but sampling with several hundreds of millions to a few billions of random test vectors targeted at relevant intervals should pinpoint the maximum error fairly accurately. One way of assessing the accuracy of the sampling process is to use equivalent sampling for single-precision implementations where reference results from exhaustive testing are available for comparison.

I am familiar with C99-style hexadecimal floating-point literals and the potential risk of using decimal floating-point literals due to pesky bugs in most decimal-to-binary conversion routines (http://www.exploringbinary.com/). On the other hand, I will point out that I have not actually encountered any such bugs when using decimal literal constants for math functions in the past 12 years.

Unfortunately, hexadecimal floating-point literals are not supported in standard C++, including C++11, unless I missed it somehow. Therefore, they are not supported by the CUDA compiler, at least last I checked. I am reasonably sure that my Intel compiler (from 2013) does not accept hexadecimal floating-point literals unless I specify /Qstd=c99, i.e. C compilation instead of C++ compilation. My (older) version of MSVC also does not accept them. So I assume “major C++ compilers” refers to gcc specifically?

I utilize hexadecimal floating-point literals extensively throughout my work and only convert them to decimal floating-point literals for the final CUDA version which then gets re-tested to make sure it delivers identical results.

To give an idea who much of a difference the tweaking of coefficients can make for the accuracy, here is the starting point for the fast-path polynomial on the interval [-39/128, +78/128].

I am employing the Remez algorithm in conjunction with an arbitrary-precision floating-point library, in my case R. P. Brents MP library configured for around 1000 bits of precision. The Remez process delivers a mathematical approximation with coefficients that are also represented with that precision. As we need to operate on IEEE-754 single-precision coefficients instead, for the initial approximation one can then round each of these coefficients to ‘float’:

c0 =  3.31721115e-2f;
c1 = -9.09794979e-2f;
c2 =  1.22401768e-1f;
c3 = -1.29876995e-1f;
c4 =  1.42439444e-1f;
c5 = -1.66227370e-1f;
c6 =  1.99987683e-1f;
c7 = -2.50013837e-1f;
c8 =  3.33333915e-1f;
c9 = -4.99999879e-1f;
c10=  9.99999998e-1f;

Evaluating the polynomial approximation with these coefficients using fmaf(), a maximum error of 1.53667 ulps is observed across the approximation interval. The steps taking during the tweaking process adjust the coefficients as follows:

(1) The most significant coefficient is forced to 1. This results in an approximation of the form p(x) = x + R(x), which is known to have superior numerical properties, particularly when evaluated using FMA operations.

(2) The coefficients are forced to be machine numbers, i.e. exactly representable in single precision.

(3) The coefficients are tweaked taking into account that evaluation of the polynomial uses FMA, which provides the advantage of guarding against subtractive cancellation.

The tweaking heuristic I use is basically a crude, unsophisticated form of simulated annealing. With sufficient patience (many iterations, a couple of days of runtime) it delivers pretty good results. Tor Myklebust’s heuristic by comparison is very sophisticated, therefore much faster (it delivers results in minutes instead of days), and often produces significantly more accurate approximations.

Ah, yes, of course you have your own bespoke minimax toolkit. You humbly praise Tor Myklebust’s tuning method, but your own obviously works and computers are patient, so it doesn’t matter much if it takes a day or even a week!

What is the optimization criteria? Lowest possible worst-case ULP, and if you reduce that to under 1, you reduce RMS residual ULP instead? Perhaps you choose the interpolating polynomial order to get the worst-case to under 1, then just use RMS as your sole tuning parameter? You report here JUST the worst case ULP, though, so maybe you never use RMS?

As part of the tuning, especially in CUDA, do you ever iterate through each FMA’s rounding mode options? I could imagine that it’s an extra few bits of freedom to try to get some fortuitous cancellation. This is likely at atto-optimization level, but I guess so is adjusting the low bits of the coefficients.

A lot of the art must also be in deciding in the best algorithm based on use cases. In this function, the FASTPATH choice is significant. Likely there’s always a family of range reduction and test choices to make. One issue especially in CUDA is branching, since a divergent warp is so expensive if threads in the warp have different argument ranges. In your FASTPATH code, you said you tested with a “mix of inputs” though so you’re clearly optimizing this too. In your example you have existing user code so you can fortunately tune for your user’s exact application, yet still make general routines.

Thanks again for the code and the explanation!

For those that want to generate their own approximations I would suggest looking at Sollya (http://sollya.gforge.inria.fr/) especially if they don’t have access to expensive tools like Maple and Mathematica. I am using my own software primarily because I have been using it for 25 years and it has worked well enough for me. From what I have seen of Sollya, it can generate better initial approximations with its fpminimax command.

I can’t stress enough the advantages of Tor Myklebust’s approach. I actually did an apples-to-apples comparison using an approximation to atan() on [0,1], which is the problem I initially posted to StackOverflow. With my crude heuristic, it took 750 CPU-hours to generate a solution with the same ulp error his heuristic produced in 90 seconds. That is the difference between deep mathematical thought applied to the issue versus an engineering approach relying predominantly on brute force.

In his paper Tor gives an even better result for this atan() approximation, which nonetheless took only minutes to generate. In the practical context of implementing math functions, waiting weeks for a single polynomial approximation just does not fly, while waiting minutes is certainly OK.

I have not used RMS as a quality criterion in some twenty years. My primary optimization goal is minimizing ulp error. The secondary optimization goal is to minimize the number of incorrectly rounded results. In the final stages of optimization lowering ulp error typically slightly increase the number of incorrectly rounded results.

I like to think about this effect intuitively by considering the graph of relative error, that is (approx-func)/func. As one pushes down on the bulges of that graph to minimize the maximum error, the bulges become slightly wider, and the area under the curve is roughly proportional to the number of incorrectly rounded results.

I have looked in the past at playing with the rounding modes of FMAs in a Horner-style polynomial evaluation and found that it never helped. My explanation for this is that the error introduced by changing the rounding mode is typically larger than a 1 ulp change to the next less significant coefficient, so it is better to tweak the coefficients.

You are correct that choosing the overall code structure for a math function is a difficult design problem. There are preferences based on architecture, e.g. on GPUs we would in general prefer branchless code. We would also like to minimize double-precision operations. But there are also preferences based on use cases, such as knowing that log1p() was specifically introduced to deliver accurate results for log(x) near unity, so in many use cases we can expect the vast majority of inputs to log1p() to be near zero. Obviously, no one single approach is going to deliver optimal performance for all use cases, so judgment calls need to be made.

One of the computations performed hundred of millions times in my simulation work is expf(-a*b).
My assumption is that this call is expensive and I do generally use the --use_fast_math flag during compilation.

https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations

Are you aware of any other options for this computation?

I am afraid that expf() in CUDA is already as fast as it can be, especially when you use --use_fast_math. It basically falls back to MUFU.EX2, and this approach already sacrifices accuracy for performance by design, since the speed of exponentiation is known to be critical for many different use cases. I could tell you how to construct a more accurate version of expf() that is faithfully rounded :-)

Given that you already use --use_fast_math, I have serious doubts this computation can be sped up further, even with severe restrictions placed on the input domain of expf(), or further sacrifices to accuracy. The throughput of expf() when compiled with --use_fast_math should be several hundred billion function calls per second on your Titan X. I would guess around 300e9 function calls per second.

If you could utilize exp2f() instead of expf() without adding computation elsewhere, that may give you a small speedup.

To give an idea of the trade-offs affecting expf(), I ran some tests on my Quadro K2200, an sm_50 GPU essentially equivalent to a GTX 750 Ti with larger memory. I used a mix of inputs spanning almost the entire non-overflowing, non-underflowing input domain for the performance measurements. Accuracy tests were performed exhaustively.

With --use_fast_math, expf() compiles to a three-instruction sequence. The throughput is approximately 150e9 function calls per second; my framework is not very accurate when measuring such very short code sequences. The maximum ulp error is 63.78065.

With --ftz=true, the throughput of expf() is 39.2e9 function calls per second, the maximum ulp error is 2.31917. With --ftz=false, the throughput of expf() is 31.8e9 function calls per second, the maximum ulp error is also 2.31917.

My efficient, faithfully rounded, implementation below that cannot use the hardware’s MUFU.EX2 instruction for accuracy reasons clocks in at 25.6e9 function calls per second, while limiting maximum error to 0.88250 ulps. As can be seen from the code, a number of instructions are needed for the scaling required to produce the final result. This is where a scaling instruction provided by the hardware would come in handy. As far as I know both AVX-512 and AMD GPUs provide such an instruction.

/*
  Copyright (c) 2015, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* for a in [0.5, 4), compute a * 2**i, -250 < i < 250 */
__device__ __forceinline__ float fast_ldexpf (float a, int i)
{
    int ia = (i << 23) + __float_as_int (a); // scale by 2**i
    a = __int_as_float (ia);
    if ((unsigned int)(i + 125) > 250u) { // |a| > 125
        i = (i ^ (125 << 23)) - i;    // ((i < 0) ? -125 : 125) << 23
        a = __int_as_float (ia - i);
        a = a * __int_as_float ((127 << 23) + i);
    }
    return a;
}

/* Compute exponential base e. Maximum ulp error = 0.88250 */
__device__ float my_expf (float a)
{
    float f, r;
    int i;

    // exp(a) = exp(i + f); i = rintf (a / log(2))
    r = fmaf (1.44269502e+00f, a, 1.25829120e+07f) - 1.25829120e+07f;
    i = (int)r;
    f = fmaf (r, -6.93145752e-01f, a); // log_2_hi
    f = fmaf (r, -1.42860677e-06f, f); // log_2_lo
    // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
    r =             1.38367829e-03f;
    r = fmaf (r, f, 8.37474316e-03f);
    r = fmaf (r, f, 4.16684374e-02f);
    r = fmaf (r, f, 1.66664585e-01f);
    r = fmaf (r, f, 4.99999911e-01f);
    r = fmaf (r, f, 1.00000000e+00f);
    r = fmaf (r, f, 1.00000000e+00f);
    // exp(a) = 2**i * exp(f);
    r = fast_ldexpf (r, i);
    if (!(fabsf (a) < 104.f)) {
        r = a + a;
        if (a < 0.0f) r = 0.0f;
        if (a > 0.0f) r = __int_as_float (0x7f800000); // + INF
    }
    return r;
}

Wow, Thanks!

With these simulations there is a balance between performance and accuracy, so I need to ask the scientists which implementation they would prefer.

Surprised at the large difference in maximum ulp error considering the fast_math qualifier. Big enough to make a difference in output results.

The fast math implementation of __expf() is affected by the well-known issue of error magnification that exists for all exponential functions including pow(). The relative and ulp error grows with the magnitude of the argument, see the relevant appendix of the CUDA Programming Guide for details (look for __expf(), which is the fast math implementation of expf()).

Some while ago, I outlined the reason for the error magnification inherent in exponentation in this Stackoverflow answer: http://stackoverflow.com/a/29869677/780717. The original error that is being magnified in the case of __expf() stems form the scaling of the argument to exp2() when used to compute exp().

High-quality implementations of these functions take steps to counter-act the error magnification issue, but this comes at the cost of additional instructions and thus increased execution time.

I have updated the code in my original post with a new version that changes the handling of special cases into a conditional overwrite of the result. This provides a nice incremental performance boost, partially because use of this code idiom increases instruction level parallelism.

I have updated the code in my original posting with new polynomial approximations that provide slightly better error bounds.

Making use of the lessons I learned from recent improvements to my logf() implementation, I was able to create a compact log1pf() implementation that is faster than CUDA 7.5 for any mix of inputs, and offers superior accuracy at the same time. On a Quadro K2200 I measure a throughput of 22.6e9 function calls per second, versus 20.6e9 function calls for CUDA 7.5 for the most favorable mix of inputs (all arguments near zero). For a general mix of arguments CUDA 7.5 performs at 14.9e9 function calls per second. The maximum error of my implementation is 0.87391 ulp, vs 1.72239 ulp for CUDA 7.5.

[code below updated on 9/30/2016, 12/10/2016, 1/25/2017, 8/25/2020]

/*
  Copyright (c) 2015-2020, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* log1p(a) = log(a+1) = log(2**e * t) = log(2)*e + log(t). With t = m + 1,
   log1p(a) = log(2)*e + log1p(m). Choose e such that m is in [-0.25, 0.5], 
   with s = 2**(-e) we then have m = s*(a+1) - 1 = s*a + (s - 1). Instead 
   of using s directly, an intermediate scale factor s' = 4*s is utilized 
   to ensure this is representable as a normalized floating-point number.

   max ulp err = 0.87391
*/
__device__ float my_log1pf (float a)
{
    float m, r, s, t, u; 
    int e;

    u = a + 1.0f;
    e = (__float_as_int (u) - __float_as_int (0.75f)) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    s = __int_as_float (__float_as_int (4.0f) - e); // s' in [2**-126,2**26]
    m = m + fmaf (0.25f, s, -1.0f);
    // approximate log(1+m) on [-0.25, 0.5]
    s = m * m;
    r =             -4.53948975e-2f;  // -0x1.73e000p-5
    t =              1.05468750e-1f;  //  0x1.b00000p-4
    r = fmaf (r, s, -1.32274166e-1f); // -0x1.0ee5c2p-3
    t = fmaf (t, s,  1.44911751e-1f); //  0x1.28c77ep-3
    r = fmaf (r, s, -1.66412786e-1f); // -0x1.54d03ap-3
    t = fmaf (t, s,  1.99887156e-1f); //  0x1.995e70p-3
    r = fmaf (r, s, -2.50002742e-1f); // -0x1.0000b8p-2
    r = fmaf (t, m, r);
    r = fmaf (r, m,  3.33335280e-1f); //  0x1.5555d8p-2
    r = fmaf (r, m, -4.99999970e-1f); // -0x1.fffffep-2
    r = fmaf (r, s, m);
    r = fmaf ((float)e, 0.693147182f * 1.1920929e-7f, r); // log(2) * 0x1.0p-23
    if (!((a != 0.0f) && (u > 0.0f) && (a < __int_as_float (0x7f800000)))) {
        asm ("lg2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(u));
        r = __fadd_rd (r, a); // handle negative zero
    }
    return r;
}

I have further improved the accuracy of the core approximation used by the code in #14, and have reduced the least significant coefficient to half precision, so it can be incorporated into the SASS instruction as a floating-point literal, which should improve energy efficiency.

Switched the core approximation of the code in #14 from a partial Estrin scheme to a second-order Horner scheme, which provides more instruction-level parallelism and allows the use of a second half-precision coefficient for reduced energy use. Performance improved by a couple of percent (as measured on a Maxwell-based Quadro K2200), the maximum error was further reduced to 0.87454 ulp.

[Later:] I finally got around to installing CUDA 8.0, and now see the following performance on the Quadro K2200:

CUDA 8.0 built-in log1pf: 21.9e9 function calls / second
Latest version my_log1pf: 23.3e9 function calls / second

That’s about 6% faster. Max error is 0.88626 ulp vs 0.87454 ulp.