Recently I looked at some code that performs computations based on many small probabilities and uses a logarithmic representation for them. As one would expect, this representation turns multiplications into additions, which is largely performance-neutral, but it also turns additions into fairly expensive calls to a function usually called logadd(). A simple implementation that omits special-case handling might look like this:

```
float logaddf (float a, float b)
{
    float t;
    if (b > a) { /* ensure a >= b, so exp(b-a) cannot overflow */
        t = b;
        b = a;
        a = t;
    }
    t = b - a;
    return a + log1pf (expf (t));
}
```

As it turned out, the code I looked at used logadd() fairly extensively, and thus these calls made a significant contribution to kernel runtime. The most expensive part of logadd() is the call to log1pf(), where log1pf(x) computes log(1+x). So I set out to construct a fast yet accurate version of log1pf(). After some experimentation, I arrived at a useful result which I thought I would share. I am placing this under an OSI-approved two-clause BSD license, which should be compatible with any project.

I am showing two variants of the code below. Both are faithfully rounded: per the comments in the code, the maximum error compared to the mathematical result is 0.88383 ulps for the first variant and 0.88606 ulps for the second, i.e. below 1 ulp. This compares favorably with the CUDA 7.5 implementation, which has a maximum ulp error of 1.72239.

The first variant, with FASTPATH == 0, produces extremely compact code that is branchless, except for the handling of exceptional cases. This variant is most suitable when the arguments to log1pf() span the entire input domain, with a mix of large and small inputs. In this case it can be up to 18% faster than the CUDA 7.5 implementation, as measured on an sm_50 GPU. However, if the vast majority of inputs are close to zero, this variant can be up to 17% *slower* than CUDA 7.5.

The second variant, with FASTPATH == 1, adds a simple fast path for arguments close to zero. This is the same general code structure used by CUDA’s current log1pf() implementation. Due to the different kind of approximation used, and the tighter accuracy requirements, the fast path in my code applies to a somewhat narrower interval, [-0.3046875, 0.609375], than the CUDA implementation, whose fast path covers the interval [-0.394, 0.65]. According to my tests, this variant is about 10% faster than log1pf() from CUDA 7.5 for various mixes of inputs, including when all inputs are very close to zero. This was measured on an sm_50 GPU in default mode. In FTZ mode (no denormal support), the performance advantage shrinks, but remained above 4% in my testing.

[code below updated 12/13/2015 and again updated 12/26/2015]

**[code obsolete, see #14 below for the latest version dated 7/19/2016]**

```
/*
Copyright (c) 2015, Norbert Juffa
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#define FASTPATH 1
#if FASTPATH == 0
/* log1p(a) = log(a+1) = log(2**e * t) = log(2)*e + log(t). With t = m + 1,
log1p(a) = log(2)*e + log1p(m). Choose e such that m is in [-0.25, 0.5],
with s = 2**(-e) we then have m = s*(a+1) - 1 = s*a + (s - 1). The scale
factor s can be applied to a by exponent manipulation, for the computation
of (s - 1) we use an intermediate scale factor s' = 4*s to ensure this is
representable as a normalized floating-point number.
max ulp err = 0.88383
*/
__device__ float my_log1pf (float a)
{
    float m, r, s, i;
    int e;
    m = __fadd_rz (a, 1.0f);
    e = (__float_as_int (m) - 0x3f400000) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    s = __int_as_float (__float_as_int (4.0f) - e); // s' in [2**-126, 2**26]
    m = m + fmaf (0.25f, s, -1.0f);
    i = (float)e * 1.19209290e-7f; // 2**-23
    // approximate log(1+m) on [-0.25, 0.5]
    r =             -4.53482792e-2f;  // -0x1.737e3cp-5
    r = fmaf (r, m,  1.05469480e-1f); //  0x1.b000c4p-4
    r = fmaf (r, m, -1.32296383e-1f); // -0x1.0ef168p-3
    r = fmaf (r, m,  1.44914404e-1f); //  0x1.28c8e2p-3
    r = fmaf (r, m, -1.66415766e-1f); // -0x1.54d1cap-3
    r = fmaf (r, m,  1.99888632e-1f); //  0x1.995f36p-3
    r = fmaf (r, m, -2.50002027e-1f); // -0x1.000088p-2
    r = fmaf (r, m,  3.33335131e-1f); //  0x1.5555cep-2
    r = fmaf (r, m, -5.00000000e-1f); // -0x1.000000p-1
    r = r * m;
    r = fmaf (r, m, m);
    r = fmaf (i, 6.93147182e-01f, r); // log(2)
    if (!((a != 0) && (a > -1.0f) && (a < __int_as_float (0x7f800000)))) {
        r = a + a; // handle inputs of NaN, +Inf
        if (a < -1.0f)  r = __int_as_float (0x7fffffff); // NaN
        if (a == -1.0f) r = __int_as_float (0xff800000); // -Inf
    }
    return r;
}
#elif FASTPATH == 1
/* compute log (1+a), with a special fast path for arguments close to zero.
max ulp err = 0.88606
*/
__device__ float my_log1pf (float a)
{
    float m, r, s, i;
    int e;
    if ((-0.3046875f <= a) && (a <= 0.609375f)) { // fast path
        r =              3.28586996e-2f;  //  0x1.0d2db0p-5
        r = fmaf (r, a, -9.10350382e-2f); // -0x1.74e128p-4
        r = fmaf (r, a,  1.22964852e-1f); //  0x1.f7a9fep-4
        r = fmaf (r, a, -1.30156621e-1f); // -0x1.0a8f8ep-3
        r = fmaf (r, a,  1.42396331e-1f); //  0x1.23a0b0p-3
        r = fmaf (r, a, -1.66184038e-1f); // -0x1.54584cp-3
        r = fmaf (r, a,  1.99986562e-1f); //  0x1.99928ep-3
        r = fmaf (r, a, -2.50015855e-1f); // -0x1.000428p-2
        r = fmaf (r, a,  3.33333999e-1f); //  0x1.555582p-2
        r = fmaf (r, a, -4.99999851e-1f); // -0x1.fffff6p-2
        r = r * a;
        r = fmaf (r, a, a);
    } else {
        m = a + 1.0f;
        e = (__float_as_int (m) - 0x3f400000) & 0xff800000;
        m = __int_as_float (__float_as_int (a) - e);
        s = __int_as_float (__float_as_int (4.0f) - e); // s' in [2**-126, 2**26]
        m = m + fmaf (0.25f, s, -1.0f);
        i = (float)e * 1.19209290e-7f; // 2**-23
        // approximate log(1+m) on [-0.25, 0.5]
        r =             -4.53482792e-2f;  // -0x1.737e3cp-5
        r = fmaf (r, m,  1.05469480e-1f); //  0x1.b000c4p-4
        r = fmaf (r, m, -1.32296383e-1f); // -0x1.0ef168p-3
        r = fmaf (r, m,  1.44914404e-1f); //  0x1.28c8e2p-3
        r = fmaf (r, m, -1.66415766e-1f); // -0x1.54d1cap-3
        r = fmaf (r, m,  1.99888632e-1f); //  0x1.995f36p-3
        r = fmaf (r, m, -2.50002027e-1f); // -0x1.000088p-2
        r = fmaf (r, m,  3.33335131e-1f); //  0x1.5555cep-2
        r = fmaf (r, m, -5.00000000e-1f); // -0x1.000000p-1
        r = r * m;
        r = fmaf (r, m, m);
        r = fmaf (i, 6.93147182e-01f, r); // log(2)
        if (!((a > -1.0f) && (a < __int_as_float (0x7f800000)))) {
            r = a + a; // handle inputs of NaN, +Inf
            if (a < -1.0f)  r = __int_as_float (0x7fffffff); // NaN
            if (a == -1.0f) r = __int_as_float (0xff800000); // -Inf
        }
    }
    return r;
}
#endif // FASTPATH
```