Faster and more accurate implementation of log2f()

So I got carried away a bit :-) There are probably some additional improvements to be had here, in terms of both performance and accuracy, but I figured I’d wrap up this logarithm business for now.

The implementation of the binary logarithm below is faithfully rounded, with a maximum error of 0.91874 ulps, while CUDA’s current implementation has a maximum error of 1.91494 ulps.

The performance is improved vs CUDA 7.5. On my Quadro K2200 I measured log2f() throughput at 14.8 billion function evaluations per second in default mode, and at 16.0 billion function evaluations per second with -ftz=true. my_log2f() clocks in at 17.6 billion function evaluations per second.

[Code below updated 3/6/2019]

/*
  Copyright (c) 2015-2019, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute logarithm base 2. max ulp err = 0.91874 */
__device__ float my_log2f (float a)
{
    float m, r, i;
    int e;

    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - 0x3f3504f3) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
    m = m - 1.0f;
    /* Compute log2(1+m) for m in [sqrt(0.5)-1, sqrt(2.0)-1] */
    r =              9.69543457e-2f;  //  0x1.8d2000p-4
    r = fmaf (r, m, -1.68501154e-1f); // -0x1.591722p-3
    r = fmaf (r, m,  1.71666995e-1f); //  0x1.5f92f2p-3
    r = fmaf (r, m, -1.78955182e-1f); // -0x1.6e800ep-3
    r = fmaf (r, m,  2.05116019e-1f); //  0x1.a413dep-3
    r = fmaf (r, m, -2.40456969e-1f); // -0x1.ec74b4p-3
    r = fmaf (r, m,  2.88567603e-1f); //  0x1.277e44p-2
    r = fmaf (r, m, -3.60675812e-1f); // -0x1.715500p-2
    r = fmaf (r, m,  4.80898589e-1f); //  0x1.ec70aep-2
    r = fmaf (r, m, -7.21347451e-1f); // -0x1.715474p-1
    r = r * m;
    r = r * m;
    r = fmaf (m, 1.44269502e+00f, r); //  0x1.715476p+0  // log2(e)
    r = r + i;
    /* Check for and handle special cases */
    if (!((a > 0.0f) && (a < __int_as_float (0x7f800000)))) { // +INF
        asm ("lg2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    }
    return r;
}

The following variant is about 6% faster than the implementation in #1, with a minimal drop in accuracy as a trade-off (but also faithfully rounded):

/*
  Copyright (c) 2019, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute logarithm base 2. max ulp err = 0.94404 */
__device__ float my_log2f (float a)
{
    float m, r, i;
    int e;

    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - 0x3f3504f3) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
    m = m - 1.0f;
    /* Compute log2(1+m) for m in [sqrt(0.5)-1, sqrt(2.0)-1] */
    r =             -1.09985352e-1f;  // -0x1.c28000p-4
    r = fmaf (r, m,  1.86182275e-1f); //  0x1.7d4d22p-3
    r = fmaf (r, m, -1.91066533e-1f); // -0x1.874de4p-3
    r = fmaf (r, m,  2.04593703e-1f); //  0x1.a30206p-3
    r = fmaf (r, m, -2.39627063e-1f); // -0x1.eac198p-3
    r = fmaf (r, m,  2.88573444e-1f); //  0x1.277fccp-2 
    r = fmaf (r, m, -3.60695332e-1f); // -0x1.715a1ep-2
    r = fmaf (r, m,  4.80897635e-1f); //  0x1.ec706ep-2
    r = fmaf (r, m, -7.21347392e-1f); // -0x1.715472p-1
    r = fmaf (r, m,  4.42695051e-1f); //  0x1.c551dap-2
    r = fmaf (r, m, m);
    r = r + i;
    /* Check for and handle special cases */
    if (!((a > 0.0f) && (a < __int_as_float (0x7f800000)))) { // +INF
        asm ("lg2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    }
    return r;
}

I can always use a faster log2f(). :-)
In the next weeks I will start writing documentation on some work I’m currently doing. How do you prefer to be referenced in the acknowledgements? Is there any homepage or profile of yours to point to?

Note that this thread is about fast&accurate log2f(). If you just need fast, and don’t need very accurate, the intrinsic device function __log2f() should provide a massive speedup, and if your use case can tolerate flushing of subnormal floating-point numbers to zero, throw in compilation with -ftz=true for good measure. If you need other fast math as well, try the catch-all switch --use_fast_math.

Best I can tell from looking at disassembled code, CUDA now uses something close to the code in #1.

I normally need precision up to the 5th or 6th decimal point. I will write prepare a test unit based on data similar to what I use at work and put your function to work and see. Danke.

In that case, the code in #2 may indeed be of use, assuming your code is compute-bound and dominated by calls to log2f().

As for potential ackowledgement, you can put down my name as “N. Juffa”. I have no affiliation or website. These days, internet sources are widely accepted in scientific writing as far as I know, so if you wind up using the code above, you could simply use a link to this thread plus access date for the reference.