On the utility of SFU instructions for half-precision math functions

I was toying with the idea how to use the SFU instructions for half-precision math functions of good accuracy (with a 11-bit mantissa, you can’t afford many ulps of error). By performing all intermediate computation in single-precision, one can side-step issues with underflow and overflow in intermediate computation, and the known numerical issues with use of the SFU instructions seem to mostly disappear once the final result is converted to half precision. The SFU instructions also provide decent special case handling, which saves instructions for handling those.

Below is one example, a half-precision version of sincos(). Note that the __fp16 type used is my own ‘typedef’; use the half-precision type name of your choice there. One nasty surprise was that there seems to be a consistent bias in either the argument reduction instruction RRO.SINCOS, or the MUFU.SIN and MUFU.COS instructions themselves (maybe due to an internal computation that is truncating instead of rounding?). So I had to add a bias-adjustment to the reduced argument for accurate results, which unfortunately currently adds three instructions.

Two variants are shown, with the more accurate one providing results that differ by at most 2 ulps from the correctly rounded results, while the less accurate one provides results that differ by at most 4 ulps from correctly rounded results, the worst case errors for the latter occur at:

[sin] error: arg= 1.06500000e+3 6429   res=-9.06586647e-5 85f1   ref=-9.04330600e-5 85ed
[sin] error: arg=-1.06500000e+3 e429   res= 9.06586647e-5 05f1   ref= 9.04330600e-5 05ed
/*
  Copyright (c) 2016, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
__forceinline__ __device__ float raw_sin (float a)
{
    float r;
    asm ("sin.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float raw_cos (float a)
{
    float r;
    asm ("cos.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float copysignf_pos (float a, float b)
{
    float r;
    r = __int_as_float (__float_as_int (a) | (__float_as_int (b) & 0x80000000));
    return r;
}

/* Use 'sf' suffix as per proposal N2016 in ISO/IEC JTC1 SC22 WG14 */
__device__ void my_sincossf (__fp16 ah, __fp16 *sp, __fp16 *cp)
{
    float a, i, j, r, s, c;
    const float BIAS_ADJ = 8.8e-8f;

    a = __half2float (ah);

#if MORE_ACCURATE // maximal difference from correctly rounded result: 2 ulps   
    i = fmaf (a, 0.636619747f/2, 12582912.0f); // 1/pi, 0x1.8p+23
    j = i - 12582912.0f; // 0x1.8p+23
    r = fmaf (j, -3.14159203e+0f, a); // -0x1.921fb0p+01 // pi_high
    r = fmaf (j, -6.27832947e-7f, r); // -0x1.5110b4p-21 // pi_low
    r = r + copysignf_pos (BIAS_ADJ, r); // correct bias of trig func intrinsics
    c = raw_cos (r);
    s = raw_sin (r);
    if (__float_as_int (i) & 1) {
        s = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        c = 0.0f - c; // don't change "sign" of NaNs or create negative zeros
    }
#else // maximal difference from correctly rounded result: 3 ulps 
    i = fmaf (a, 0.636619747f/4, 12582912.0f); // 1/(2*pi), 0x1.8p+23
    j = i - 12582912.0f; // 0x1.8p+23
    r = fmaf (j, -6.28318405e+0f, a); // -0x1.921fb0p+02 // 2pi_high
    r = fmaf (j, -1.25566589e-6f, r); // -0x1.5110b4p-20 // 2pi_low
    r = r + copysignf_pos (BIAS_ADJ, r); // correct bias of trig func intrinsics
    c = raw_cos (r);
    s = raw_sin (r);
#endif

    *sp = __float2half_rn (s);
    *cp = __float2half_rn (c);
}

Thanks as always for your careful tests and expositions. What is performance of this as compared to your fp32 FMA based sincosf() and CUDA’s own full sin/cos speed? Here of course performance would depend on whether inputs were fp16 or fp32 to begin with. And I could imagine this routine here as-is could have a variant that returned the fp32 approximation before downsampling back to half fp16 for code that could use the fp32 directly.

Just brainstorming, but perhaps there could also be an intermediate sin/cos routine that computes low precision s,c with the SFU instructions, then uses a linear or quadratic correction on each. Most of the SFU’s deviation is from quantization of the table. There’s a step-wise “dx” for any input value relating its deviation from the “central” x value the SFU uses in its table, and that dx is just the low bits of the input argument the SFU table is ignoring. The deviation of the SFU sine approximation for example, is likely very similar to dxcos(x) and similarly deviation of cosine is likely similar to -dxsin(x). So the pairwise low-precision approximate values could help “correct” each other. Likely this wouldn’t get you down to small ULP, but it might reduce the SFU error from ~10 bits to ~5 bits (I’m totally guessing there). Performance would give some intermediate in speed/quality tradeoff between the raw SFU evaluation and your excellent 1 ULP sincosf FMA routine.

Of course then you have to ask if there’s any NEED for an approximate sin/cos pair with such intermediate error. Generating gaussian PRNG samples by Box Mueller might be a good example since that needs a sin/cos sample but those values are quite error-tolerant as it’s convolved with a very very wide kernel afterwards.

Right now I am just trying out some basic ideas, in particular use of SFU vs native FP16 computation. I don’t have a platform with full FP16 throughput where performance measurements would make sense. Comparing my optimized FP32 sincosf() with the more accurate variant of FP16 sincossf(), I count 30 instructions for the fast path versus 18. Considering that MUFU instructions have lower throughput, this may represent a speedup of 1.5x to 1.6x.

Other than for FP16 denormals, the quantization effect in MUFU.SIN (from the internal fixed-point computation) does not seem to have much of an effect on the code above. I am a bit bummed out by the bias issue though, and it seems expensive to correct no matter how one goes about it. You are completely correct that there are multiple ways in which one can attempt a correction, but I don’t feel like spending any more time on that at the moment.

Implementing FP16 math functions via SFU instructions has some interesting consequences, e.g. one can compute tanh() with excellent accuracy in fewer instructions than sincos()!

I was surprised that I could not find any investigations into FP16 math functions on the internet, given how FP16 is being hyped. If someone has good pointers, I would love to see them. I am personally very skeptical about doing computations in FP16 (as opposed to using FP16 strictly for storage, which makes a lot of sense to me), but it never hurts to do a bit of investigative tinkering.

Since I already mentioned an efficient half-precision implementation of tanh(), I might as well post it. I have two variants each of which compiles to 10 instructions for an sm_50 target. The maximum deviation from correctly rounded half-precision results is 1 ulp, and the vast majority of results are in fact correctly rounded. This goes to show that SFU instructions combined with straightforward mathematical formulas can deliver efficient and accurate results for half-precision math functions.

/*
  Copyright (c) 2016, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
__forceinline__ __device__ float raw_ex2 (float a)
{
    float r;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float raw_rcp (float a)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float copysignf_pos (float a, float b)
{
    float r;
    r = __int_as_float (__float_as_int (a) | (__float_as_int (b) & 0x80000000));
    return r;
}

__device__ __fp16 my_tanhsf (__fp16 ah)
{
    float a, e, r, t, d;
    const float L2E = 1.442695041f;

    a = __half2float (ah);
#if VERSION == 0
    t = L2E * 2.0f * a;
    t = __int_as_float (__float_as_int (t) | 0x80000000); // force negative
    e = raw_ex2 (t);
    d = e + 1.0f;
    r = raw_rcp (d);
    r = fmaf (e, -r, r);
    r = copysignf_pos (r, a);
#elif VERSION == 1
    t = -L2E * 2.0f * fabsf (a);
    e = raw_ex2 (t);
    d = e + 1.0f;
    r = raw_rcp (d);
    r = fmaf (e, -r, r);
    r = copysignf_pos (r, a);
#else
#error unsupported VERSION
#endif   
    return __float2half_rn (r);
}

The generated sm_50 SASS should look like this:

version 0
---------
1              F2F.F32.F16 R4, R2;
2              FMUL32I R5, R4, 2.885390043258667;
3              LOP32I.OR R5, R5, 0x80000000;
4              RRO.EX2 R8, R5;
5              MUFU.EX2 R5, R8;
6              FADD R6, R5, 1;
7              MUFU.RCP R6, R6;
8              FFMA R2, R5, -R6, R6;
9              LOP3.LUT R0, R2, c[0x2][0x0], R4, 0xf8;
10             F2F.F16.F32 R4, R0;      

version 1
---------
1              F2F.F32.F16 R4, R2;
2              FADD R5, |R4|, -RZ;
3              FMUL32I R5, R5, -2.885390043258667;
4              RRO.EX2 R8, R5;
5              MUFU.EX2 R5, R8;
6              FADD R6, R5, 1;
7              MUFU.RCP R6, R6;
8              FFMA R2, R5, -R6, R6;
9              LOP3.LUT R0, R2, c[0x2][0x0], R4, 0xf8;
10             F2F.F16.F32 R4, R0;

The SFU in modern Intel x86 chips seems similar, with around 12 bits of precision.
These take only a couple clocks, as compared to a full 1ULP evaluation taking on the order of 30 clocks.

But, interestingly, Agner Fog posted just a few weeks ago about Intel’s new x86 Knight’s Landing chip’s SFU. He found that its evaluations for sqrt(), rsqrt(), log2, and exp2 were all good all the way to 1 ULP for floats. His instruction tables (on his extensive site) haven’t been updated with Knights Landing timings yet, but he says these SFU evaluations are “just a few clock cycles.”

Agner Fog writes
The Knights Landing has an instruction set extension, AVX512ER, with some quite impressive math instructions. It can calculate a reciprocal, a reciprocal square root, and even an exponential function, on a vector of 16 floats in just a few clock cycles. The manual has a somewhat confusing description of the accuracy of these instructions. My measurements showed that these instructions are accurate to the last bit on single precision floats, while they give only approximate results for double precision.

I first became aware of Intel’s new SFU about three years ago, I think it was just a specification for AVX512 at the time, no shipping hardware yet. I don’t recall all functions being below 1 ulp though. Let me see whether I can find that specification again.

In retrospect, from the perspective of general purpose computing, some of the functions in NVIDIA’s SFU should have been designed with slightly smaller error. But I guess when the SFU was designed the focus was still exclusively graphics, and there is some advantage to not changing the SFU implementation ever since it shipped a decade ago.

As one example, the relatively large error of MUFU.EX2 has a negative impact on several single-precision math functions in CUDA which use that instructions. Intel’s SFU provides a more accurate exp2 implementation. I have to assume Intel included CUDA in their competitive analysis and found where they should improve their design versus NVIDIA’s.

[Later:] Right now I can only find the announcement by James Reinders from 2013 where the ERI (“exponential and reciprocal”) extension was first mentioned: https://software.intel.com/en-us/blogs/2013/avx-512-instructions.

Probably there are certain SFU qualities that are good bang/buck tradeoff points. For example an SFU table for say rsqrt() that’s just good enough such that one Newton refinement step brings all values to 1ULP. I wonder if the SFU behavior in NV GPUs has changed much if at all since GT200.

Ah, here’s more information about AVX-512 SFU features. The SFU approximations are 28 bits (more than float’s 24 bits of mantissa!), enough that the error is not only less than 1ULP, but is correctly rounded. And it handles exceptions!

It looks like there are also new operations just for interpolating small tables, which could be really interesting for implementing other transendental functions in software at various precisions.

As far as I am aware, there have been no functional changes to the GPU’s SFU since G80, the first CUDA enabled GPU. At one point the SASS instructions were renamed from SFU.* to MUFU.*, but the results for the various instructions have remained bitwise identical.

There are various conflicting statements about the accuracy of Intel’s SFU instructions to be found on the internet, including Intel documents, so I would consider the available data unreliable at present. The sources do seem to be consistent enough to let one draw the conclusion that the exp2 implementation is faithfully rounded.

Among the more useful instructions that Intel added are instructions that are basically the hardware equivalents of ldexpf() and frexpf(). I believe AMD GPUs also provide dedicated instructions for this, they can save a bunch of instructions in all kind of math functions. I would hope that NVIDIA gets around to implementing equivalent functionality at some point, but the performance pressure may not be there yet to justify the expenditure.

I recently came across an interesting little paper that suggests computing the error function via the hyperbolic tangent. Since these are both sigmoid functions, this can be accomplished by a straightforward polynomial transformation of the arguments, at least for low precision.

John D. Vedder, “Simple approximations for the error function and its inverse.” American Journal of Physics, Vol. 55, No. 8, August 1987, p. 762

This approach turns out to work well for half precision using the code for half-precision tanh() and associated helper functions I posted in #4 above. The resulting half-precision erf() implementation delivers results that deviate by at most 1 ulp from correctly rounded reference results. How this code compares performance-wise against CUDA’s built-in implementation I haven’t checked. I would expect it to be reasonably competitive.

/*
  Copyright (c) 2019, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Compute erf() via tanh(). This approach is suggested by the following paper:
   J. D. Vedder, "Simple approximations for the error function and its inverse",
   American Journal of Physics, Vol. 55, No. 8, August 1987, p. 762
*/
__device__ __fp16 my_erfsf (__fp16 ah)
{
    float a, e, r, t, d;
    const float L2E = 1.442695041f;

    a = __half2float (ah);

    // transform erf argument into tanh argument
    a = fmaf (0.101953857f, a * a, 1.12837899f) * a;

    // compute tanh of transformed argument
    t = -L2E * 2.0f * fabsf (a);
    e = raw_ex2 (t);
    d = e + 1.0f;
    r = raw_rcp (d);
    r = fmaf (e, -r, r);
    r = copysignf_pos (r, a);

    return __float2half_rn (r);
}