sincospif() implementation with improved performance and accuracy

Regular participants in these forums may have noticed that I am something of a fan of the sinpi(), cospi(), and sincospi() functions. These are applicable whenever an algorithm calls for computing sin(π*[expr]) or cos(π*[expr]), which is surprisingly often. And what's not to like: compared with sin(), cos(), and sincos(), performance and accuracy are improved, while code size and register pressure are reduced.

Recently I felt compelled to put some work into finding the ultimate single-precision implementation of sincospi(). While I cannot be sure I achieved that, I was able to get meaningful improvements in speed and accuracy compared to CUDA 7.5.

On a Quadro K2200, I measured CUDA 7.5 throughput as 19.342e9 function calls per second, while the code below clocks in at 20.728e9 function calls per second, a 7% speedup. Accuracy is also improved: the maximum error in sincospif() is 1.53358 ulps in CUDA 7.5 (for both the sine and cosine components), while the maximum error for my_sincospif() below is 0.96677 ulps for the sine component and 0.96563 ulps for the cosine component, meaning the results are faithfully rounded.

/*
  Copyright (c) 2016, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Argument reduction for sinpi, cospi, sincospi. Reduces to [-0.25, +0.25] */
__device__ __forceinline__ float trig_red_pi_f (float a, int *i)
{
    float r;

    r = rintf (a + a);
    *i = (int)r;
    r = a - 0.5f * r;
    return r;
}

/* Approximate cos(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.87440 */
__device__ __forceinline__ float cospif_poly (float s)
{
    float r;

    r =              2.31384277e-1f;  //  0x1.d9e000p-3f
    r = fmaf (r, s, -1.33502197e+0f); // -0x1.55c400p+0f
    r = fmaf (r, s,  4.05870390e+0f); //  0x1.03c1cep+2f
    r = fmaf (r, s, -4.93480206e+0f); // -0x1.3bd3ccp+2f
    r = fmaf (r, s,  1.00000000e+0f); //  0x1.000000p+0f
    return r;
}

/* Approximate sin(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.96677 */
__device__ __forceinline__ float sinpif_poly (float a, float s)
{
    float r;
    r =             -5.95703125e-1f;  // -0x1.310000p-1f
    r = fmaf (r, s,  2.55039954e+0f); //  0x1.46737ep+1f
    r = fmaf (r, s, -5.16772413e+0f); // -0x1.4abbfep+2f
    r = (a * s) * r;
    r = fmaf (a, 3.14159274e+0f, r);  //  0x1.921fb6p+1f
    return r;
}

/* Compute sin(x*pi) and cos(x*pi) simultaneously, based on the quadrant 
   Maximum error for sinpi is 0.96677 ulp; for cospi it is 0.96563 ulp
*/
__device__ __forceinline__ void sincospif_core (float a, int i, 
                                                float * __restrict__ sp, 
                                                float * __restrict__ cp)
{
    float c, s, t;

    s = a * a;
    c = cospif_poly (s);
    s = sinpif_poly (a, s);
    if (i & 2) {
        s = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        c = 0.0f - c; // don't change "sign" of NaNs or create negative zeros
    }
    if (i & 1) {
        t = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        s = c;
        c = t;
    }
    *sp = s;
    *cp = c;
}

__device__ void my_sincospif (float a, float *sp, float *cp)
{
    float c, r, s;
    int i;
    
    /* for |a| > 2**24, cospi(a) = 1.0f, but cospi(Inf) = NaN */
    a = (fabsf (a) < 16777216.0f) ? a : (a * 0.0f); // 0x1.0p24f
    r = trig_red_pi_f (a, &i);
    sincospif_core (r, i, &s, &c);
    /* IEEE-754: sinPi(+n) is +0 and sinPi(-n) is -0 for positive integers n */
    s = (a == floorf (a)) ? (a * 0.0f) : s;
    *sp = s;
    *cp = c;
}

I always love your nanooptimized math routines.

Your comment on line 40 confused me a bit, since the cosine expansion is of course in even powers of x, so the argument s is actually x^2, with a domain of 0 to 1/16.

I’m happily surprised that the cosine can be so well represented over a quadrant by a mere 4th-order polynomial in x^2, but it makes sense since it’s such a well-behaved curve.

Is the maximum ulp error at the endpoint x = 1/4, where cos(πx) is smallest over the interval?

I didn’t bother to check on the core cosine approximation, since it is the sine approximation that is responsible for the overall maximum error. The worst case error for the core sine approximation occurs around x = 1e-6, if I recall correctly.

I spent a lot of time reducing the precision of the least significant coefficients in both approximations, so they can be represented by an immediate operand that is part of the instruction itself (piped forward to the functional units), rather than a constant memory reference. This should reduce energy consumption. I am currently experimenting with that approach for all kinds of math functions.

kudos!

Thanks Norbert, I opened an RFE to integrate your function in an upcoming CUDA release.

I decided to learn a bit about these function fits and reproduce your results for the cosine lobe. It was fun figuring out how ULP error is computed… I first used a normalization of 1 ULP = the difference between the value and the next larger representable value. But then I realized how sensitive that is at the edge of an exponent jump, so I changed it to 1 ULP = the absolute difference between the value and the same value with its low mantissa bit flipped, exponent unchanged. That seems like a more robust definition.

I confirmed your reported worst case error of 0.87440… the error range is (-0.874403, 0.874094) ULP.

Then for fun I took the cosine fit coefficients and made a brute force loop trying random perturbations. I couldn’t improve on the absolute worst case, but after a few hours of CPU cooking I did tighten the maximum error range by a tiny fraction to (-0.874403, 0.872760) using these leading coefficients:

0.2313227803f
-1.3350331783f
4.0587048531f
-4.9348020554f

Such a trivial error range improvement is not really useful… but it did satisfy my OCD optimization itch.

I optimized the polynomials using a heuristic with a good deal of randomness thrown in as well, to avoid local minima. My primary optimization target is the absolute value of the maximum ulp error (so 0.874403 in this case), and a secondary optimization target is maximizing the number of correctly rounded results.

The two optimization targets often have a bit of an inverse relationship: As ulp error goes down, the number of correctly-rounded results will often decline by a few percent at the same time. I have never come across a formalized discussion of this phenomenon. My hand-wavy explanation is that as the error curve becomes less “peaky” (so ulp error reduces), the area under the curve increases (likelihood of incorrectly-rounded results increases).

One can likely find even better approximations, but there isn’t enough research in this area, and little incentive for anyone to pay for such research. I believe Tor Myklebust’s approach (http://arxiv.org/abs/1508.03211) is a step in the right direction, but it doesn’t really work for double precision and neither do my own methods.


Ah, the secondary-goal decision is interesting. I also experimented with a different rule set that first minimized absolute worst-case error, with a secondary tiebreaker reducing RMS relative error averaged over all bit patterns. I was able to improve that, too, but only by an even more trivial fraction. That RMS goal function probably behaves very similarly to the correct-rounding count, since all errors are less than 1 ulp.

What’s your timing benchmark? Likely a kernel where each thread calls the function 1000 times in an unrolled loop, accumulating the results in a sum so the compiler doesn’t eliminate the calls as dead code, and you measure the net kernel time using events?

In this case I used repetitions of the following building block for performance measurements:

func (t, &s, &c); t = s * c;

where the starting value of ‘t’ is a random number around 0.1. I didn’t bother to subtract out the time taken by the extra multiply.