sincospif() implementation with improved performance and accuracy

Regular participants in these forums may have noticed that I am something of a fan of the sinpi(), cospi(), and sincospi() functions. These are applicable whenever an algorithm calls for computing sin(π*[expr]) or cos(π*[expr]), which is surprisingly often. And what's not to like: compared with sin(), cos(), and sincos(), performance and accuracy are improved, while code size and register pressure are reduced.

Recently I felt compelled to put some work into finding the ultimate single-precision implementation of sincospi(). While I cannot be sure I achieved that, I was able to get meaningful improvements in speed and accuracy compared to CUDA 7.5.

On a Quadro K2200, I measured CUDA 7.5 throughput as 19.342e9 function calls per second, while the code below clocks in at 20.728e9 function calls per second, a 7% speedup. Accuracy is also improved: the maximum error in sincospif() is 1.53358 ulps in CUDA 7.5 (for both the sine and cosine components), while the maximum error for my_sincospif() below is 0.96677 ulps for the sine component and 0.96563 ulps for the cosine component, meaning the results are faithfully rounded.

/*
  Copyright (c) 2016, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Argument reduction for sinpi, cospi, sincospi. Reduces to [-0.25, +0.25] */
__device__ __forceinline__ float trig_red_pi_f (float a, int *i)
{
    float r;

    r = rintf (a + a);
    *i = (int)r;
    r = a - 0.5f * r;
    return r;
}

/* Approximate cos(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.87440 */
__device__ __forceinline__ float cospif_poly (float s)
{
    float r;

    r =              2.31384277e-1f;  //  0x1.d9e000p-3f
    r = fmaf (r, s, -1.33502197e+0f); // -0x1.55c400p+0f
    r = fmaf (r, s,  4.05870390e+0f); //  0x1.03c1cep+2f
    r = fmaf (r, s, -4.93480206e+0f); // -0x1.3bd3ccp+2f
    r = fmaf (r, s,  1.00000000e+0f); //  0x1.000000p+0f
    return r;
}

/* Approximate sin(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.96677 */
__device__ __forceinline__ float sinpif_poly (float a, float s)
{
    float r;
    r =             -5.95703125e-1f;  // -0x1.310000p-1f
    r = fmaf (r, s,  2.55039954e+0f); //  0x1.46737ep+1f
    r = fmaf (r, s, -5.16772413e+0f); // -0x1.4abbfep+2f
    r = (a * s) * r;
    r = fmaf (a, 3.14159274e+0f, r);  //  0x1.921fb6p+1f
    return r;
}

/* Compute sin(x*pi) and cos(x*pi) simultaneously, based on the quadrant 
   Maximum error for sinpi is 0.96677 ulp; for cospi it is 0.96563 ulp
*/
__device__ __forceinline__ void sincospif_core (float a, int i, 
                                                float * __restrict__ sp, 
                                                float * __restrict__ cp)
{
    float c, s, t;

    s = a * a;
    c = cospif_poly (s);
    s = sinpif_poly (a, s);
    if (i & 2) {
        s = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        c = 0.0f - c; // don't change "sign" of NaNs or create negative zeros
    }
    if (i & 1) {
        t = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        s = c;
        c = t;
    }
    *sp = s;
    *cp = c;
}

__device__ void my_sincospif (float a, float *sp, float *cp)
{
    float c, r, s;
    int i;
    
    /* for |a| > 2**24, cospi(a) = 1.0f, but cospi(Inf) = NaN */
    a = (fabsf (a) < 16777216.0f) ? a : (a * 0.0f); // 0x1.0p24f
    r = trig_red_pi_f (a, &i);
    sincospif_core (r, i, &s, &c);
    /* IEEE-754: sinPi(+n) is +0 and sinPi(-n) is -0 for positive integers n */
    s = (a == floorf (a)) ? (a * 0.0f) : s;
    *sp = s;
    *cp = c;
}

I always love your nanooptimized math routines.

Your comment on line 40 confused me a bit, since the cosine expansion is of course in even powers of x, so the argument s is actually x^2, with a domain of 0 to 1/16.

I’m happily surprised that the cosine can be so well represented over a quadrant by a mere 4th-order polynomial in x^2, but it makes sense since it’s such a well-behaved curve.

Is the maximum ulp error at the endpoint x = 1/4, where cos(πx) is smallest over the interval?

I didn’t bother to check on the core cosine approximation, since it is the sine approximation that is responsible for the overall maximum error. The worst case error for the core sine approximation occurs around x = 1e-6, if I recall correctly.

I spent a lot of time reducing the precision of the least significant coefficients in both approximations, so they can be represented by an immediate operand that is part of the instruction itself (piped forward to the functional units), rather than a constant memory reference. This should reduce energy consumption. I am currently experimenting with that approach for all kinds of math functions.

kudos!

Thanks Norbert, I opened an RFE to integrate your function in an upcoming CUDA release.

I decided to learn a bit about these function fits and reproduce your results for the cosine lobe. It was fun figuring out how ULP error is computed… I first used a normalization of 1 ULP = the difference between the value and the next larger representable value. But then I realized how sensitive that is at the edge of an exponent jump, so I changed it to 1 ULP = the absolute difference between the value and the same value with its low mantissa bit flipped, exponent unchanged. That seems like a more robust definition.

I confirmed your reported worst case error of 0.87440… the error range is (-0.874403, 0.874094) ULP.

Then for fun I took the cosine fit coefficients and made a brute force loop trying random perturbations. I couldn’t improve on the absolute worst case, but after a few hours of CPU cooking I did tighten the maximum error range by a tiny fraction to (-0.874403, 0.872760) using these leading coefficients:

0.2313227803f
-1.3350331783f
4.0587048531f
-4.9348020554f

Such a trivial error range improvement is not really useful… but it did satisfy my OCD optimization itch.

I optimized the polynomials using a heuristic with a good deal of randomness thrown in as well, to avoid local minima. My primary optimization target is the absolute value of the maximum ulp error (so 0.874403 in this case), and a secondary optimization target is maximizing the number of correctly rounded results.

The two optimization targets often have a bit of an inverse relationship: As ulp error goes down, the number of correctly-rounded results will often decline by a few percent at the same time. I have never come across a formalized discussion of this phenomenon. My hand-wavy explanation is that as the error curve becomes less “peaky” (so ulp error reduces), the area under the curve increases (likelihood of incorrectly-rounded results increases).

One can likely find even better approximations, but there isn’t enough research in this area, and little incentive for anyone to pay for such research. I believe Tor Myklebust’s approach (http://arxiv.org/abs/1508.03211) is a step in the right direction, but it doesn’t really work for double precision and neither do my own methods.


Ah, the secondary-goal decision is interesting. I also experimented with a different rule set that first minimized absolute worst-case error, with a secondary tiebreaker reducing RMS relative error averaged over all bit patterns. I was able to improve that, too, but only by an even more trivial fraction. That RMS goal function probably behaves very similarly to the correct-rounding count, since all errors are less than 1 ulp.

What’s your timing benchmark? Likely a kernel where each thread calls the function 1000 times in an unrolled loop, accumulating the results in a sum so the compiler doesn’t eliminate the calls as dead code, and you measure the net kernel time using events?

In this case I used repetitions of the following building block for performance measurements:

func (t, &s, &c); t = s * c;

where the starting value of ‘t’ is a random number around 0.1. I didn’t bother to subtract out the time taken by the extra multiply.