A faster and more accurate implementation of sincosf()

A recent question on Stackoverflow reminded me that sincos() is one of those functions that can never be fast enough. Below is an implementation of single-precision sincosf() that boosts fast-path throughput by 14.7% compared to CUDA 7.5 on my Quadro K2200 (sm_50). It is also slightly more accurate, the fast-path interval has been widened considerably, and the code footprint is reduced compared to the CUDA 7.5 implementation.

Specifically, in my measurements, the fast-path throughput on the K2200 increases from 16.468 billion function evaluations per second with CUDA 7.5 to 18.889 billion function evaluations per second. The maximum error of the sine component improves from 1.54185 ulps with CUDA 7.5 to 1.49241 ulps, and the maximum error in the cosine component decreases from 1.53961 ulps with CUDA 7.5 to 1.49510 ulps (this also means that results never differ from correctly rounded results by more than 1 ulp).

Side note: If you want to see some truly abominable code generation by the CUDA compiler, enable the section guarded by #if SANE_COMPILER. The main problem seems to be that nvvm emits a bfe.u64 instruction which, having no hardware counterpart, maps to emulation code in the form of a called subroutine at the SASS level. Yikes!

[Code below modified 9/2/2016, 12/18/2016, 11/23/2018, 9/22/2019]

/*
  Copyright (c) 2016-2019 Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* 191 bits of 2/PI for Payne-Hanek style argument reduction. */
static __constant__ unsigned int two_over_pi_f [] = 
{
    0x517cc1b7,
    0x27220a94,
    0xfe13abe8,
    0xfa9a6ee0,
    0x6db14acc,
    0x9e21c820
};

__forceinline__ __device__ float trig_red_slowpath_f (float a, int *quadrant)
{
    unsigned long long int p;
    unsigned int ia, hi, mid, lo, tmp, i;
    int e, q;
    float r;

#if PORTABLE
    ia = (unsigned int)(fabsf (frexpf (a, &e)) * 4.29496730e+9f); // 0x1.0p32f
    e = e - 1;
#else // !PORTABLE
    ia = (unsigned int)__int_as_float ((__float_as_int (a) & 0x007fffff) + 0x4f000000);
    e = (((unsigned int)__float_as_int (a) >> 23) & 0xff) - 127;
#endif // !PORTABLE
    
    /* compute product x * 2/pi in 2.62 fixed-point format */
    i = (unsigned int)e >> 5;
    e = (unsigned int)e & 31;

    hi  = i ? two_over_pi_f [i-1] : 0;
    mid = two_over_pi_f [i+0];
    lo  = two_over_pi_f [i+1];
    tmp = two_over_pi_f [i+2];
 
#if __CUDA_ARCH__ < 320
    if (e) {
        hi  = (hi  << e) | (mid >> (32 - e));
        mid = (mid << e) | (lo  >> (32 - e));
        lo  = (lo  << e) | (tmp >> (32 - e));
    }
#else // __CUDA_ARCH__ >= 320
    asm ("{\n\t"
         "shf.l.clamp.b32  %2, %1, %2, %3;\n\t"
         "shf.l.clamp.b32  %1, %0, %1, %3;\n\t"
         "shf.l.clamp.b32  %0, %4, %0, %3;\n\t"
         "}\n\t"
         : "+r"(lo), "+r"(mid), "+r"(hi) : "r"(e), "r"(tmp));
#endif // __CUDA_ARCH__ >= 320

    p = (unsigned long long int)ia * lo;
    p = (unsigned long long int)ia * mid + (p >> 32);
    p = ((unsigned long long int)(ia * hi) << 32) + p;

#if SANE_COMPILER
    q = (int)(p >> 62);                // integral portion = quadrant<1:0>
    p = p & 0x3fffffffffffffffULL;     // fraction
    if (p & 0x2000000000000000ULL) {   // fraction >= 0.5
        p = p - 0x4000000000000000ULL; // fraction - 1.0
        q = q + 1;
    }
#else // !SANE_COMPILER
    unsigned int phi, plo;
    asm ("mov.b64 {%0,%1}, %2;" : "=r"(plo), "=r"(phi) : "l"(p));
    q = phi >> 30;
    phi = phi & 0x3fffffff;
    if (phi & 0x20000000) {
        phi = phi - 0x40000000;
        q = q + 1;
    }
    asm ("mov.b64 %0, {%1,%2};" : "=l"(p) : "r"(plo), "r"(phi));
#endif // !SANE_COMPILER

    /* compute remainder of x / (pi/2) */
    double d;

    d = (double)(long long int)p;
    d = d * 3.4061215800865545e-19; // 0x1.921fb54442d18p-62 // pi/2 * 0x1.p-62
    r = (float)d;

    if (a < 0.0f) {
        r = -r;
        q = -q;
    }

    *quadrant = q;
    return r;
}

/* Argument reduction for trigonometric functions that reduces the argument
   to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns 
   -0.0f for an input of -0.0f 
*/
__forceinline__ __device__ float trig_red_f (float a, float switch_over, int *q)
{    
    float j, r;

    /* FMA-enhanced Cody-Waite style reduction. W. J. Cody and W. Waite, 
       "Software Manual for the Elementary Functions", Prentice-Hall 1980
    */
    j = fmaf (a, 0.636619747f, 12582912.0f); // 2/pi, 0x1.8p+23
    *q = __float_as_int (j);
    j = j - 12582912.0f; // 0x1.8p+23
    r = fmaf (j, -1.57079601e+00f, a); // -0x1.921fb0p+00 // pio2_high
    r = fmaf (j, -3.13916473e-07f, r); // -0x1.5110b4p-22 // pio2_mid
    r = fmaf (j, -5.39030253e-15f, r); // -0x1.846988p-48 // pio2_low
    if (fabsf (a) > switch_over) {
        /* Payne-Hanek style reduction. M. Payne and R. Hanek, Radian reduction
           for trigonometric functions. SIGNUM Newsletter, 18:19-24, 1983
        */
        r = trig_red_slowpath_f (a, q);
    }
    return r;
}

/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error = 0.64196
   Returns -0.0f for an argument of -0.0f
   Polynomial approximation based on T. Myklebust, "Computing accurate 
   Horner form approximations to special functions in finite precision
   arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
*/
__forceinline__ __device__ float sinf_poly (float a, float s)
{
    float r, t;
    r =              2.86567956e-6f;  //  0x1.80a000p-19
    r = fmaf (r, s, -1.98559923e-4f); // -0x1.a0690cp-13
    r = fmaf (r, s,  8.33338592e-3f); //  0x1.111182p-07
    r = fmaf (r, s, -1.66666672e-1f); // -0x1.555556p-03
    t = fmaf (a, s, 0.0f); // ensure -0 is passed through
    r = fmaf (r, t, a);
    return r;
}

/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error = 0.87444 */
__forceinline__ __device__ float cosf_poly (float s)
{
    float r;
    r =              2.44677067e-5f;  //  0x1.9a8000p-16
    r = fmaf (r, s, -1.38877297e-3f); // -0x1.6c0efap-10
    r = fmaf (r, s,  4.16666567e-2f); //  0x1.555550p-05
    r = fmaf (r, s, -5.00000000e-1f); // -0x1.000000p-01
    r = fmaf (r, s,  1.00000000e+0f); //  0x1.000000p+00
    return r;
}

/* Compute sine and cosine simultaneously, based on quadrant */
__forceinline__ __device__ void scf_core (float a, int i, float *sp, float *cp)
{
    float c, s, t;

    s = a * a;
    c = cosf_poly (s);
    s = sinf_poly (a, s);
    if (i & 2) {
        s = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        c = 0.0f - c; // don't change "sign" of NaNs or create negative zeros
    }
    if (i & 1) {
        t = 0.0f - s; // don't change "sign" of NaNs or create negative zeros
        s = c;
        c = t;
    }
    *sp = s;
    *cp = c;
}

/* maximum ulp error sin = 1.49241, maximum ulp error cos = 1.49510 */
__device__ void my_sincosf (float a, float *sp, float *cp)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, 71476.0625f, &i);
    scf_core (r, i, sp, cp);
}
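For readers who want to poke at the fast path without a GPU: below is a host-side stand-in (plain C++, my own sketch, not part of the device code above) for the FMA-based Cody-Waite reduction. The (int)j quadrant extraction replaces the device-only __float_as_int() trick; both yield the same low two bits of the quadrant.

```cpp
#include <math.h>

// Host-side sketch of the Cody-Waite fast path, for sanity-checking the
// reduction against a double-precision reference. Constants are identical
// to those in trig_red_f() above.
static float trig_red_host (float a, int *q)
{
    float j = fmaf (a, 0.636619747f, 12582912.0f); // a * 2/pi + 0x1.8p+23
    j = j - 12582912.0f;            // nearest integer multiple of pi/2
    *q = (int)j;                    // quadrant; only the low 2 bits matter
    float r = fmaf (j, -1.57079601e+00f, a);  // subtract j * pio2_high
    r = fmaf (j, -3.13916473e-07f, r);        // subtract j * pio2_mid
    r = fmaf (j, -5.39030253e-15f, r);        // subtract j * pio2_low
    return r;
}
```

For a = 100.0f this delivers the quadrant 64 (= 0 mod 4) and a residual very close to 100 - 32*pi = -0.5309649..., i.e. well inside [-pi/4, +pi/4].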


Can you submit this code to Nvidia somehow?

I already created the RFE (bug 1807633), thanks Norbert.

@MutantJohn: While I used to be an avid proponent of submitting RFEs to NVIDIA, the fact that the company seems to be unable to push out a new (non-RC) version of CUDA a whopping twelve months after it shipped CUDA 7.5 and a couple of months after introduction of a new hardware generation has put serious doubts into my mind as to how useful this would be. So for now I am just sharing my code with whoever might find it useful, under a proper OSI-approved license so it can go into any kind of project.

@mfatica: Thanks.

Always look forward to your improved math function implementations. Thanks for posting.

I tried this in a custom multi-batch DFT/convolution implementation and noticed that the register usage increased by 25% from the switch to my_sincosf() from __sincosf(). This unfortunately reduced occupancy so no performance boost for this specific case.

Could that possibly have anything to do with the ‘fast-path’ and ‘slow path’ conditional branching?

Argh. Try changing from noinline to forceinline for the slow path. One really would prefer noinline here so that multiple calls to sincosf() do not wind up creating multiple copies of the bulky slow-path code. Use of noinline looked OK in my testing.

I assume that you did not #define PORTABLE or SANE_COMPILER, and built the code with CUDA 7.5.

I wonder whether the computation of “p” in the slowpath could be at fault, the CUDA compiler does some pretty stupid things for 64-bit integer computations at times. I hope I will not have to re-write that into custom inline PTX.

In my test framework, use of forceinline instead of noinline for the slowpath code decreases the register usage by five registers. So clearly forceinline is the way to go; the resulting code is still smaller than CUDA 7.5's. The register pressure with forceinline is on par with CUDA 7.5; in fact, in this particular context my code uses one register less (on sm_50, a kernel using my_sincosf() needs 18 registers, vs 19 registers with CUDA 7.5 sincosf()).

I have updated the posted code to use forceinline for the slowpath. Can you re-run your test case to check whether the register pressure issue goes away? Thanks!

That did make a difference when compared to your original version, but that particular kernel’s register usage still increased.

I am using CUDA 7.5 and compiling for 5.2.

Using the __sincosf() from the CUDA library this is the compile output for that kernel;

1>  ptxas info    : Compiling entry function '_Z41ramp_filter_3_padding_complex_Dalsa_multiP6float2PKS_ii' for 'sm_52'
1>  ptxas info    : Function properties for _Z41ramp_filter_3_padding_complex_Dalsa_multiP6float2PKS_ii
1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 40 registers, 13784 bytes smem, 344 bytes cmem[0]

Now using your updated version;

1>  ptxas info    : Compiling entry function '_Z41ramp_filter_3_padding_complex_Dalsa_multiP6float2PKS_ii' for 'sm_52'
1>  ptxas info    : Function properties for _Z41ramp_filter_3_padding_complex_Dalsa_multiP6float2PKS_ii
1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 54 registers, 13784 bytes smem, 344 bytes cmem[0], 56 bytes cmem[2]

It is a wonky kernel so maybe there is something else contributing to the issue.

I honestly have no idea why there would be this huge 14 register difference. That is basically the entire register budget needed for my_sincosf() itself. I will check what happens if I compile for sm_52 instead of sm_50.

Wait! Did you write __sincosf() by accident, or is this what your code actually uses? __sincosf() is an intrinsic device function that maps directly to special function unit instructions, it is basically three instructions: RRO.SIN, MUFU.SIN, MUFU.COS.

my_sincosf() is a replacement for the standard math function sincosf(), there is no way either sincosf() or my_sincosf() are going to compete with the much less accurate __sincosf() intrinsic.

You are correct…
I am using the less accurate __sincosf() intrinsic function that is described here;

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html#group__CUDA__MATH__INTRINSIC__SINGLE_1gb5357a39a32653bbb1ffde7748eaa521

This specific use case does not require high-accuracy so that intrinsic function can be used.
Sorry for not making this more clear…

Glad we got that sorted out. Bringing up potential register pressure issues with my_sincosf() was helpful, so thank you for that. With the slow path code now inlined, the function’s throughput improves by another 6% or so on my K2200, which is welcome news.

Norbert, you mentioned before that your secondary optimization goal, after minimizing absolute worst case ULP, was to maximize the number of correctly rounded values.

For the very specific case of the sin/cos pair evaluation, perhaps the secondary goal might be better used to optimize the PAIRWISE accuracy, specifically the expected normalization of s^2+c^2==1. If a user is asking for a sin/cos pair, then that normalization relationship is likely relevant.

Of course this kind of pico-optimization is already almost picayune nitpicking, but it’s satisfying to know “this evaluation cannot be any better.” But you have to define “better”.

I bring this up because of a similar nit-picking optimization I did for my CPU Monte Carlo work. I have pregenerated lookup tables for various spatial diffusion kernels, and the very simplest, almost reference, is just a circular or spherical set of points distributed evenly in a way that zeroes the higher-order multipole moments of each. I found a nagging, almost imperceptible, error in the simplest sin/cos table (1024 sin/cos samples evenly spaced in a circle!) and tracked it down to the fact that computing sin/cos in double precision and rounding to float will break the s^2+c^2==1 invariant (often by hundreds of ULP), but my test harness is looking for such multipole invariants as a quality check and it was (correctly) identifying this as an unexpected (but really tiny) bias.
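The effect described here is easy to reproduce on the host. A minimal sketch (plain C++, my own illustration, not SPWorley's harness) that rounds double-precision sin/cos to float and measures the worst normalization residual over evenly spaced samples:

```cpp
#include <math.h>

// Compute sin/cos in double, round each result to float, and return the
// largest observed |s*s + c*c - 1| over n evenly spaced angles on the circle.
double max_norm_residual (int n)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double x = 6.283185307179586 * i / n;
        float s = (float)sin (x);   // double result rounded to float
        float c = (float)cos (x);
        double resid = fabs ((double)s * s + (double)c * c - 1.0);
        if (resid > worst) worst = resid;
    }
    return worst;
}
```

With n = 1024 the worst residual lands on the order of 1e-7, i.e. around one ULP of 1.0f, which is small in absolute terms but readily detectable by an invariant-checking harness.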

You bring up an interesting aspect specific to combined sine/cosine computation that I have never considered. Would it be correct to say that your proposal is equivalent to computing the residual 1 - sin^2(x) - cos^2(x) for each test vector, then minimizing the sum of the absolute value of these residuals over all test cases?

I will look into it to see whether that makes sense as a tertiary optimization goal (until there is stronger evidence that it should be a secondary optimization goal). I suspect that it may only be useful for faithfully-rounded implementations of sine and cosine (that would require a slower argument reduction returning a double-float in head:tail fashion).

As you say, you have to decide the quality goal prioritization. In some cases, the ANGULAR accuracy may be most important (how far off is the actual atan2(s,c) angle from the original input angle?) But minimizing the maximum ULP error of each of the sine and cosine evaluations in isolation should be the primary goal. When you have secondary goals like pairwise angle or magnitude error, you also need to decide if you’re reducing the maximum error of those, or an RMS average error. I’ll argue for minimizing RMS magnitude, but that’s from my own application. Whatever the secondary goal is, though, it’s an awfully trivial correction.

Those are all good points. I have a pretty good overview of the relevant literature of the past 25 years, and no paper comes to mind that has looked into these issues, whether from a practical or a theoretical perspective. I vaguely recall a paper by Japanese astronomers about subtle phase shifts [??] in double-precision implementation of trigonometric functions causing trouble, which affected extremely long-range simulations.

[Later:] I managed to track down the paper I was thinking of: Takashi Ito and Sadamu Kojima, “Inaccuracies of Trigonometric Functions in Computer Mathematical Libraries”, Publ. Natl. Astron. Obs. Japan, Vol. 8, 17-31 (2005) (http://www.nao.ac.jp/contents/about-naoj/reports/publications-naoj/8-14.pdf). The issues discussed in the paper seem to be very similar, if not identical to, your accuracy concerns.

There is even a brief follow-up paper I was unaware of until a minute ago: N.A. Shupikova and G.O. Ryabova, “Once again on inaccuracies of trigonometric functions in computer mathematical libraries”, EPSC Abstracts, Vol. 4, EPSC2009-275, 2009 (http://meetingorganizer.copernicus.org/EPSC2009/EPSC2009-275.pdf)

Norbert… yet again I’m stunned by your depth of floating point knowledge. I laughed out loud at your over-modest claim to have a “pretty good overview of the relevant literature of the past 25 years”.

That Japanese paper is indeed apropos. When using the sin/cos pair as the entries of a 2D transformation matrix, that matrix is often repeatedly applied many times to a coordinate, so any non-unity sin/cos normalization will blow up (if >1) or decay to 0 (if <1). That’s a much more dramatic example than my own Green’s function moment measurements, and a strong argument for optimizing normalization quality over angular.

The second paper is also supportive, but the Japanese example is far more powerful since it’s an actual application that had measurable errors tracked down to the sin/cos pairwise quality.

For my single precision Green’s function tables I just made an ad-hoc one-time correction postprocess. I took each (s,c) pair, accepted the larger-in-magnitude value as correct, then recalculated the other value as the single precision value that provided the best s,c normalization.
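A sketch of that postprocess might look like the following. This is my reconstruction of the described procedure, not SPWorley's actual code; in particular the neighbor scan around the recomputed complement is an assumption:

```cpp
#include <math.h>

// One-time table correction: keep the larger-magnitude component of an
// (s, c) pair, then re-pick the smaller one as the float that best restores
// s*s + c*c == 1. Residuals are evaluated in double precision.
void renormalize_pair (float *s, float *c)
{
    float *big   = (fabsf (*s) >= fabsf (*c)) ? s : c;
    float *small = (big == s) ? c : s;
    // candidate: the complement computed from the exact relation, in double
    float cand = (float)copysign (sqrt (1.0 - (double)*big * *big),
                                  (double)*small);
    float best = cand;
    double bestres = fabs ((double)*big * *big + (double)cand * cand - 1.0);
    // also scan the immediate float neighbors for an even better norm
    for (int dir = -1; dir <= 1; dir += 2) {
        float t = nextafterf (cand, dir * 2.0f);
        double res = fabs ((double)*big * *big + (double)t * t - 1.0);
        if (res < bestres) { bestres = res; best = t; }
    }
    *small = best;
}
```

By construction this never increases the normalization residual, and it leaves the larger component (and hence most of the angular information) untouched.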

Wow, that resource of papers is fantastic!

Thanks njuffa for sharing your knowledge and thanks SPWorley for bringing attention to this resource.

SPWorley, I had the same thoughts about Norbert’s bibliography and knowledge. I think this modesty is just what signifies really great people.

Regarding your post-processing of (sin(x), cos(x)) pairs, I am a bit confused. Why are you accepting the larger value in magnitude as correct?
I would expect the smaller value to have better accuracy. In particular, with the 1 - x^2/2 + … expansion around the maxima having no linear term, trying to deduce the smaller from the larger of the two functions seems especially susceptible to rounding error.

In my case (which is an offline preprocess for a table, not a runtime FMA Horner rule evaluation) the best unit (sine, cosine) normalization magnitude is important. Both components start with < 1 ULP error, so they’re both as accurate as possible in isolation. But I’m willing to let them deviate slightly to improve their pairwise unit magnitude. Varying the smaller of the two gives more control over that norm value since dMag/dv is smaller. Ideally I’d vary both and use normalization as the primary goal and angular accuracy as secondary, but that would likely give me a similar answer.

Here’s an example assuming two decimal digits of precision. If I were making a s,c table value for the angle 80 degrees, I would get the initial < 1ULP sine/cosine pair of (0.98, 0.17) with net magnitude 0.9946. If I lock the smaller cosine value at 0.17, the sine complement is best adjusted to 0.99 (i.e., 0.98 and 1.00 both give WORSE normalization), and the magnitude is slightly improved to 1.0045. If I lock the sine value of 0.98, the best cosine complement is 0.20, giving a far better magnitude of 1.0002. So the original normalization residual error was 0.0054, adjusting the larger value gives an improved 0.0045, but adjusting the smaller is an order of magnitude better with just 0.0002 residual.
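For what it’s worth, the decimal arithmetic above checks out numerically; a throwaway helper (hypothetical, plain C++) reproduces the three residuals:

```cpp
#include <math.h>

// How far the magnitude of an (s, c) pair deviates from the unit circle.
double norm_residual (double s, double c)
{
    return fabs (sqrt (s * s + c * c) - 1.0);
}
```

norm_residual(0.98, 0.17) is about 0.0054, norm_residual(0.99, 0.17) about 0.0045, and norm_residual(0.98, 0.20) about 0.0002, matching the order-of-magnitude improvement claimed for adjusting the smaller component.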

There are definitely cases where the larger value is better to change, especially when the values are close in magnitude to start with. The lower dMag/dv means adjusting the smaller gives finer control in the majority of cases.

Ah I see. I was just worried that you would need to modify the smaller value by several ULP, so that now the atan2 would also be several ULP off. But if the norm is your topmost concern, it makes sense.

Sounds like a good use case for quaternions, but this obviously depends on the concrete problem.