Improving the accuracy of the __sincosf-function

For large arguments, the error of the __sincosf-function becomes
significant (the absolute error is about 4e-4 for arguments of the
order of 1e4).

The error seem to be linear in the argument, and be a 90-degree
phase-shifted wave.

Replacing the code:

   __sincosf(phi,&s,&c);

with:

  const float correction = 4.065371168387674e-08f;
  __sincosf(phi,&s1,&c1);
  t = t*correction;
  s = s1 + t*c1;
  c = c1 - t*s1;

improves the maximum absolute error in he range [-1e4,1e4] bu a
factor of about 100. The cost 3 is instrucitons (one mul and two
fma’s). The
maximum absolute error in the [-1e4,1e4] range for the improved code
is about 3.5e-6. The absolute error for small ranges is essentially
unaffected, since the correction is linear.

The resulting error of the improved code does not seem to grow with
the argument (phi), as is the case for the uncorrected __sincosf.

This has been tested on the G80 chip (a 8800GT, a 8800GTS and a
8800GTX with identical results).

What does the variable t mean in your code?

For one more operation, you could perform a Cody-Waite range reduction. Something like:

float x = phi;

float const C1 = 6.28125f;

float const C2 = 1.93530718e-3f;

float k = rintf(x * float(2*M_1_PI));

x_r = (x - k * C1) - k * C2;

__sincosf(x_r, &s1, &c1);

(or 5 operations, as rintf may need 2 instructions)

It should be much more accurate for arguments up to around 25000.

By the way, why can’t you use the sincosf() function?

What does the variable t mean in your code?

Sorry. the second t is supposed to be phi, e.g.:

[indent]

     __sincosf(phi,&c1,&s1);

 t = correction*phi;

     s = s1 + t*c1;

     c = c1 - t*s1;

[/indent]

By the way, why can’t you use the sincosf() function?

I first had one sin() and one cos(). Switching to __sincosf() saved 10 registers and

the entire code (including other calculations) was 6 times faster. My idea was to use

the hardware (but low accuarcy) functions. Since the error for large arguments was

undocumented I mapped it out. The realization that the error signal was just a phase

shifted wave of linear amplitude lead to this trick. The improvement above turned out

to be good enough, and very cheap.

For one more operation, you could perform a Cody-Waite range reduction. […]

I appreciate you tip. That is a good option, I will keep it in mind for later.

Interesting.

Indeed, the hardware sin/cos operator is not so bad, except for small and large arguments.

If phi gets very small in magnitude (say, less than 10^-5), then __sin(phi) will return 0. In this case, just returning phi is a very accurate alternative.
(testing the absolute value and comparing it to a constant is done in one instruction, so it will add two extra instructions at most).
But it may not matter in your case if you are just interested in minimizing the absolute error, not the relative error.

One more question: if CUDA provided a fast and accurate __sincospi(x) function that computes sin(pi * x) and cos(pi * x), would you be able to rewrite your code to take advantage of it?
(that is, does phi really needs to be in radians?)