For large arguments, the error of the __sincosf function becomes
significant (the absolute error is about 4e-4 for arguments of the
order of 1e4).
The error seems to be linear in the argument, and looks like a 90-degree
phase-shifted wave.
Replacing the code:
__sincosf(phi,&s,&c);
with:
const float correction = 4.065371168387674e-08f;
__sincosf(phi,&s1,&c1);
t = phi*correction;
s = s1 + t*c1;
c = c1 - t*s1;
improves the maximum absolute error in the range [-1e4,1e4] by a
factor of about 100. The cost is 3 instructions (one mul and two
fma's). The
maximum absolute error in the [-1e4,1e4] range for the improved code
is about 3.5e-6. The absolute error for small ranges is essentially
unaffected, since the correction is linear.
The resulting error of the improved code does not seem to grow with
the argument (phi), as is the case for the uncorrected __sincosf.
This has been tested on the G80 chip (an 8800GT, an 8800GTS and an
8800GTX, with identical results).
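For reference, here is the corrected call wrapped up as a self-contained device function (the wrapper name and the local declarations are mine; the constant and the rotation trick are exactly the ones above):

// Fast hardware sin/cos, then rotate the result by the small
// angle t = phi*correction to cancel the linear error term.
__device__ void sincosf_corrected(float phi, float *s, float *c)
{
    const float correction = 4.065371168387674e-08f;
    float s1, c1;
    __sincosf(phi, &s1, &c1);
    const float t = phi * correction;   // one mul
    *s = s1 + t * c1;                   // one fma
    *c = c1 - t * s1;                   // one fma
}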
Indeed, the hardware sin/cos operator is not so bad, except for small and large arguments.
If phi gets very small in magnitude (say, less than 10^-5), then __sinf(phi) will return 0. In this case, just returning phi is a very accurate alternative.
(testing the absolute value and comparing it to a constant is done in one instruction, so it will add two extra instructions at most).
But it may not matter in your case if you are just interested in minimizing the absolute error, not the relative error.
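To make the suggestion concrete, the guard could be bolted onto the corrected wrapper above roughly like this (the 10^-5 threshold is the one suggested; setting the cosine to 1.0f in that branch is my own assumption, accurate to the same order):

__device__ void sincosf_corrected_guarded(float phi, float *s, float *c)
{
    if (fabsf(phi) < 1e-5f) {
        // sin(phi) ~= phi and cos(phi) ~= 1 to well below float precision here
        *s = phi;
        *c = 1.0f;
    } else {
        const float correction = 4.065371168387674e-08f;
        float s1, c1;
        __sincosf(phi, &s1, &c1);
        const float t = phi * correction;
        *s = s1 + t * c1;
        *c = c1 - t * s1;
    }
}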
One more question: if CUDA provided a fast and accurate __sincospi(x) function that computes sin(pi * x) and cos(pi * x), would you be able to rewrite your code to take advantage of it?
(that is, does phi really need to be in radians?)
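As a hypothetical sketch of what such a rewrite could look like (the __sincospif name and the idea that the phase is accumulated in turns rather than radians are assumptions for illustration only):

// phase is kept in turns, so sin(2*pi*phase) becomes
// __sincospif(2*phase) with no 2*pi multiplication and no extra rounding.
__device__ void turns_to_sincos(float phase, float *s, float *c)
{
    __sincospif(2.0f * phase, s, c);   // hypothetical: computes sin(pi*x), cos(pi*x)
}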