A faster and more accurate implementation of sincosf()

I trimmed one instruction from the fast path of the argument reduction and updated the code in my original post. Instead of using a float->int conversion instruction to return the quadrant ‘q’, I simply extract it during the fast ‘rint’ computation (only the two LSBs of ‘q’ are needed “downstream”, so we do not care about the rest of the bits).
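For readers who have not opened the updated code: the trick looks roughly like the sketch below. The helper name and the wrapping are illustrative only; the real thing is in the original post.

// Illustrative sketch only (see the full my_sincosf() code in the original
// post). Adding 1.5 * 2**23 to a * 2/pi forces the sum to round to the
// nearest integer, and that integer lands in the low-order mantissa bits,
// so the quadrant can be read out with a bit reinterpretation instead of a
// separate float->int conversion. Only valid on the fast path, where
// |a * 2/pi| stays well below 2**22.
__device__ __forceinline__ float fast_rint_with_quadrant (float a, int *q)
{
    float j = fmaf (a, 0.636619747f, 12582912.0f); // a * 2/pi + 1.5 * 2**23
    *q = __float_as_int (j);   // only the two LSBs of *q are meaningful
    return j - 12582912.0f;    // rint (a * 2/pi) as a float, ready for the
                               // Cody-Waite style reduction against pi/2
}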

I have always wondered whether anybody cares about the exact size of the interval covered by the fast path in trigonometric functions. Apparently, the authors of this paper do: http://ieeexplore.ieee.org/document/6901738/

I should note that the width of the fast-path interval in the code for my_sincosf() above is limited by the cosine component. If only the sine component is needed, the switch-over point between the fast path and the slow path can be moved all the way out to 117435.0.

I just wrote a timing harness to compare speeds for myself, and I find a larger speed difference between CUDA's built-in sincosf() and your code.

On my GTX 750Ti, CUDA 8.0 sincosf() evaluates at 6.09 GigaEvals/Sec.
The Juffariffic replacement evaluates at 11.09 GigaEvals/Sec, 1.82x faster, much better than you report.

On my GTX 950, CUDA 8.0 is 11.78 GigaEvals/Sec. Juftastic: 18.55. 1.57x.

There could certainly be differences due to harness implementation. I also realize that the evals/sec would be a lot lower if the arguments were randomized per thread, since individual warps would then diverge more often, having to execute both the fast path and the slow path. Norbert, perhaps that’s the way you measured? It would probably be the fairest approach, since it exposes the divergence penalty more clearly.

My performance test framework does in fact use different data in every thread, and divergence does occur. It is not completely random data, though, and I set it up to never hit the slow path, as that is unlikely to be exercised in real-life applications.
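In outline, the setup looks something like the sketch below. This is not my actual framework (which is more elaborate), just an indication of how a GigaEvals/sec figure of the kind quoted above can be produced; the sizes, the per-thread argument pattern, and the unroll factor are all arbitrary.

// Stripped-down throughput harness sketch. Each thread receives a different
// argument, all arguments stay far below the slow-path threshold, and a
// number of evaluations per thread amortizes the residual memory traffic.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS       (1024 * 1024)
#define EVALS_PER_THREAD  64

__global__ void sincosf_bench (float *__restrict__ out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = (float)(tid & 0xffff) * 0.25f;  // per-thread argument, fast path only
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < EVALS_PER_THREAD; i++) {
        float s, c;
        sincosf (a + (float)i, &s, &c);       // swap in my_sincosf() to compare
        acc += s + c;
    }
    out[tid] = acc;                           // keep the results live
}

int main (void)
{
    float *out = 0;
    cudaMalloc ((void **)&out, NUM_THREADS * sizeof (float));

    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    sincosf_bench<<<NUM_THREADS / 256, 256>>>(out);   // warm up
    cudaEventRecord (start);
    sincosf_bench<<<NUM_THREADS / 256, 256>>>(out);
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);

    float ms = 0.0f;
    cudaEventElapsedTime (&ms, start, stop);
    double evals = (double)NUM_THREADS * EVALS_PER_THREAD;
    printf ("%.2f GigaEvals/sec\n", evals / (ms * 1.0e-3) / 1.0e9);

    cudaFree (out);
    return 0;
}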

For simplicity, I do not carefully subtract out the residual overhead of the necessary memory operations; as a consequence, the improvements seen with a more carefully calibrated framework should be a bit higher than what I stated.

My goal here was simply to give an idea of what kind of performance improvement might be achieved under reasonably realistic circumstances, not to post the highest possible benchmark number. If that resulted in an “underpromise and overdeliver” scenario here, that suits me just fine :-)

Applied changes to the core approximations for a minor improvement in the number of correctly rounded results. In case someone is wondering about all the recent accuracy-related updates to various previously posted math function codes: I am test-driving a new heuristic for my polynomial approximation generation.

Noticed belatedly that the CUDA compiler (CUDA 8) isn’t able to map the shift sequence in the slowpath to SHF (funnel shift) instructions. So this is now done by hand for >= sm_30, saving seven instructions (the slowpath should be as compact as possible).
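For illustration (this is not necessarily the exact rewrite in the code above): a shift across two adjacent 32-bit words can be expressed with CUDA's funnel-shift intrinsics, which ptxas maps to SHF, instead of leaving the compiler to reconstruct each output word from a shift/shift/OR sequence. The helper name below is made up; in code that also targets older architectures it would sit behind a __CUDA_ARCH__ check, with the plain shift/OR form as the fallback.

// Illustration only: shift the 64-bit quantity hi:lo left by s and return
// the upper 32 bits; ptxas can emit a single SHF instruction for this.
__device__ __forceinline__ unsigned int shl_across (unsigned int lo,
                                                     unsigned int hi,
                                                     unsigned int s)
{
    // plain-C equivalent for s in 1..31: (hi << s) | (lo >> (32 - s))
    return __funnelshift_l (lo, hi, s);
}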

As for performance, I see about 24% higher throughput for my_sincosf() versus the built-in function from CUDA 8, on a Quadro K2200 (sm_50). The test covers the fastpath only, as the slowpath should not come into play in 99.99% of real-life applications.