A faster and more accurate implementation of sincosf()

I trimmed one instruction from the fast path of the argument reduction and updated the code in my original post. Instead of using a float->int conversion instruction to return the quadrant ‘q’, I simply extract it during the fast ‘rint’ computation (only the two LSBs of ‘q’ are needed “downstream”, so we do not care about the rest of the bits).
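For readers who have not opened the updated code: the trick looks roughly like the sketch below. The helper name and the wrapping are illustrative only; the real thing is in the original post.

// Illustrative sketch only (see the full my_sincosf() code in the original
// post). Adding 1.5 * 2**23 to a * 2/pi forces the sum to round to the
// nearest integer, and that integer lands in the low-order mantissa bits,
// so the quadrant can be read out with a bit reinterpretation instead of a
// separate float->int conversion. Only valid on the fast path, where
// |a * 2/pi| stays well below 2**22.
__device__ __forceinline__ float fast_rint_with_quadrant (float a, int *q)
{
    float j = fmaf (a, 0.636619747f, 12582912.0f); // a * 2/pi + 1.5 * 2**23
    *q = __float_as_int (j);   // only the two LSBs of *q are meaningful
    return j - 12582912.0f;    // rint (a * 2/pi) as a float, ready for the
                               // Cody-Waite style reduction against pi/2
}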

I have always wondered whether anybody cares about the exact size of the interval covered by the fast path in trigonometric functions. Apparently, the authors of this paper do: http://ieeexplore.ieee.org/document/6901738/

I should note that the width of the fast-path interval in the code for my_sincosf() above is limited by the cosine component. If only the sine component is needed, the switch-over point between the fast path and the slow path can be moved all the way out to 117435.0.

I just wrote a timing harness to compare speeds for myself, and I find a larger speed difference between CUDA's built-in sincosf() and your code.

On my GTX 750Ti, CUDA 8.0 sincosf() evaluates at 6.09 GigaEvals/Sec.
The Juffariffic replacement evaluates at 11.09 GigaEvals/Sec, 1.82x faster, much better than you report.

On my GTX 950, CUDA 8.0 is 11.78 GigaEvals/Sec. Juftastic: 18.55. 1.57x.

There could certainly be differences due to harness implementation. I also realize that the evals/sec would be a lot lower if the arguments were randomized per thread, since individual warps would then diverge more often, having to execute both the fast path and the slow path. Norbert, perhaps that’s the way you measured? It would probably be the fairest approach, since it exposes the divergence penalty more clearly.

My performance test framework does in fact use different data in every thread, and divergence does occur. It is not completely random data, though, and I set it up to never hit the slow path, as that is unlikely to be exercised in real-life applications.
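In outline, the setup looks something like the sketch below. This is not my actual framework (which is more elaborate), just an indication of how a GigaEvals/sec figure of the kind quoted above can be produced; the sizes, the per-thread argument pattern, and the unroll factor are all arbitrary.

// Stripped-down throughput harness sketch. Each thread receives a different
// argument, all arguments stay far below the slow-path threshold, and a
// number of evaluations per thread amortizes the residual memory traffic.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS       (1024 * 1024)
#define EVALS_PER_THREAD  64

__global__ void sincosf_bench (float *__restrict__ out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = (float)(tid & 0xffff) * 0.25f;  // per-thread argument, fast path only
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < EVALS_PER_THREAD; i++) {
        float s, c;
        sincosf (a + (float)i, &s, &c);       // swap in my_sincosf() to compare
        acc += s + c;
    }
    out[tid] = acc;                           // keep the results live
}

int main (void)
{
    float *out = 0;
    cudaMalloc ((void **)&out, NUM_THREADS * sizeof (float));

    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    sincosf_bench<<<NUM_THREADS / 256, 256>>>(out);   // warm up
    cudaEventRecord (start);
    sincosf_bench<<<NUM_THREADS / 256, 256>>>(out);
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);

    float ms = 0.0f;
    cudaEventElapsedTime (&ms, start, stop);
    double evals = (double)NUM_THREADS * EVALS_PER_THREAD;
    printf ("%.2f GigaEvals/sec\n", evals / (ms * 1.0e-3) / 1.0e9);

    cudaFree (out);
    return 0;
}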

For simplicity, I do not carefully subtract out the residual overhead of the necessary memory operations; as a consequence, the improvements seen with a more carefully calibrated framework should be a bit higher than what I stated.

My goal here was simply to give an idea of what kind of performance improvement might be achieved under reasonably realistic circumstances, not to post the highest possible benchmark number. If that resulted in an “underpromise and overdeliver” scenario here, that suits me just fine :-)

Applied changes to the core approximations for a minor improvement in the number of correctly rounded results. In case someone is wondering about all the recent accuracy-related updates to various previously posted math function codes: I am test-driving a new heuristic for my polynomial approximation generation.

Noticed belatedly that the CUDA compiler (CUDA 8) isn’t able to map the shift sequence in the slowpath to SHF (funnel shift) instructions. So this is now done by hand for >= sm_30, saving seven instructions (the slowpath should be as compact as possible).
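For illustration (this is not necessarily the exact rewrite in the code above): a shift across two adjacent 32-bit words can be expressed with CUDA's funnel-shift intrinsics, which ptxas maps to SHF, instead of leaving the compiler to reconstruct each output word from a shift/shift/OR sequence. The helper name below is made up; in code that also targets older architectures it would sit behind a __CUDA_ARCH__ check, with the plain shift/OR form as the fallback.

// Illustration only: shift the 64-bit quantity hi:lo left by s and return
// the upper 32 bits; ptxas can emit a single SHF instruction for this.
__device__ __forceinline__ unsigned int shl_across (unsigned int lo,
                                                     unsigned int hi,
                                                     unsigned int s)
{
    // plain-C equivalent for s in 1..31: (hi << s) | (lo >> (32 - s))
    return __funnelshift_l (lo, hi, s);
}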

As for performance, I see about 24% higher throughput for my_sincosf() versus the built-in function from CUDA 8, on a Quadro K2200 (sm_50). The test covers the fastpath only, as the slowpath should not come into play in 99.99% of real-life applications.