Why is the boundary point between the fast and slow paths for trigonometric functions in CUDA set at 105615.0f for single precision and 2147483648.0 for double precision? How were these values derived?
Why can’t the argument of a periodic trigonometric function simply be reduced by dividing the input by the period?
The fast paths use a three-stage Cody-Waite reduction. Even though its accuracy is enhanced by the use of FMA (fused multiply-add), the process only produces accurate results over a limited range: the three constants used can provide only so many bits of π. Outside that range the more expensive Payne-Hanek reduction scheme is used, with enough bits of π provided to cover the entire numeric range of float and double, respectively.
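To make that concrete, here is a minimal sketch of what an FMA-enhanced three-stage Cody-Waite reduction looks like in single precision. The function name is made up, and the three constants are merely one illustrative high/medium/low splitting of π/2; the exact constants inside the CUDA math library may differ.

```c
#include <math.h>

/* Sketch only, not CUDA library code: reduce x to roughly [-pi/4, +pi/4]
   and report the quadrant. pi/2 is consumed in three pieces, so far more
   bits of pi/2 take part than a single float constant could hold. Each
   fmaf() is a single fused multiply-add, so the product j * constant is
   not rounded before the addition. Valid only while |x| stays below the
   single-precision fast-path switchover point. */
static float trig_reduce_fast (float x, int *quadrant)
{
    float j = rintf (x * 0.636619747f);        /* j = round (x / (pi/2)); 2/pi */
    float r = fmaf (j, -1.57079601e+00f, x);   /* subtract j * pi/2: high bits */
    r       = fmaf (j, -3.13916473e-07f, r);   /*                  medium bits */
    r       = fmaf (j, -5.39030253e-15f, r);   /*                     low bits */
    *quadrant = ((int) j) & 3;                 /* selects the sine/cosine core */
    return r;
}
```

The three constants jointly encode many more bits of π/2 than a single float can, but still only a finite number; once the quotient j grows large enough, the missing bits (plus the rounding of the intermediate results) start to show up in the reduced argument, which is what forces the switch to Payne-Hanek.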
For performance reasons, the numeric values of the switchover points have been maximized. If they were pushed out any further, argument reduction would become inaccurate in a hurry.
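This is easy to observe with a small host-side experiment (my own illustration, not library code): reduce a few arguments by subtracting the nearest multiple of a single rounded float value of π/2, which is essentially the “just divide by the period” idea from the question, and compare against a double-precision reduction of the same quotient.

```c
#include <math.h>
#include <stdio.h>

/* Toy comparison: naive one-constant reduction vs. a double-precision
   reference for the same quotient. The absolute error of the naive
   reduction grows roughly in proportion to the quotient j; it hurts most
   when the reduced argument is small, i.e. near the zeros of sine and
   cosine, where the relative error of the final result blows up. */
int main (void)
{
    const float x_vals[] = { 10.0f, 1000.0f, 105615.0f, 1.0e7f };
    for (int i = 0; i < 4; i++) {
        float  x = x_vals[i];
        float  j = rintf (x * 0.636619747f);                   /* x / (pi/2)   */
        float  r = fmaf  (j, -1.57079637f, x);                 /* rounded pi/2 */
        double R = (double)x - (double)j * 1.5707963267948966; /* reference    */
        printf ("x = %12.1f  j = %9.0f  |reduction error| = %.3e\n",
                x, j, fabs ((double)r - R));
    }
    return 0;
}
```

A single rounded float value of π/2 is off by roughly 4e-8, so the error scales with the quotient and already exceeds single-precision accuracy at moderate arguments; the three-constant Cody-Waite split pushes the same failure mode out to the published switchover points, and Payne-Hanek removes it by carrying enough bits of π for the whole type range.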
The algorithms by Cody & Waite and Payne & Hanek are the best reduction schemes for trigonometric functions known in the literature. I am not aware of any practical alternative schemes that provide accurate argument reduction and thus accurate function results. Pointers to relevant newer peer-reviewed publications welcome, I don’t keep up with the literature on a monthly basis.
Depending on your usage of trigonometric functions, you would want to look at the sinpi, cospi, and sincospi functions (and their single-precision equivalents), which provide faster operation for the common situation where the function argument is a multiple of π. I posted a sample implementation of tanpif() in these forums a few years back. If you need that, I would suggest filing an enhancement request with NVIDIA.
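As a hypothetical illustration of that family (the kernel name and data layout below are invented, but sinpif(), cospif(), and sincospif() are standard CUDA math API calls): because the argument is the multiplier of π, no rounded value of π ever enters the computation, and reduction boils down to splitting off the integer part, which is exact.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

/* Compute sin(pi * t[i]) and cos(pi * t[i]) per element. Compared with
   sinf((float)M_PI * t[i]), no rounded value of pi is multiplied into
   the argument and no heavyweight argument reduction is required. */
__global__ void phase_kernel (const float *t, float *s, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sincospif (t[i], &s[i], &c[i]);
    }
}

int main (void)
{
    const int n = 8;
    float ht[n], hs[n], hc[n];
    float *dt, *ds, *dc;
    for (int i = 0; i < n; i++) ht[i] = 0.25f * i;       /* multiples of pi/4 */
    cudaMalloc (&dt, n * sizeof (float));                /* error checking    */
    cudaMalloc (&ds, n * sizeof (float));                /* omitted for       */
    cudaMalloc (&dc, n * sizeof (float));                /* brevity           */
    cudaMemcpy (dt, ht, n * sizeof (float), cudaMemcpyHostToDevice);
    phase_kernel<<<1, n>>>(dt, ds, dc, n);
    cudaMemcpy (hs, ds, n * sizeof (float), cudaMemcpyDeviceToHost);
    cudaMemcpy (hc, dc, n * sizeof (float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf ("sinpif(%.2f) = % .8f   cospif(%.2f) = % .8f\n",
                ht[i], hs[i], ht[i], hc[i]);
    }
    cudaFree (dt); cudaFree (ds); cudaFree (dc);
    return 0;
}
```

For arguments that naturally arise as multiples of π, such as phases of the form π·t, this avoids multiplying by a rounded π in source code and is also faster, as noted above.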