CUDA Trigonometric Function Issues

Why is the boundary point between the fast and slow paths for trigonometric functions in CUDA set at 105615.0f for single precision and 2147483648.0 for double precision? How were these values derived?
Why can’t periodic trigonometric functions be scaled down by dividing the input by the period?

The fast paths use a three-stage Cody-Waite reduction. Even with the accuracy boost provided by FMA (fused multiply-add), this process produces accurate results only over a limited range, because the three constants used can encode only so many bits of π. Outside that range, the more expensive Payne-Hanek reduction scheme is used, with enough bits of π stored to cover the entire numeric range of float and double, respectively.
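To make the fast path concrete, here is a rough single-precision sketch of a three-constant Cody-Waite reduction using FMA. The split of π/2 into three float parts below is a commonly published one and is meant only to illustrate the technique; it is not necessarily the exact set of constants CUDA's math library uses.

```cuda
// Sketch: three-stage Cody-Waite reduction of a into roughly [-pi/4, pi/4]
// plus a quadrant index, assuming |a| stays within the fast-path range.
__device__ float trig_red_cody_waite_f(float a, int *quadrant)
{
    // pi/2 split into three single-precision parts (high, middle, low);
    // illustrative values, not necessarily those used by libdevice.
    const float PIO2_HI = 1.5707962513e+00f;
    const float PIO2_MI = 7.5497894159e-08f;
    const float PIO2_LO = 5.3903029e-15f;

    // Nearest multiple of pi/2: j = round(a * 2/pi)
    float j = rintf(a * 0.636619772f);
    *quadrant = (int)j;

    // Subtract j*pi/2 in three FMA steps; each step removes further bits of
    // the product, so accuracy holds only while j (and hence |a|) is small.
    float r = fmaf(j, -PIO2_HI, a);
    r = fmaf(j, -PIO2_MI, r);
    r = fmaf(j, -PIO2_LO, r);
    return r;
}
```

The three constants together supply only a limited number of bits of π/2, which is exactly why the scheme degrades once the quotient j grows large enough that the missing bits start to matter.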

For performance reasons, the numeric values of the switchover points have been pushed as high as possible. If they were pushed out any further, argument reduction would become inaccurate in a hurry.
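For a sense of how the pieces fit together, a sinf()-style routine might dispatch roughly as follows. Only the single-precision threshold comes from the question above; the helper names and overall structure are hypothetical placeholders, not NVIDIA's actual implementation.

```cuda
// Hypothetical helper prototypes standing in for the library internals.
__device__ float trig_red_cody_waite_f(float a, int *q);   // fast path (sketched earlier)
__device__ float trig_red_payne_hanek_f(float a, int *q);  // slow path: full-width bits of pi
__device__ float sin_core_f(float r, int q);                // quadrant-aware polynomial core

__device__ float my_sinf(float a)
{
    int q;
    // Small arguments take the cheap Cody-Waite reduction; large ones fall
    // back to the expensive but range-complete Payne-Hanek reduction.
    float r = (fabsf(a) < 105615.0f)
            ? trig_red_cody_waite_f(a, &q)
            : trig_red_payne_hanek_f(a, &q);
    return sin_core_f(r, q);
}
```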

The algorithms by Cody & Waite and by Payne & Hanek are the best argument-reduction schemes for trigonometric functions known in the literature. I am not aware of any practical alternative schemes that provide accurate argument reduction and thus accurate function results. Pointers to relevant newer peer-reviewed publications are welcome; I don't keep up with the literature on a monthly basis.

Depending on your usage of trigonometric functions, you may want to look at the sinpi, cospi, and sincospi functions (and their single-precision counterparts sinpif, cospif, and sincospif), which provide faster operation for the common case where the desired argument is a multiple of π: sinpi(x) computes sin(πx), so the factor of π never has to be represented explicitly and argument reduction becomes trivial. CUDA does not provide tanpif(); I posted a sample implementation in these forums a few years back. If you need it, I would suggest filing an enhancement request with NVIDIA.
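As a quick usage sketch (my example, not something from the original post): when angles naturally arrive in degrees or in turns, scaling them for sincospif() means the code never multiplies by an inexact floating-point approximation of π, and argument reduction inside the function is trivial.

```cuda
// Minimal usage sketch: per-thread sine/cosine of angles given in degrees.
__global__ void rotate_by_degrees(const float *deg, float2 *sc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        // sin(deg * pi/180) == sinpi(deg / 180); the right-hand form avoids
        // forming the product with a rounded value of pi.
        sincospif(deg[i] / 180.0f, &s, &c);
        sc[i] = make_float2(s, c);
    }
}
```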
