Read appendix B of the programming guide to understand CUDA’s numeric function options and tradeoffs.
sin() uses local memory as part of a lookup table. This minimizes register use in general. Imagine you have a tuned kernel that uses exactly N registers, and then a single cos() call bumps your register count to N+5 or whatever and changes your allowable blocks per SM. It could be a disaster.
But, but, but… you cry, it’s so slow!
But that’s why you should use __sinf(), which is fast and uses few registers. The tradeoff is that its maximum absolute error is about 2^-21, and that bound only holds for arguments from -pi to pi.
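If you want to see the accuracy tradeoff on your own card, something like the following rough sketch should do. It runs sinf() and __sinf() on the same inputs in [-pi, pi] and reports the largest difference. The names sin_both and max_abs_diff are made up for this example, error checking is left out for brevity, and it assumes you are not compiling with -use_fast_math (which turns sinf() into __sinf() anyway).

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Evaluate both variants on the same inputs so they can be compared on the host.
__global__ void sin_both(const float* x, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // standard single-precision sine
        fast[i]     = __sinf(x[i]);  // fast intrinsic, reduced accuracy
    }
}

// Hypothetical helper: largest |sinf(x) - __sinf(x)| over n evenly spaced
// samples in [-pi, pi]. Requires n >= 2. CUDA error checks omitted.
float max_abs_diff(int n)
{
    const float pi = 3.14159265f;
    std::vector<float> hx(n), ha(n), hf(n);
    for (int i = 0; i < n; ++i)
        hx[i] = -pi + 2.0f * pi * i / (n - 1);

    size_t bytes = n * sizeof(float);
    float *dx, *da, *df;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&da, bytes);
    cudaMalloc(&df, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);

    sin_both<<<(n + 255) / 256, 256>>>(dx, da, df, n);

    // These copies run after the kernel on the default stream.
    cudaMemcpy(ha.data(), da, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hf.data(), df, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(da);
    cudaFree(df);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = fmaxf(worst, fabsf(ha[i] - hf[i]));
    return worst;
}
```

Call max_abs_diff(1 << 20) from your main() and print the result. Widening the input range well beyond [-pi, pi] should show the difference growing, which is the other half of the tradeoff.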
“The argument reduction for trig function has a fast path and a slow path (for very large arguments). In practice it is very unlikely that the slow path will ever be exercised, but it needs to be there for correctness. To reduce register usage in the slow path, some local memory is used. Local memory is not used in the fast path. This applies to both single-precision and double-precision versions.”
Maybe it would be nice to include this information in the Programming Guide. Seeing that sudden local memory usage and not knowing where it comes from (or what it is going to do) is a bit scary :)
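For anyone who wants to check where the local memory shows up rather than guess, here is a small sketch that asks the runtime for each kernel's register count and per-thread local memory size via cudaFuncGetAttributes(). The kernels k_sinf and k_fast and the helper report are hypothetical names for this example, and the numbers you get will depend on your compiler version and target architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same trivial kernel twice: once with the standard function,
// once with the fast intrinsic.
__global__ void k_sinf(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = sinf(data[i]);
}

__global__ void k_fast(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = __sinf(data[i]);
}

// Hypothetical helper: print registers and local memory per thread
// for one kernel, and return the CUDA status of the query.
cudaError_t report(const char* name, const void* kernel)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err == cudaSuccess)
        printf("%-8s registers/thread: %d  local bytes/thread: %zu\n",
               name, attr.numRegs, attr.localSizeBytes);
    else
        printf("%-8s query failed: %s\n", name, cudaGetErrorString(err));
    return err;
}
```

Calling report("sinf", (const void*)k_sinf) and report("__sinf", (const void*)k_fast) from main() prints the two sets of numbers side by side. Compiling with nvcc -Xptxas -v reports the same kind of information at build time.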