__device__ and __host__ qualifiers in same function

float2 __device__ __host__ dft_calculation(float2 *input, int k, int num_elements)

{

    float2 sum = make_float2(0.0f, 0.0f);

    for (int j = 0; j < num_elements; ++j)

    {   

        float theta = -2.0f*PI*k*j/num_elements;

        float2 omega = make_float2(cos(theta), sin(theta));

        sum += omega * input[j];

    }   

    return sum;

}

    How does cos(), sin() translate to its CUDA counterpart? Specifically, does it map directly to a CUDA-device-implementation rather than say a C library? Basically, when nvcc compiles this code, which version of cos() and sin() does it use?

    For performance, I’ve read a little bit about CUDA intrinsic functions such as __cosf, __sinf. Would it be beneficial to directly call these intrinsics rather than let the compiler ‘do what’s best’?

It calls device code when compiled for the GPU, or host code on the CPU.

Calling __cosf() or __sinf() directly definitely improves performance, provided the reduced accuracy and parameter range are sufficient for your application.

Whenever both sine and cosine of the same argument are computed, use sincos() or sincosf(), which are faster thanks to shared argument reduction. In this case, since the input argument is multiplied by PI, one would actually want sincospi() / sincospif() but those do not currently exist in CUDA.

__sinf(), __cosf() are not really restricted in their argument range. But they become less and less accurate as the magnitude of their argument increases, so for practical reasons one would want to stick to a fairly narrow range (e.g. +/- 2*PI). Due to quantization effects __sinf() is not very smooth close to zero (pronounced steps), which makes the device intrinsic unsuitable for some codes. Here one would want to use __sincosf() since both sine and cosine are neeeded.

Are these large steps near zero what drive the 2**(-21.41) error bound listed for __sinf() in the Programming Guide?

The special function unit in the GPU uses fixed-point interpolation to generate the function values, as described in the following paper:

Stuart F. Oberman, Michael Y. Siu: A High-Performance Area-Efficient Multifunction Interpolator. IEEE Symposium on Computer Arithmetic 2005: 272-279

Normally, in floating-point, sin(x) = x for very small x. But due to the fixed-point quantization, __sinf(x) is zero for very small x, and the function values increase in multiples of the quantization step from there. As a consequence, for arguments of small magnitude absolute error is small, but relative error is high. For the error bound stated in the Programming Guide I simply picked a reasonable interval and had the test app try all arguments inside the interval, so I do not offhand know where the largest error occurs. I guess that the largest absolute error of 2**(-21.41) likely happens close to the interval bounds, not close to zero.

The reason that the error in __sinf(), __cosf() increases with the magnitude of the input is that the argument reduction does not reduce using mathematical π, but uses a machine approximation PI instead. It therefore incurs an ever increasing phase shift as the magnitude of the argument increases. By contrast, sinf() and cosf() reduce their input arguments using an approximation to π sufficiently accurate that no phase shift occurs across the entire input domain, i.e. the trig function argument reduction behaves as if one had used an infinitely precise mathematical π.

This is a classical tradeoff between performance on one hand, and accuracy and preservation of mathematical properties on the other hand. I have encountered at least one app that ran into trouble with __sinf() due to the quantization effect near zero. My recommendation is to first code CUDA kernels using the standard math functions, and only if performance is insufficient to start experimenting with replacing individual calls with the equivalent intrinsics.