__device__ and __host__ qualifiers in same function

Carlo_del_Mundo · February 19, 2012, 6:52am

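#define PI 3.14159265358979f  // single-precision pi, assumed to be defined along these lines

// Computes the k-th output of a naive DFT over num_elements complex inputs.
float2 __device__ __host__ dft_calculation(float2 *input, int k, int num_elements)
{
    float2 sum = make_float2(0.0f, 0.0f);
    for (int j = 0; j < num_elements; ++j)
    {
        float theta = -2.0f*PI*k*j/num_elements;
        float2 omega = make_float2(cos(theta), sin(theta));
        // Complex multiply-accumulate, written out because float2 has no built-in operators
        sum.x += omega.x*input[j].x - omega.y*input[j].y;
        sum.y += omega.x*input[j].y + omega.y*input[j].x;
    }
    return sum;
}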

1. How do cos() and sin() translate to their CUDA counterparts? Specifically, do they map directly to a CUDA device implementation rather than, say, a C library? In other words, when nvcc compiles this code, which versions of cos() and sin() does it use?

2. For performance, I’ve read a little about CUDA intrinsic functions such as __cosf and __sinf. Would it be beneficial to call these intrinsics directly rather than let the compiler ‘do what’s best’?

tera · February 19, 2012, 5:18pm

It calls device code when compiled for the GPU, or host code on the CPU.

Calling __cosf() or __sinf() directly definitely improves performance, provided the reduced accuracy and parameter range are sufficient for your application.
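
One practical wrinkle when calling the intrinsics from a function marked both __host__ and __device__: the intrinsics exist only in device code, so the call has to be guarded. The following is a minimal sketch, assuming the standard __CUDA_ARCH__ macro is used to select the code path. The helper name fast_sincos and the HD macro are made up for illustration, and the empty fallback for HD is there only so the snippet also builds with a plain C++ compiler.

```cpp
#include <cmath>

// HD expands to the CUDA qualifiers under nvcc and to nothing otherwise
// (hypothetical convenience macro, not part of CUDA).
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: sine and cosine of the same angle.
HD inline void fast_sincos(float theta, float *s, float *c)
{
#ifdef __CUDA_ARCH__
    // Device compilation pass: fast, reduced-accuracy hardware intrinsic.
    __sincosf(theta, s, c);
#else
    // Host compilation pass: ordinary C++ math library.
    *s = std::sin(theta);
    *c = std::cos(theta);
#endif
}
```

With this arrangement the host and device results can differ slightly, which is worth keeping in mind when comparing GPU output against a CPU reference.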

njuffa · February 19, 2012, 8:42pm

Whenever both sine and cosine of the same argument are computed, use sincos() or sincosf(), which are faster thanks to shared argument reduction. In this case, since the input argument is multiplied by PI, one would actually want sincospi() / sincospif() but those do not currently exist in CUDA.

__sinf() and __cosf() are not really restricted in their argument range. But they become less and less accurate as the magnitude of their argument increases, so for practical reasons one would want to stick to a fairly narrow range (e.g. +/- 2*PI). Due to quantization effects __sinf() is not very smooth close to zero (pronounced steps), which makes the device intrinsic unsuitable for some codes. Here one would want to use __sincosf() since both sine and cosine are needed.
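
Applied to the DFT loop from the original post, the sincosf() suggestion could look roughly like the sketch below. This is an illustration, not code from the thread: the names dft_calculation_sincos, sin_cos and HD are hypothetical, and the float2 stand-in exists only so the snippet also compiles without the CUDA headers. The portable branch falls back to separate std::sin and std::cos calls.

```cpp
#include <cmath>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
// Plain C++ build: minimal stand-ins for the CUDA vector type and qualifiers.
#define HD
struct float2 { float x, y; };
inline float2 make_float2(float x, float y) { float2 v = {x, y}; return v; }
#endif

// Sine and cosine of one angle, in a single call where the toolchain offers it.
HD inline void sin_cos(float theta, float *s, float *c)
{
#ifdef __CUDACC__
    sincosf(theta, s, c);   // CUDA math library: one shared argument reduction
#else
    *s = std::sin(theta);   // portable fallback
    *c = std::cos(theta);
#endif
}

// k-th output of a naive DFT, computing each twiddle factor with one call.
HD inline float2 dft_calculation_sincos(const float2 *input, int k, int num_elements)
{
    const float two_pi = 6.28318530717958648f;
    float2 sum = make_float2(0.0f, 0.0f);
    for (int j = 0; j < num_elements; ++j)
    {
        float theta = -two_pi * k * j / num_elements;
        float s, c;
        sin_cos(theta, &s, &c);
        // (c + i*s) * (input[j].x + i*input[j].y), accumulated into sum
        sum.x += c * input[j].x - s * input[j].y;
        sum.y += c * input[j].y + s * input[j].x;
    }
    return sum;
}
```

Swapping sincosf() for __sincosf() in the device path would trade accuracy for speed, subject to the accuracy caveats described above.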

seibert · February 19, 2012, 11:14pm

Are these large steps near zero what drives the 2**(-21.41) error bound listed for __sinf() in the Programming Guide?

njuffa · February 20, 2012, 1:14am

The special function unit in the GPU uses fixed-point interpolation to generate the function values, as described in the following paper:

Stuart F. Oberman, Michael Y. Siu: A High-Performance Area-Efficient Multifunction Interpolator. IEEE Symposium on Computer Arithmetic 2005: 272-279

Normally, in floating-point, sin(x) = x for very small x. But due to the fixed-point quantization, __sinf(x) is zero for very small x, and the function values increase in multiples of the quantization step from there. As a consequence, for arguments of small magnitude absolute error is small, but relative error is high. For the error bound stated in the Programming Guide I simply picked a reasonable interval and had the test app try all arguments inside the interval, so I do not offhand know where the largest error occurs. I guess that the largest absolute error of 2**(-21.41) likely happens close to the interval bounds, not close to zero.
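
The combination of small absolute error and large relative error near zero can be illustrated on the host with a toy model. This is not how the special function unit works internally: the sketch simply rounds sin(x) to the nearest multiple of a made-up step (QUANT_STEP = 2^-22, chosen purely for illustration), and all function names are hypothetical.

```cpp
#include <cmath>

// Made-up quantization step for illustration; NOT the hardware's actual step.
const double QUANT_STEP = 1.0 / 4194304.0;   // 2^-22

// Toy model: sin(x) rounded to the nearest multiple of QUANT_STEP.
inline double quantized_sin(double x)
{
    return std::floor(std::sin(x) / QUANT_STEP + 0.5) * QUANT_STEP;
}

// Absolute error of the toy model against the double-precision reference.
inline double abs_error(double x)
{
    return std::fabs(quantized_sin(x) - std::sin(x));
}

// Relative error of the toy model (x must not be a multiple of pi).
inline double rel_error(double x)
{
    return abs_error(x) / std::fabs(std::sin(x));
}
```

In this model, x = 1e-8 is flushed to zero, so the absolute error is only about 1e-8 while the relative error is 100%. At x = 1.0 the relative error stays below roughly 1.5e-7.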

The reason that the error in __sinf(), __cosf() increases with the magnitude of the input is that the argument reduction does not reduce using mathematical π, but uses a machine approximation PI instead. It therefore incurs an ever increasing phase shift as the magnitude of the argument increases. By contrast, sinf() and cosf() reduce their input arguments using an approximation to π sufficiently accurate that no phase shift occurs across the entire input domain, i.e. the trig function argument reduction behaves as if one had used an infinitely precise mathematical π.
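
A back-of-the-envelope, host-side illustration of that growing phase shift follows. It assumes the reduction removes whole periods of 2*PI, with PI being π rounded to single precision. The actual hardware reduction differs in detail, so the numbers are only indicative, and the name phase_shift is hypothetical.

```cpp
#include <cmath>

// pi rounded to single precision, then widened back to double for comparison.
const double PI_F = static_cast<double>(3.14159265358979323846f);
// Double-precision pi, standing in for the "mathematical" value.
const double PI_D = 3.14159265358979323846;

// Phase error accumulated after removing `periods` full periods of 2*PI_F
// instead of 2*pi. It grows linearly with the magnitude of the argument.
inline double phase_shift(double periods)
{
    return periods * 2.0 * (PI_F - PI_D);
}
```

Under these assumptions, an argument about 1000 periods away from zero (x around 6283) is already off by roughly 1.7e-4 radians, well above the error bound quoted for arguments near zero.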

This is a classical tradeoff between performance on one hand, and accuracy and preservation of mathematical properties on the other. I have encountered at least one app that ran into trouble with __sinf() due to the quantization effect near zero. My recommendation is to first code CUDA kernels using the standard math functions, and to start experimenting with replacing individual calls with the equivalent intrinsics only if performance proves insufficient.
