Since it is from 2016, I’d like to know:
1 - Is it already incorporated in the toolkit? I’m currently using 9.1.
2 - In either case, is sincosf() a direct replacement for __sinf() and __cosf() in this situation:
You can certainly replace the __sinf() and __cosf() in your code with a call to sincosf(). But the basic trade-offs regarding use of intrinsics remain:
1 - sincosf() provides accurate results across the entire possible range of ‘float’ inputs. __sinf() and __cosf() will provide somewhat less accurate results even on the unit circle, and quantization artifacts from the underlying fixed-point computation may be apparent in some use cases. The accuracy of the intrinsics worsens as the arguments increase in magnitude.
2 - Use of __sinf() and __cosf() will be much more efficient (it should be three instructions altogether: an RRO.SINCOS range-reduction instruction followed by MUFU.SIN and MUFU.COS). An accurate sincosf() implementation, on the other hand, requires on the order of ten times as many instructions. Since your code is memory bound, however, that computational efficiency should have little to no bearing on the performance of the kernel.
The easiest way to assess the trade-offs in the context of your use case is to simply try it and profile the resulting code, independent of the underlying implementation of sincosf(). You can also compare CUDA’s built-in sincosf() with the code I posted. That’s a ten-minute experiment altogether.
This kernel takes more or less the same time to run as the other kernels, which operate on exactly the same amount of data and are also memory-bound.
Is it something I should worry about, or is there nothing to be fixed here?
That’s a good indication that there’s nothing to worry about.
As for divergent branches, the sqrtf() implementation certainly uses some branches, although I would not expect that much divergence to occur with those unless your data is exercising the full spectrum of ‘float’ operands. In practice, most use cases involve operands to sqrt() that distribute fairly closely around 1.0, and very little divergence should occur.
I am using values between -1 and +1, and the imaginary part could, as you say, be exercising the full spectrum of float. At some point I get a lot of NaNs out of the computation, which could explain this much divergence from sqrtf().
My equation was also incorrect: the quadrature should be squared (fixed now), so there is certainly more to inspect. But now I have enough information to move forward.
That is a plausible explanation for the observed high percentage of divergent branches, because NaNs are handled by the “slow path” of the sqrtf() implementation.
Da nich’ für. [regional Northern German for: Don’t mention it]