arithmetical functions various

RafaA · July 5, 2009, 12:50am

How is the sine function realized by the PTX sin instruction?

Great

Rafał

Simon_Green · July 6, 2009, 9:40am

The hardware includes a sin instruction (executed in the SFU), whose result is further refined in software by CUDA.

Simon_Green · July 6, 2009, 4:55pm

It turns out this isn’t quite correct.

The CUDA sinf() function is actually a pure software implementation. You can look in the header files for the details.

The PTX sin instruction (the __sinf() intrinsic in CUDA) maps to a two-instruction sequence in hardware: the FMAD-pipe instruction RRO (range reduction and conversion to fixed-point), followed by the SFU’s SIN instruction.
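To make the distinction concrete, here is a minimal sketch that evaluates both sinf() and __sinf() on the same inputs and reports the largest difference between them. The kernel name compareSin, the sample count, and the [-pi, pi] input range are illustrative assumptions, not anything taken from the CUDA headers. Note that compiling with -use_fast_math makes sinf() compile to __sinf(), so the comparison only means something without that flag.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: evaluate the library sinf()
// and the __sinf() intrinsic on the same inputs.
__global__ void compareSin(const float* x, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // software implementation from the math library
        fast[i]     = __sinf(x[i]);  // intrinsic that maps to the PTX sin instruction
    }
}

int main()
{
    const int n = 1024;  // assumed sample count
    static float hx[n], ha[n], hf[n];

    // Assumed input range: evenly spaced samples in [-pi, pi].
    for (int i = 0; i < n; ++i)
        hx[i] = -3.14159265f + 6.28318531f * i / (n - 1);

    float *dx, *da, *df;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&df, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    compareSin<<<(n + 255) / 256, 256>>>(dx, da, df, n);

    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(ha, da, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hf, df, n * sizeof(float), cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(ha[i] - hf[i]));

    printf("max |sinf(x) - __sinf(x)| on [-pi, pi]: %g\n", maxDiff);

    cudaFree(dx);
    cudaFree(da);
    cudaFree(df);
    return 0;
}
```

The size of the difference depends on the GPU and toolkit version, so no expected value is given here. Widening the input range should show __sinf() losing accuracy for large arguments.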

If you’re interested in the details, the SFU is described in this paper:

Stuart F. Oberman and Michael Siu, “A high-performance area-efficient multifunction interpolator”, Proceedings of the 17th IEEE Symposium on Computer Arithmetic (2005), pp. 272-279

Sylvain_Collange · July 6, 2009, 5:23pm

Simon,

Thanks for giving such details. This is the first time I have seen the RRO instruction mentioned outside of NVIDIA patents. Just one technical question though…

According to the CUDA manual (5.1.1.1), the throughput of __sinf() is 1 operation/cycle, or 32 cycles/warp.

__log2f() (*), which doesn’t require range reduction, has a throughput of 2 ops/cycle, or 16 cycles/warp.

Subtracting the two (32 - 16), this would mean the RRO instruction requires 16 cycles/warp to execute. This doesn’t sound right for a FMAD-pipe instruction (they usually execute in 4 cycles/warp inside the 8 FMAD units).

Is there something I overlooked, is the manual inaccurate, or is RRO actually an SFU instruction?

(*) The CUDA 2.0 manual mentions __logf() and __expf(). I assume it meant __log2f() and __exp2f() instead?

RafaA · July 7, 2009, 6:39am

Thanks.
What kind of polynomial does NVIDIA use?
I have not read the paper, but I know the subject.

Rafał

Sylvain_Collange · July 7, 2009, 7:31am

It is based on successive Remez polynomial approximations, rounding the coefficients to the target precision one by one.

As Simon suggested, you can read the paper he cited, which describes the hardware unit, and cuda/include/math_functions.h for the software implementation.

BTW, my experiments confirm that the RRO instruction has a throughput of 4 cycles/warp.

Simon_Green · July 7, 2009, 9:11am

Yes, this looks like a bug in the documentation.

RafaA · July 7, 2009, 11:25am

Thanks

Rafał
