How is the sine function (the sin instruction in PTX) implemented?

Regards,

Rafał

The hardware includes a sin instruction (executed in the SFU), which CUDA then refines in software.

It turns out this isn’t quite correct.

The CUDA sinf() function is actually a pure software implementation. You can look in the header files for the details.

The PTX sin instruction (the __sinf() intrinsic in CUDA) maps to a two-instruction sequence in the hardware: the FMAD-pipe instruction RRO (range reduction and conversion to fixed-point) followed by the SFU’s SIN instruction.
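To make the two-step structure concrete, here is a minimal Python sketch. It is not the hardware algorithm: the 24-bit fixed-point width, the function names, and the use of math.sin as a stand-in for the SFU's interpolator are all assumptions made for illustration. It only shows the shape of "reduce the argument to a fixed-point phase, then evaluate sine on that phase".

```python
import math

FRAC_BITS = 24  # hypothetical fixed-point width, chosen only for illustration


def range_reduce(x):
    """Stand-in for RRO: map x (radians) to a fixed-point fraction of one period."""
    turns = x / (2.0 * math.pi)
    frac = turns - math.floor(turns)  # fractional part, nominally in [0, 1)
    # The modulo guards against frac rounding up to exactly 1.0.
    return int(frac * (1 << FRAC_BITS)) % (1 << FRAC_BITS)


def sin_from_fixed(phase):
    """Stand-in for the SFU SIN step: evaluate sine from the fixed-point phase."""
    return math.sin(2.0 * math.pi * phase / (1 << FRAC_BITS))


def fast_sin(x):
    """Two-step evaluation: range reduction, then evaluation on the reduced phase."""
    return sin_from_fixed(range_reduce(x))


if __name__ == "__main__":
    for x in (0.0, 1.0, -2.5, 100.0):
        print(x, fast_sin(x), math.sin(x))
```

In this sketch the accuracy is bounded by the phase quantization, so a wider or narrower fixed-point format would trade precision against hardware cost.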

If you’re interested in the details, the SFU is described in this paper:

Stuart F. Oberman and Michael Siu, “A high-performance area-efficient multifunction interpolator”, Proceedings of the 17th IEEE Symposium on Computer Arithmetic (2005), pp. 272-279

Simon,

Thanks for giving such details. This is the first time I have seen the RRO instruction mentioned outside of NVIDIA patents. Just one technical question though…

According to the CUDA manual (5.1.1.1), the throughput of __sinf() is 1 operation/cycle, or 32 cycles/warp.

__log2f (*), which doesn’t require range reduction, has a throughput of 2 ops/cycle, or 16 cycles/warp.

That would mean the RRO instruction requires 16 cycles/warp to execute. This doesn’t sound right for an FMAD-pipe instruction (they usually execute in 4 cycles/warp on the 8 FMAD units).
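The arithmetic behind this question can be written out as a small Python sketch. The throughput figures are the ones quoted above from the manual; the variable names are mine, and the subtraction assumes (as the post does) that the SIN step alone costs the same as __log2f() and that the two costs simply add.

```python
WARP_SIZE = 32  # threads per warp


def cycles_per_warp(ops_per_cycle):
    """Cycles needed to issue one instruction for all threads of a warp."""
    return WARP_SIZE / ops_per_cycle


sinf_intrinsic = cycles_per_warp(1)  # __sinf(): 1 op/cycle per the manual
log2_intrinsic = cycles_per_warp(2)  # __log2f(): 2 ops/cycle per the manual

# Assumption from the post: the difference is attributed entirely to RRO.
implied_rro = sinf_intrinsic - log2_intrinsic

# An instruction running on the 8 FMAD units.
fmad_pipe = cycles_per_warp(8)

print("__sinf :", sinf_intrinsic, "cycles/warp")
print("__log2f:", log2_intrinsic, "cycles/warp")
print("implied RRO cost:", implied_rro, "cycles/warp")
print("typical FMAD-pipe cost:", fmad_pipe, "cycles/warp")
```

The implied 16 cycles/warp for RRO is four times the usual FMAD-pipe figure, which is the mismatch being asked about.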

Is there something I overlooked, is the manual inaccurate, or is RRO actually an SFU instruction?

(*) The CUDA 2.0 manual mentions __logf() and __expf(). I assume it meant __log2f() and __exp2f() instead?

thanks

What kind of polynomial does NVIDIA use?

I have not read the paper, but I am familiar with the topic.

Rafał

It is based on successive Remez polynomial approximations, rounding the coefficients to the target precision one by one.
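As a rough illustration of why coefficient rounding matters, the Python sketch below evaluates an odd polynomial for sine with coefficients rounded to single precision. It is a simplification under stated assumptions: it uses Taylor coefficients rather than Remez (minimax) coefficients, and it rounds all coefficients at once instead of refitting after each rounding step as described above. The helper names are hypothetical.

```python
import math
import struct


def to_f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


# Taylor coefficients of x^3, x^5, x^7 for sin(x). NOT Remez coefficients.
COEFFS = [-1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0]
COEFFS_F32 = [to_f32(c) for c in COEFFS]


def sin_poly(x, coeffs):
    """Evaluate x + x^3 * (c3 + c5*x^2 + c7*x^4) with Horner's scheme in x^2."""
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c
    return x + x * x2 * acc


def max_error(coeffs, samples=1000):
    """Largest absolute error against math.sin on [-pi/4, pi/4]."""
    worst = 0.0
    for i in range(samples + 1):
        x = -math.pi / 4 + (math.pi / 2) * i / samples
        worst = max(worst, abs(sin_poly(x, coeffs) - math.sin(x)))
    return worst


if __name__ == "__main__":
    print("exact coefficients  :", max_error(COEFFS))
    print("rounded coefficients:", max_error(COEFFS_F32))
```

A successive scheme would instead fix each rounded coefficient and refit the remaining ones, so that later coefficients compensate for the rounding error of earlier ones.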

As Simon suggested, you can read the paper cited above, which describes the hardware unit, and cuda/include/math_functions.h for the software implementation.

BTW, my experiments confirm that the RRO instruction has a throughput of 4 cycles/warp.

Yes, this looks like a bug in the documentation.

Thanks

Rafał