Fermi and Kepler GPU Special Function Units

JFSebastian · February 19, 2013, 3:17pm

The Fermi GPUs have Special Function Units (SFUs) to (quoting the NVIDIA White Paper on Fermi) "execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock".

My questions are:

1. Do SFUs operate on single- and double-precision numbers, or on single precision only?
2. Do SFUs introduce any loss of accuracy in the computations?
3. Are SFUs related to the use of intrinsics like __sinf(), __cosf(), etc.?
4. Are the functionalities of the Kepler SFUs the same as those of the Fermi SFUs?

Thank you very much in advance for any answer.

seibert · February 19, 2013, 4:46pm

1. The SFUs work on single-precision numbers only.
2. Yes, see #3.
3. The SFU instructions are the implementation of intrinsic functions like __sinf(), __cosf(), etc. Those functions have limited precision, as detailed in Table 7 of the CUDA Programming Guide. When you call cos(), you do not use the SFU, but instead perform several FMAD instructions that implement a more precise approximation of the transcendental function.

   If you pass -use_fast_math to nvcc, it will automatically use the intrinsic versions of the transcendentals; otherwise you have to call them explicitly.

4. This I don’t know. I haven’t seen any indication in the documentation, but I’m not sure.
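To make the accuracy difference in #3 concrete, here is a minimal sketch that evaluates sinf() and __sinf() at the same points and reports the largest difference. The helper name max_sin_difference, the sample range [-pi, pi], and the launch configuration are my own assumptions for illustration; nothing here comes from NVIDIA's documentation beyond the two functions themselves.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each thread evaluates both versions at one sample point.
__global__ void sin_both(const float* x, float* precise, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(x[i]);    // CUDA math library function
        fast[i]    = __sinf(x[i]);  // device intrinsic (limited precision)
    }
}

// Hypothetical helper (not part of CUDA): largest |sinf(x) - __sinf(x)| over
// n >= 2 evenly spaced points in [-pi, pi]. Returns -1.0f on a CUDA error.
float max_sin_difference(int n)
{
    if (n < 2) return -1.0f;
    const float pi = 3.14159265f;
    std::vector<float> hx(n), hp(n), hf(n);
    for (int i = 0; i < n; ++i)
        hx[i] = -pi + 2.0f * pi * i / (n - 1);

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dp = nullptr, *df = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess
           && cudaMalloc(&dp, bytes) == cudaSuccess
           && cudaMalloc(&df, bytes) == cudaSuccess
           && cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sin_both<<<(n + 255) / 256, 256>>>(dx, dp, df, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hp.data(), dp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(hf.data(), df, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dx); cudaFree(dp); cudaFree(df);
    if (!ok) return -1.0f;

    // Compare the two result arrays on the host.
    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(hp[i] - hf[i]));
    return worst;
}
```

Call max_sin_difference(1024) from a main() and print the result. Note that if the file is compiled with -use_fast_math, sinf() is itself replaced by the intrinsic, so the reported difference should collapse to zero.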

njuffa · February 19, 2013, 8:24pm

Re (3): I think a better way of looking at this is that the device intrinsics __log2f(), __sinf(), __cosf() expose the instructions implemented by the special function unit :-) The HW implementation is based on quadratic interpolation in ROM tables using fixed-point arithmetic, as described in the following paper:

Stuart F. Oberman and Michael Siu. A high-performance area-efficient multifunction interpolator. In Proceedings of the 17th IEEE Symposium on Computer Arithmetic (Cape Cod, USA), pages 272–279, July 2005.

Re (4): I am not aware of any functional differences between the Fermi and Kepler special function units. Side remark: The special-function instructions actually show up in disassembled SASS code as MUFU.{LG|EX2|SIN|COS|RCP|RSQ} for sm_20 and up. I assume MUFU stands for “multi-function unit”. The relative throughput of these instructions was improved on Kepler compared to Fermi.

cmaster.matso · February 20, 2013, 2:55pm

And what about the square-root function? Is it performed by the SFU for both the intrinsic and the non-intrinsic version, or is the non-intrinsic one ‘done elsewhere’?

MK

njuffa · February 20, 2013, 4:24pm

If you look at the PTX, there is sqrt.approx.f32 and sqrt.rn.f32. The former is an approximate single-precision square root implemented via MUFU.RSQ and MUFU.RCP, while the latter is a single-precision square root with IEEE-754 rounding to nearest-or-even which maps to a sequence of quite a few instructions, one of which is MUFU.RSQ. By disassembling code that contains a call to sqrtf() with cuobjdump --dump-sass, you can easily check this yourself.

On sm_1x, sqrtf() always maps to sqrt.approx.f32. On newer platforms, sqrtf() maps to sqrt.rn.f32 by default, but maps to sqrt.approx.f32 if -prec-sqrt=false or -use_fast_math is passed on the nvcc command line. To get an IEEE-754 rounded single-precision square root on sm_1x, one has to use the intrinsic __fsqrt_rn(), which maps to fairly slow emulation code.
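For anyone who wants to try the disassembly check described above, a minimal sketch follows. The kernel and the helper sqrt_variants_on_device are hypothetical names, and the file name sqrt.cu and the architecture sm_35 in the comments are assumptions; substitute whatever your toolkit and GPU support.

```cuda
#include <cuda_runtime.h>

// To inspect the generated code (file name and sm_35 are assumptions):
//   nvcc -arch=sm_35 -ptx sqrt.cu                       (look for sqrt.rn.f32)
//   nvcc -arch=sm_35 -prec-sqrt=false -ptx sqrt.cu      (look for sqrt.approx.f32)
//   nvcc -arch=sm_35 -cubin -o sqrt.cubin sqrt.cu
//   cuobjdump --dump-sass sqrt.cubin                    (look for MUFU.RSQ)

// x is a kernel argument so the compiler cannot fold the square roots away.
__global__ void sqrt_variants(float x, float* out)
{
    out[0] = sqrtf(x);       // mapping depends on -prec-sqrt / -use_fast_math
    out[1] = __fsqrt_rn(x);  // explicitly requests round-to-nearest-even
}

// Hypothetical helper: runs the kernel for one input and copies both results
// back to the host. Returns false on any CUDA error.
bool sqrt_variants_on_device(float x, float result[2])
{
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;
    sqrt_variants<<<1, 1>>>(x, d_out);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(result, d_out, 2 * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

With default flags on sm_2x and up, both outputs should be identical; with -prec-sqrt=false the first one may differ in the last bit or so for some inputs.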

rjl · February 20, 2013, 6:11pm

Can you clarify which intrinsics operate in one clock on the SFU? For simplicity, answer for sm_2x and above.

__fsqrt_rd()
__fsqrt_rn()
__fsqrt_ru()
__fsqrt_rz()

Also, how many clocks does __powf() take?

Lastly (and now I know I am wrong), I had thought that all single precision intrinsics listed here were one cycle.

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__MATH__INTRINSIC__SINGLE.html (I realize this is 4.2, but it is where Google takes me; I care about 5.0).

How can I know which intrinsics operate in one clock?

njuffa · February 20, 2013, 9:23pm

The fact that a function is provided as an intrinsic (with leading double underscore, only available in device code) does not imply anything in particular about performance. The performance of single-precision intrinsics can also vary with compilation mode, in particular -ftz={true|false}. I would suggest measuring the throughput of those functions you care about, on a relevant GPU with relevant compiler switches. I have not had the need to perform such measurements for any app optimization work.
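A rough sketch of such a measurement, for those who want a starting point. Everything here (the functor names, the four-chain loop, the helper measure_calls_per_second) is an assumption made for illustration, not an official benchmark; the numbers it produces depend on the GPU, the grid size, the input values, and compiler switches such as -ftz and -use_fast_math.

```cuda
#include <cuda_runtime.h>

// Functors wrapping the operation under test; swap in the one you care about.
struct FastSin    { __device__ float operator()(float x) const { return __sinf(x); } };
struct PreciseSin { __device__ float operator()(float x) const { return sinf(x); } };

template <typename Op>
__global__ void chain(float* out, float seed, int iters)
{
    Op op;
    // Four independent chains per thread, so the loop is not purely latency-bound.
    float a = seed + threadIdx.x, b = a + 0.25f, c = a + 0.5f, d = a + 0.75f;
    for (int i = 0; i < iters; ++i) {
        a = op(a); b = op(b); c = op(c); d = op(d);
    }
    // Store the result so the compiler cannot remove the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

// Hypothetical helper: function calls per second for Op, or -1.0 on error.
template <typename Op>
double measure_calls_per_second(int blocks, int threads, int iters)
{
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(float) * blocks * threads) != cudaSuccess)
        return -1.0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    chain<Op><<<blocks, threads>>>(d_out, 0.1f, iters);  // warm-up launch
    cudaEventRecord(start);
    chain<Op><<<blocks, threads>>>(d_out, 0.1f, iters);  // timed launch
    cudaEventRecord(stop);
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0;
    return 4.0 * iters * (double)blocks * threads / (ms * 1e-3);
}
```

For example, compare measure_calls_per_second<FastSin>(1024, 256, 10000) against the same call with PreciseSin, using enough blocks to fill the GPU. The loop also contains a little overhead of its own, so treat the result as an estimate rather than an exact instruction throughput.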

JFSebastian · February 21, 2013, 10:23pm

Thank you very much to njuffa for the answers and for the very interesting paper, which helped me get a better picture of how SFUs work. As far as I understand, the HW calculating the intrinsic functions basically implements an algorithm that approximates those functions by a quadratic polynomial. The coefficients of such a polynomial are determined by a minimax optimization, which I think amounts to approximating the function by a second-order Chebyshev polynomial (which is the solution of a minimax problem).
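Written out, the scheme as I read it from the paper looks roughly like this. The symbols X_u, X_l, m and the interval bound are my own notation, assuming an input with m bits used as the table index; this is a sketch of the idea, not the exact hardware formulation.

```latex
% The input is split into upper bits X_u (table index) and lower bits X_l (offset).
X = X_u + X_l, \qquad
f(X) \approx C_0(X_u) + C_1(X_u)\,X_l + C_2(X_u)\,X_l^{2}

% For each table entry, the coefficients minimize the worst-case error
% over the corresponding subinterval (minimax criterion).
(C_0, C_1, C_2) = \arg\min_{c_0, c_1, c_2}\;
  \max_{0 \le X_l < 2^{-m}}
  \left| f(X_u + X_l) - \left( c_0 + c_1 X_l + c_2 X_l^{2} \right) \right|
```

One caveat on the Chebyshev remark: a truncated Chebyshev expansion is usually very close to the minimax polynomial, but in general it is not identical to it.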

I have another couple of questions:

Could you recommend any reference describing how the single/double precision (non-intrinsic) transcendental functions are calculated by CUDA?
In many engineering applications, a commonly used function is the sinc function (sin(x)/x), which is the product of sin(x) and the reciprocal of x. Of course, there are also other functions (e.g., sinh(x)/x, Hamming, Hanning) representative of filters, etc. It would be interesting to have a fast way for developers to calculate such combined functions directly, rather than calculating each component function separately. I guess that the calculation strategy should adapt to the function characteristics. Could you recommend any reference or book giving guidelines on this topic?

Thank you again.

Lastly, concerning rjl’s comment: the white paper “NVIDIA’s Fermi: The First Complete GPU Computing Architecture” by Peter N. Glaskowsky (http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA’s_Fermi-The_First_Complete_GPU_Architecture.pdf), page 21, states for the Fermi architecture: “A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs” (see also Fig. 7). Perhaps this adds a piece of information to answer your question.

njuffa · February 21, 2013, 11:23pm

Please note that CUDA intrinsics rarely map to a single SFU/MUFU instruction, but usually map to sequences of multiple SFU and non-SFU instructions. Different GPUs have different throughputs for the various operations involved, so if one needs to know the throughput of a particular intrinsic on a particular GPU, it would be best to simply measure it.

The core approximations used for the transcendental functions in the CUDA math library are pretty much all straightforward polynomial minimax approximations. These approximations were generated with the Remez algorithm, ready-to-use versions of which are provided by software like Mathematica and Maple. The argument reductions usually follow standard approaches; the references for any special techniques used are noted in comments inside the header files math_functions.h (single precision) and math_functions_dbl_ptx3.h (double precision) that are part of the CUDA distribution.

As a general starting point for floating-point computations, I usually recommend Muller et al., “Handbook of Floating-Point Arithmetic”: http://perso.ens-lyon.fr/jean-michel.muller/Handbook.html

Regarding books useful for the development of one’s own transcendental function implementations, I gave a short overview in the following thread on Stack Overflow:
http://stackoverflow.com/questions/99620/books-on-the-algorithims-needed-for-calculating-trancendental-functions/7464239#7464239
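As an illustration of adapting the strategy to the function, here is a minimal sketch for the unnormalized sinc(x) = sin(x)/x mentioned in the question. The function names, the helper sinc_on_device, and the switch-over threshold of 0.125 are assumptions chosen for illustration, not tuned or taken from the CUDA math library. The fast variant uses a short Taylor series near zero, because the documented error bound of __sinf() is absolute, so dividing by a tiny x could magnify it.

```cuda
#include <cuda_runtime.h>

// Reference version built on the math library; the singularity at 0 is removable.
__device__ float sinc_precise(float x)
{
    return (x == 0.0f) ? 1.0f : sinf(x) / x;
}

// Fast version built on intrinsics. Intended for moderate |x|: __sinf() loses
// accuracy for large arguments. The threshold 0.125 is an illustrative choice.
__device__ float sinc_fast(float x)
{
    if (fabsf(x) < 0.125f) {
        // Taylor series: 1 - x^2/6 + x^4/120, evaluated in Horner form.
        float x2 = x * x;
        return 1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f));
    }
    return __fdividef(__sinf(x), x);  // approximate sine and approximate division
}

// x is a kernel argument so the compiler cannot precompute the results.
__global__ void sinc_kernel(float x, float* out)
{
    out[0] = sinc_precise(x);
    out[1] = sinc_fast(x);
}

// Hypothetical helper: evaluates both versions at x on the device.
// Returns false on any CUDA error.
bool sinc_on_device(float x, float result[2])
{
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;
    sinc_kernel<<<1, 1>>>(x, d_out);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(result, d_out, 2 * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

Whether the fast variant is actually faster, and accurate enough, for a given application is something to measure on the target GPU. A dedicated minimax polynomial for sinc over the interval of interest (generated with the Remez algorithm, as described above) would be the next step if this is not sufficient.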

SPWorley · June 22, 2013, 4:32am

Are the SIMD-in-a-word Video Instructions performed by the SFU? By the FP64 cores? Maybe even the LD/ST unit? Or some other undocumented unit?

Apparently they are not executed by the main FP32 cores, since their use does not impact integer addition throughput. This was discussed in the great GTC talk on accelerating Smith-Waterman matching.

I’m curious because I keep experimenting to optimize integer throughput, and more understanding of the hardware is always helpful!
