Looking for information about fmaf?

I've tried to find information on the fmaf math function, but it seems the CUDA Toolkit documentation doesn't explain how to use it; it gives only the name of the function and its parameters. What does each parameter mean? Pretty strange, no?

Maybe this is a basic question, but the information is not easy to find in the documentation. Any ideas?

It’s actually a standard C math function representing a fused multiply-add for single-precision floating-point numbers. You can find a man page for it on almost any Mac or Linux system, and Google can also find you a man page on the web.

fmaf() is one of the C99 standard math functions: a single-precision fused multiply-add. The CUDA documentation does not cover the standard C99 math functions at this time. Online man pages for these functions can be located with an internet search engine.

fmaf(a,b,c) computes a*b+c with a single rounding, i.e. the unrounded, double-wide product of a and b participates in the addition with c, and the result of the addition is rounded according to the IEEE rounding mode round-to-nearest-or-even.
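
To make the single-rounding point concrete, here is a minimal sketch (the kernel and variable names are just made up for illustration): the fused form rounds once, while the separate multiply and add may each round their result.

    __global__ void fma_demo(const float *a, const float *b, const float *c,
                             float *fused, float *separate, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fused[i]    = fmaf(a[i], b[i], c[i]);  // a*b+c rounded exactly once
            separate[i] = a[i] * b[i] + c[i];      // multiply rounds, then the add rounds
                                                   // (unless the compiler fuses it)
        }
    }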

CUDA also offers device functions (i.e. intrinsics) that apply one of the four IEEE-754 rounding modes to the single-precision fused multiply-add operation. They are: __fmaf_rn(), __fmaf_rz(), __fmaf_ru(), __fmaf_rd().
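
As one illustration of why the directed-rounding variants are useful (this device function is hypothetical, not from the CUDA headers): rounding toward minus and plus infinity brackets the exact value of a*b+c, which is the basic building block of interval arithmetic.

    __device__ void fmaf_bounds(float a, float b, float c, float *lo, float *hi)
    {
        *lo = __fmaf_rd(a, b, c);  // round toward -infinity: lower bound on the exact a*b+c
        *hi = __fmaf_ru(a, b, c);  // round toward +infinity: upper bound on the exact a*b+c
    }

__fmaf_rn() matches fmaf() (round to nearest even), and __fmaf_rz() rounds toward zero.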

For sm_1x platforms, fmaf() and the corresponding device functions are implemented via software emulation. For sm_2x they are supported natively by the hardware.

OK, thanks guys for your help, I appreciate it.

So I understand that in my device code, on sm_1x, fmaf() uses emulation, so it's going to be slower than just doing x += a*b; (that's what my results show: it's 3 times slower).

But on sm_2x, is it going to be faster to use fmaf() or __fmaf_rn() than just doing x += a*b;?

Am I right about this?

The software emulation for fmaf() on sm_1x platforms is quite a bit slower than the code generated for a*b+c. On sm_2x both idioms have the same speed, provided a*b+c gets optimized by the compiler into an FFMA (single-precision fused multiply-add) instruction. This happens frequently, but not always. If you need to be sure (for example, if your algorithm depends on the numerical properties of a fused multiply-add), call fmaf() or the equivalent device function __fmaf_rn() directly wherever the presence of an FMA is required.
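
A classic example of depending on those numerical properties (sketched here with a made-up function name): recovering the exact rounding error of a product, which only works when a true FMA instruction is issued.

    __device__ void two_prod(float a, float b, float *prod, float *err)
    {
        *prod = a * b;               // rounded single-precision product
        *err  = fmaf(a, b, -*prod);  // exact residual a*b - *prod, thanks to the single rounding
    }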

Thanks.

It’s all clear now.
