use of __frcp_rz()

CudaaduC · November 20, 2014, 12:27am

This may be a naive question, and I did Google it, but I found no resource that answered this specific question:

If I have a situation such as this:

a/b + c

would there be any advantages or disadvantages to doing something like this:

fmaf(a,__frcp_rz(b),c)

Would the accuracy be any worse for this implementation compared to the first? Assume I have the --use_fast_math flag set, so the division would be __fdividef() anyway.

njuffa · November 20, 2014, 1:10am

Without context, it’s difficult to say whether any particular change represents an advantage. I will discuss this in terms of trade-offs.

__frcp_rz() is a single-precision reciprocal with IEEE-754 compliant rounding, using the rounding mode “round towards zero”, in other words, truncation. Since this is a directed rounding mode, the result can differ by close to 1 ulp from the infinitely precise result. This intrinsic maps to the PTX instruction rcp.rz{.ftz}.f32.

Compare this to __frcp_rn(), which is a single-precision reciprocal that uses the IEEE-754 “round to nearest or even” rounding mode and guarantees the result is within 0.5 ulp of the infinitely precise result. In the compiler’s default compilation mode, a single-precision reciprocal computation “1.0f / float_var” will be turned into the same PTX instruction as __frcp_rn(), namely rcp.rn{.ftz}.f32.
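To make the difference between the two rounding modes concrete, here is a minimal sketch that checks the properties described above on a range of positive inputs: the truncated reciprocal never exceeds the true value, it is either identical to the round-to-nearest result or exactly one machine number below it, and in the default compilation mode “1.0f / x” matches __frcp_rn(x) bit for bit. The kernel name, the helper names, and the input range are my own choices for illustration, and the last check assumes the code is built without --use_fast_math.

```cuda
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

// Hypothetical test kernel: computes the reciprocal of each input three ways.
__global__ void reciprocals(const float *x, float *rz, float *rn, float *dv, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        rz[i] = __frcp_rz(x[i]);   // rounded toward zero (truncated)
        rn[i] = __frcp_rn(x[i]);   // rounded to nearest-or-even
        dv[i] = 1.0f / x[i];       // whatever the compile flags map this to
    }
}

// Reinterpret a float as an int. For positive floats, neighbouring machine
// numbers have neighbouring integer encodings.
static int float_bits(float f)
{
    int i;
    memcpy(&i, &f, sizeof i);
    return i;
}

// Returns the number of inputs violating the rounding properties,
// or -1 if a CUDA call fails.
int check_reciprocals(void)
{
    const int n = 1 << 16;
    float *buf;
    if (cudaMallocManaged(&buf, 4 * n * sizeof(float)) != cudaSuccess) return -1;
    float *x = buf, *rz = buf + n, *rn = buf + 2 * n, *dv = buf + 3 * n;

    // Positive, normal inputs (arbitrary range chosen for this sketch).
    for (int i = 0; i < n; i++) x[i] = 1.0f + 1.2345e-4f * (float)i;

    reciprocals<<<(n + 255) / 256, 256>>>(x, rz, rn, dv, n);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(buf); return -1; }

    int violations = 0, div_mismatch = 0;
    for (int i = 0; i < n; i++) {
        double ref = 1.0 / (double)x[i];            // double-precision reference
        int gap = float_bits(rn[i]) - float_bits(rz[i]);
        if ((double)rz[i] > ref) violations++;      // truncation must not overshoot
        if (gap != 0 && gap != 1) violations++;     // rn is rz or its upper neighbour
        if (float_bits(dv[i]) != float_bits(rn[i])) div_mismatch++;
    }
    printf("rounding violations: %d, 1.0f/x != __frcp_rn(x): %d\n",
           violations, div_mismatch);
    cudaFree(buf);
    return violations;
}
```

Call check_reciprocals() from main() and compile with nvcc. Building once with default flags and once with --use_fast_math should show how the mismatch count for the plain division depends on the compilation mode, while the two intrinsics keep their specified rounding.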

Both rcp.rn{.ftz}.f32 and rcp.rz{.ftz}.f32 are implemented as emulation routines at the machine code (SASS) level rather than as single hardware instructions. Since you compile with --use_fast_math, which includes -ftz=true, the .ftz version of these PTX instructions will be generated.

When you compile with --use_fast_math, the single-precision reciprocal computation “1.0f / float_var” is mapped to the PTX instruction rcp.approx.ftz.f32. This maps to a single machine instruction and will therefore be much faster than __frcp_rz(). In terms of error, rcp.approx.ftz.f32 has about +/- 1 ulp error versus the infinitely precise result; I don’t know the exact value off the top of my head. So the magnitude of the error is similar to __frcp_rz(), but the error in __frcp_rz() is always toward the next smaller machine number (in magnitude), while the error in the approximate reciprocal is “random” towards either the next larger or smaller machine number. It may have some bias based on the hardware’s fixed-point table interpolation scheme used to generate the result, but I have never studied that in detail.
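If you want to see how these trade-offs play out for your own data, a simple approach is to compute the expression several ways and compare each result against a double-precision reference. The following is only a sketch under my own assumptions: the kernel and helper names are made up, the inputs are arbitrary positive values (so there is no cancellation in the addition), and the error is reported in ulps of the reference result.

```cuda
#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: evaluates a/b + c in four different ways.
// Results for variant v are stored at out[v * n + i].
__global__ void variants(const float *a, const float *b, const float *c,
                         float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[0 * n + i] = a[i] / b[i] + c[i];                 // division as mapped by the compile flags
        out[1 * n + i] = __fdividef(a[i], b[i]) + c[i];      // explicit fast division
        out[2 * n + i] = fmaf(a[i], __frcp_rz(b[i]), c[i]);  // truncated reciprocal + FMA
        out[3 * n + i] = fmaf(a[i], __frcp_rn(b[i]), c[i]);  // nearest reciprocal + FMA
    }
}

// Error of a float result in units of the float spacing at the reference value.
static double ulp_error(float result, double ref)
{
    int e;
    frexp(ref, &e);                         // |ref| lies in [2^(e-1), 2^e)
    return fabs((double)result - ref) / ldexp(1.0, e - 24);
}

// Fills max_err[0..3] with the maximum observed error of each variant.
// Returns 0 on success, -1 if a CUDA call fails.
int compare_variants(double max_err[4])
{
    const int n = 1 << 14;
    float *buf;
    if (cudaMallocManaged(&buf, 7 * n * sizeof(float)) != cudaSuccess) return -1;
    float *a = buf, *b = buf + n, *c = buf + 2 * n, *out = buf + 3 * n;

    // Arbitrary positive test data chosen for this sketch.
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f + 0.37f * (float)(i % 1000);
        b[i] = 1.0f + 0.011f * (float)i;
        c[i] = 0.5f + 0.003f * (float)i;
    }

    variants<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(buf); return -1; }

    const char *name[4] = { "a / b + c", "__fdividef(a,b) + c",
                            "fmaf(a,__frcp_rz(b),c)", "fmaf(a,__frcp_rn(b),c)" };
    for (int v = 0; v < 4; v++) {
        max_err[v] = 0.0;
        for (int i = 0; i < n; i++) {
            double ref = (double)a[i] / (double)b[i] + (double)c[i];
            max_err[v] = fmax(max_err[v], ulp_error(out[v * n + i], ref));
        }
        printf("%-24s max error %.3f ulp\n", name[v], max_err[v]);
    }
    cudaFree(buf);
    return 0;
}
```

Call compare_variants() from main() with a four-element array. This measures accuracy only; to judge the performance side of the trade-off you would need to time the variants in the context of your actual kernel, and inputs with mixed signs can show much larger errors in all variants because of cancellation in the addition.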