What is the algorithmic principle of __frcp_ru function?

For code that:
global void test(float * dst, float* src) {
unsigned tid = threadIdx.x;
dst[tid] = __frcp_ru(src[tid]);
}
It’s sass is:

What is this principle:ru(1/x) = ru(ftz(-ru(1/x * x - 1) - 0) * 1/x + 1/x)

Leave out the ru and ftz and -0 for now.
And then assume that the reciprocal function has a small error.

It returns (1/x + epsilon). You will find out, that the shown principle manages to reduce the error from epsilon to epsilon² * x, which for a small epsilon (compared to x) is a more precise result.

Perhaps @njuffa can tell even more.

1 Like

I do not recognize this particular code. I last worked on these intrinsics around 2008, and their implementation may well have changed. The snippet shown is definitely not the full code of __frcp_ru(), but is likely part of its fast path. The mathematical principle behind it is the Newton iteration for the reciprocal of A:

hn=1-Axn
xn+1=xn+xnhn

The first FMA in the snippet computes the error term, the second the refined reciprocal. The FADD in the snippet is a SASS idiom for negation, the addition of -0 is needed to properly handle negative zero, since (-0) + (+0) = (+0) according to IEEE-754 semantics. I am not sure why the negation is split out as a separate operation, given that FFMA supports negation of the product, e.g. FFMA R4, -R3, R2, 1. It may be needed to handle certain corner cases correctly. The .FTZ (flush to zero) suffix is a solid indication that this code path is only used for normal floating-point operands. Computation involving sub-normal operands is handled in the slow path.

1 Like

Thanks very much~

Can you tell me more about how this slow path works?I want to know how to handle sub-normal number and non-number

I do not recall these details. For special operands, refer to the IEEE-754 specifications of how they need to be handled. Beyond that, you could translate the machine instructions from the disassembly back into C code and try to understand things from that.