Is __frsqrt_rn not accurate to 0.5 ulp? I found a counterexample

Here the input float is given by its 32-bit raw bit pattern.

On an NVIDIA GPU: __frsqrt_rn(0x403a18e3) = 0x3f16209f.

Using a much higher-precision computation as a reference, we found that rsqrt(0x403a18e3) should round to 0x3f16209e.

I checked the SASS of __frsqrt_rn. It is quite simple, based on Newton iteration (or Goldschmidt). I believe that for division and square root there are papers proving the final iteration can reach a 0.5 ulp (correctly rounded) result, but for rsqrt it cannot; it may only reach 0.500000…1 ulp?

The papers are:

Correctness Proofs Outline for Newton-Raphson Based Floating-Point Divide and Square Root Algorithms

High-level algorithms for correctly-rounded reciprocal square roots

In a quick check using a higher-precision computation I got the following reference result. This appears to be a hard-to-round case, almost exactly halfway between two adjacent representable IEEE-754 binary32 numbers.

tmp=    0x9.6209e7ffffff4bc0p-4
ref=  5.86435199e-01   0x1.2c413cp-1   3f16209e

Let me now check the result generated by CUDA. FWIW, there is definitely a way to generate correctly rounded reciprocal square roots; a relevant paper appeared not long ago, by C.F. Borges. I think this may be your second reference.

If __frsqrt_rn() does have a bug here, that would probably be my fault, unless NVIDIA engineers have since modified the code. The unit test for this function, which I also created, would likewise have to be reviewed: single-argument float functions are generally tested exhaustively against a reference, so test escapes should not exist.

As I recall, I made use of the following publication when I created the original implementation: C. Iordache and D.W. Matula, “On infinitely precise rounding for division, square root, reciprocal and square root reciprocal.” In 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 233-240.

I am not able to reproduce the CUDA result reported above. I first tried CUDA 9.2 on Windows 7, and then CUDA 13.1 on Linux via Compiler Explorer. The results match and are as follows:

arg= 2.90776896e+00    0x1.7431c6p+1  403a18e3
res= 5.86435199e-01    0x1.2c413cp-1  3f16209e

See below for the program code, or visit the link to Compiler Explorer above.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (uint32_t argi)
{
    float arg = __int_as_float (argi);   // reinterpret the raw bits as a float
    printf ("arg=%15.8e  %15.6a  %08x\n", arg, arg, argi);
    float res = __frsqrt_rn (arg);       // reciprocal square root, round to nearest even
    uint32_t resi = __float_as_int (res);
    printf ("res=%15.8e  %15.6a  %08x\n", res, res, resi);
}

int main (void)
{
    uint32_t argi = 0x403a18e3;
    kernel<<<1,1>>>(argi);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

The paper by Borges I mentioned is

Carlos F. Borges, Claude-Pierre Jeannerod, and Jean-Michel Muller, “High-level algorithms for correctly-rounded reciprocal square roots.” In 29th IEEE Symposium on Computer Arithmetic, 2022, pp. 18-25.

Full disclosure: I briefly communicated with the lead author when this paper first became available as an ArXiv preprint.

It would be great if you could post a repro case with your source code, compiler flags, and the CUDA version plus the hardware used. Maybe there is a problem, but it only shows up with certain combinations of the above.