On an NVIDIA GPU: __frsqrt_rn(0x403a18e3) = 0x3f16209f.
However, using a much higher-precision computation as the reference, we found that rsqrt(0x403a18e3) should round to 0x3f16209e.
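A reference like this can be cross-checked without any floating-point arithmetic. The following is only a minimal sketch in Python using exact rationals, not the reference computation used for the post above. The helper names `binary32_value` and `rsqrt_rn_between` are made up for illustration, and the sketch assumes positive, normal binary32 inputs and a caller-supplied lower candidate.

```python
from fractions import Fraction

def binary32_value(bits):
    """Exact value of a positive, normal binary32 bit pattern (hypothetical helper)."""
    biased_exponent = (bits >> 23) & 0xFF
    assert bits >> 31 == 0 and 0 < biased_exponent < 255, "positive normals only"
    significand = (1 << 23) | (bits & 0x7FFFFF)   # restore the hidden bit
    return significand * Fraction(2) ** (biased_exponent - 150)

def rsqrt_rn_between(x_bits, lo_bits):
    """Choose the round-to-nearest-even result among lo_bits and its successor."""
    x = binary32_value(x_bits)
    lo = binary32_value(lo_bits)
    hi = binary32_value(lo_bits + 1)
    # For positive y: y <= 1/sqrt(x) exactly when y*y*x <= 1, so no square root is needed.
    assert lo * lo * x <= 1 <= hi * hi * x, "candidates do not bracket rsqrt(x)"
    mid = (lo + hi) / 2
    residual = mid * mid * x - 1   # its sign says on which side of mid rsqrt(x) lies
    if residual > 0:               # midpoint is too large, so rsqrt(x) is closer to lo
        return lo_bits
    if residual < 0:               # midpoint is too small, so rsqrt(x) is closer to hi
        return lo_bits + 1
    return lo_bits if lo_bits % 2 == 0 else lo_bits + 1   # exact tie: even significand

print(hex(rsqrt_rn_between(0x403A18E3, 0x3F16209E)))   # 0x3f16209e
```

Comparing `mid * mid * x` against 1 avoids computing a square root at all, so the decision is exact however close the true value is to the half-way point.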
I checked the SASS of __frsqrt_rn. It is quite simple, based on Newton iteration (or Goldschmidt). I believe that for division and square root there are papers proving that the final iteration yields a correctly rounded result, i.e. an error of at most 0.5 ulp. For rsqrt this may not hold: could the final iteration end up at 0.500000…1 ulp?
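For readers unfamiliar with the iteration, here is a minimal sketch of the textbook Newton step for the reciprocal square root, y <- y * (3 - x*y*y) / 2. This is not the SASS sequence of __frsqrt_rn: it runs in exact rational arithmetic in Python instead of binary32 with FMA, and the function name `newton_rsqrt` and the starting guess 1/2 are assumptions chosen for illustration.

```python
from fractions import Fraction

def newton_rsqrt(x, y0, steps):
    """Iterate y <- y * (3 - x*y*y) / 2 exactly; return (y, |1 - x*y*y|) per step."""
    y = Fraction(y0)
    history = []
    for _ in range(steps):
        y = y * (3 - x * y * y) / 2
        history.append((y, abs(1 - x * y * y)))   # residual roughly squares each step
    return history

# Exact value of the binary32 input 0x403a18e3: significand 0xba18e3 times 2**-22.
x = Fraction(0x800000 | 0x3A18E3, 1 << 22)
history = newton_rsqrt(x, Fraction(1, 2), 6)
for step, (_, residual) in enumerate(history, start=1):
    print(step, float(residual))

# Round the converged value to binary32. The fixed exponent field 126 is only
# valid because this particular result lies in [0.5, 1).
significand = round(history[-1][0] * 2**24)
bits = (126 << 23) | (significand - (1 << 23))
print(hex(bits))   # 0x3f16209e
```

In exact arithmetic the iteration converges quadratically, so the rounding question in the post is not about convergence. It is about how much rounding error the last finite-precision steps are allowed to introduce before the result crosses the midpoint between two binary32 neighbors.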
The papers are:
Correctness Proofs Outline for Newton-Raphson Based Floating-Point Divide and Square Root Algorithms
High-level algorithms for correctly-rounded reciprocal square roots
In a quick check using higher-precision computation I got the following reference result. This seems to be a hard-to-round case, almost exactly half-way between the nearest representable IEEE-754 binary32 numbers.
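One way to quantify how close to half-way such a case is, shown here only as a sketch and not as the computation behind the reference result mentioned above: compute 1/sqrt(x) to many bits with an integer square root and look at the bits below the binary32 ulp. The function name `rsqrt_fraction_of_ulp` and the default of 160 bits are assumptions for illustration, and only positive, normal inputs are handled.

```python
from fractions import Fraction
from math import isqrt

def rsqrt_fraction_of_ulp(x_bits, precision=160):
    """Position of 1/sqrt(x) between its two binary32 neighbors (hypothetical helper).

    0 means exactly on the lower neighbor, 1/2 means exactly half-way.
    The value is truncated, so it can be low by a negligible amount.
    """
    biased_exponent = (x_bits >> 23) & 0xFF
    assert x_bits >> 31 == 0 and 0 < biased_exponent < 255, "positive normals only"
    significand = (1 << 23) | (x_bits & 0x7FFFFF)
    exponent = biased_exponent - 150          # x == significand * 2**exponent
    # floor(2**precision / sqrt(x)) == isqrt(floor(2**(2*precision) / x))
    scaled = isqrt((1 << (2 * precision - exponent)) // significand)
    tail_bits = scaled.bit_length() - 24      # number of bits below the binary32 ulp
    return Fraction(scaled & ((1 << tail_bits) - 1), 1 << tail_bits)

position = rsqrt_fraction_of_ulp(0x403A18E3)
print(float(position))                     # just under 0.5
print(float(Fraction(1, 2) - position))    # distance from half-way in ulps, about 2.6e-09
```

A margin this small means that an approximation which is too large by only a few parts in 10^16 already rounds to the upper neighbor, which is why a double-precision reference is borderline for this input.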
Let me now check the result generated by CUDA. FWIW, there is definitely a way to generate correctly rounded reciprocal square roots; a relevant paper appeared not long ago, by C.F. Borges. I think this may be your second reference.
If __frsqrt_rn() does have a bug here, that would probably be my fault unless NVIDIA engineers have since modified the code. I also created the unit test for this function, which would likewise have to be reviewed: in general, all single-argument float functions are tested exhaustively, so test escapes should not exist.
As I recall, I made use of the following publication when I created the original implementation: C. Iordache and D.W. Matula, “On infinitely precise rounding for division, square root, reciprocal and square root reciprocal.” In 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 233-240.
I am not able to reproduce the CUDA result reported above. I first tried CUDA 9.2 on Windows 7, and then CUDA 13.1 on Linux at Compiler Explorer. The results match and are as follows:
Carlos F. Borges, Claude-Pierre Jeannerod, and Jean-Michel Muller, “High-level algorithms for correctly-rounded reciprocal square roots.” In 29th IEEE Symposium on Computer Arithmetic, 2022, pp. 18-25.
Full disclosure: I briefly communicated with the lead author when this paper first became available as an arXiv preprint.
It would be great if you could post a repro-case with your source code, compiler flags and CUDA version + hardware used. Maybe there is a problem, but it only shows up with certain combinations of the above.