__frsqrt_rn is not accurate 0.5ulp? I found a number

975337104 · December 10, 2025, 7:06am

Here input float is its 32-bit raw data.

On nvida gpu:__frsqrt_rn(0x403a18e3) = 0x3f16209f.

While we use much higher precision as reference and found rsqrt(0x403a18e3) should be rounded to 0x3f16209e.

I checked the sass of __frsqrt_rn. It is quiet simple, based on newton iteration(or Goldschmidt). I think that for div and sqrt, there are papers to prove that the final iteration can iterate to 0.5ulp result. But rsqrt can not, it may iterate to 0.500000…1ulp?

The papers are:

Correctness Proofs Outline for Newton-Raphson Based Floating-Point Divide and Square Root Algorithms

High-level algorithms for correctly-rounded reciprocal square roots

njuffa · December 10, 2025, 7:33am

In a quick check using higher-precision computation I got the following reference result. This seems to be a hard-to-round case, almost exactly half-way between the nearest representable IEEE-754 binary32 numbers.

tmp=    0x9.6209e7ffffff4bc0p-4
ref=  5.86435199e-01   0x1.2c413cp-1   3f16209e

Let me now check the result generated by CUDA. FWIW, there is definitely a way to generate correctly rounded reciprocal square roots; a relevant paper appeared not long ago, by C.F. Borges. I think this may be your second reference.

If __frsqrt_rn() does have a bug here, that would probably be my fault unless NVIDIA engineers have since modified the code. I also created the unit test for this function, which likewise would have to be reviewed, since in general, single-input argument float functions are all exhaustively tested, so test escapes should not exist.

As I recall, I made use of the following publication when I created the original implementation: C. Iordache and D.W. Matula, “On infinitely precise rounding for division, square root, reciprocal and square root reciprocal.” In 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 233-240.

njuffa · December 10, 2025, 7:49am

I am not able to reproduce the CUDA result reported above. I first tried CUDA 9.2 on Windows 7, and then CUDA 13.1 on Linux at Compiler Explorer. The results match and are as follows:

arg= 2.90776896e+00    0x1.7431c6p+1  403a18e3
res= 5.86435199e-01    0x1.2c413cp-1  3f16209e

See below for the program code, or visit the link to Compiler Explorer above.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (uint32_t argi)
{
    float arg = __int_as_float (argi);
    printf ("arg=%15.8e  %15.6a  %08x\n", arg, arg, argi);
    float res = __frsqrt_rn (arg);
    uint32_t resi = __float_as_int (res);
    printf ("res=%15.8e  %15.6a  %08x\n", res, res, resi);
}

int main (void)
{
    uint32_t argi = 0x403a18e3;
    kernel<<<1,1>>>(argi);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

njuffa · December 10, 2025, 8:01am

The paper by Borges I mentioned is

Carlos F. Borges, Claude-Pierre Jeannerod, and Jean-Michel Muller, “High-level algorithms for correctly-rounded reciprocal square roots.” In 29th IEEE Symposium on Computer Arithmetic, 2022, pp. 18-25.

Full disclosure: I briefly communicated with the lead author when this paper first became available as an ArXiv preprint.

cbuchner1 · December 10, 2025, 12:15pm

It would be great if you could post a repro-case with your source code, compiler flags and CUDA version + hardware used. Maybe there is a problem, but it only shows up with certain combinations of the above.

Topic		Replies	Views
Correctly rounded rsqrt in double precision? CUDA Programming and Performance	10	2965	October 15, 2013
Performance tweak for single-precision square root CUDA Programming and Performance	0	1035	March 25, 2021
Help understanding sqrt functions in CUDA CUDA Programming and Performance	2	5109	May 11, 2012
Implementation of sqrt CUDA Programming and Performance	3	11271	November 8, 2007
is reciprocal square root directly accessible? CUDA Programming and Performance	1	1436	September 8, 2009
Integer square root CUDA Programming and Performance	3	1713	February 18, 2025
Weekend project: Exploring the feasibility of replacing MUFU.RCP and MUFU. RSQ CUDA Programming and Performance	0	145	October 1, 2025
CUDA innacuracy? CUDA float produces different result from CPU float CUDA Programming and Performance	8	3160	September 9, 2011
sqrt precision CUDA Programming and Performance	3	19581	September 9, 2011
Huge instruction stream for reciprocal on CC 2.0 reciprocal operation side effect? CUDA Programming and Performance	8	5557	September 13, 2011

__frsqrt_rn is not accurate 0.5ulp? I found a number

Related topics