sqrt precision

Recently, I tried to use the cuda sqrt function to get the square root value of an integer of long long type

and I am disappointed in its precision.

for example:

the cpu version
The square root of 625 is 25
The square root of 10616005156 is 103034
The square root of 3014768326492836 is 54906906
The square root of 24597324665761 is 4959569
The square root of 15594601 is 3949

the gpu version
The square root of 625 is 25
The square root of 10616005156 is 103033
The square root of 3014768326492836 is 54906904
The square root of 24597324665761 is 4959569
The square root of 15594601 is 3949

As you can see from my experiment, the cpu and gpu version is a little different, and this causes trouble to me

because I need at least the precision like the cpu version.

I think the main problem is the implementation of the cuda sqrt function.

it is sqrt(x) = 1/rsqrt(x);

If anyone has such problem like me, please tell me how to correct it.

Thanks.

An (unsigned) long long uses up to 64 bits to represent an integer. The effective mantissa length of an IEEE-754 single-precision floating-point number is 24 bits, while the effective mantissa length of an IEEE-754 double-precision floating-point number is 53 bits. Clearly not all integers representable by a long long or unsigned long long operand are exactly representable in a double operand, let alone a float operand. x86 CPUs may use extended precision floating-point numbers internally (when targetting the x87 FPU rather than SSEx), and this extended precision format uses 64 mantissa bits, thus allowing all long longs to be stored in an extended precision floating-point operand without loss of accuracy.

Note that on a sm_1x GPU, the single-precision sqrtf() is approximate; to get a single-precision square root rounded in compliance with IEEE-754 round-to-nearest-or-even mode, use the device function __fsqrt_rn() instead. On sm_2x devices, sqrtf() has IEEE-754 compliant round-to-nearest-or-even rounding by default, and the behavior is controllable with the -prec-sqrt={true|false} compiler flag. This flag makes sqrtf() either use IEEE-754 rounding (when -prec-sqrt=true) or approximate (when -prec-sqrt=false). The __frsqrt_rn() device function is available across GPUs of all compute capabilities, but it’s significantly faster on sm_2x devices (due to better HW support). The double precision function sqrt() always rounds according to the IEEE-754 round-to-nearest-or-even mode (on all devices that support double precision, i.e. sm_13 and higher).

When using square root operations with IEEE-754 rounding, and sticking to integer operands that are accurately representable in the selected floating-point format, the same result is achieved on both the CPU and the GPU.

An (unsigned) long long uses up to 64 bits to represent an integer. The effective mantissa length of an IEEE-754 single-precision floating-point number is 24 bits, while the effective mantissa length of an IEEE-754 double-precision floating-point number is 53 bits. Clearly not all integers representable by a long long or unsigned long long operand are exactly representable in a double operand, let alone a float operand. x86 CPUs may use extended precision floating-point numbers internally (when targetting the x87 FPU rather than SSEx), and this extended precision format uses 64 mantissa bits, thus allowing all long longs to be stored in an extended precision floating-point operand without loss of accuracy.

Note that on a sm_1x GPU, the single-precision sqrtf() is approximate; to get a single-precision square root rounded in compliance with IEEE-754 round-to-nearest-or-even mode, use the device function __fsqrt_rn() instead. On sm_2x devices, sqrtf() has IEEE-754 compliant round-to-nearest-or-even rounding by default, and the behavior is controllable with the -prec-sqrt={true|false} compiler flag. This flag makes sqrtf() either use IEEE-754 rounding (when -prec-sqrt=true) or approximate (when -prec-sqrt=false). The __frsqrt_rn() device function is available across GPUs of all compute capabilities, but it’s significantly faster on sm_2x devices (due to better HW support). The double precision function sqrt() always rounds according to the IEEE-754 round-to-nearest-or-even mode (on all devices that support double precision, i.e. sm_13 and higher).

When using square root operations with IEEE-754 rounding, and sticking to integer operands that are accurately representable in the selected floating-point format, the same result is achieved on both the CPU and the GPU.

This is the way I came around this problem,

I couldn’t get -prec-sqrt=true to work, so I went for this algorythm : Babylonian method, Methods of computing square roots - Wikipedia

this works like this:

for calculating square root of S:

x_0 = initial guess;
x[sub]n+1[/sub] = 1/2.0 * ( x[sub]n[/sub] + S / x[sub]n[/sub] );

as for my initial guess I chose sqrtf() from cuda!!
for example below sqrt(1234567891) is calculated:


double x_n;
x_n=sqrtf(1234567891);
double s=1234567891;
for (int tt=1;tt<10;++tt){
x_n=1.0/2.0*(x_n+s/x_n);
}


this is the results:

Matlab on CPU:
3.513641830067487e+004

Babylonian method above with GPU:
3.5136418300674872e+004

Cuda sqrtf:
3.513641796875e+004

as you can see above method reaches great resluts to be compared with cpu in very low number of iterations!!
( I haven’t tried optimizing that tt=10 number of iterations!!)