Precision Fail

Hello!

I have the following code:

#include <stdio.h>
#include <math.h>

__global__ void polevl_kernel(double x, double *ans) {
  double ans2;
  ans2 = double(-59.9633501014107895267)*x + double(98.0010754185999661536);
  ans2 = ans2*x + double(-56.6762857469070293439);
  ans2 = ans2*x + double(13.9312609387279679503);
  ans2 = ans2*x + double(-1.23916583867381258016);
  *ans = ans2;
}

double polevl_gpu(double x) {
  double ans;        // result copied back to the host
  double *ans_gpu;   // result slot in device memory
  cudaMalloc((void **)&ans_gpu, sizeof(double));
  dim3 Dimi_grid(1);
  dim3 DimYBlock(1);
  polevl_kernel<<< Dimi_grid, DimYBlock >>>(x, ans_gpu);
  cudaMemcpy(&ans, ans_gpu, sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(ans_gpu);
  cudaThreadExit();
  return ans;
}

double polevl(double x) {
  double ans_cpu;
  ans_cpu = double(-59.9633501014107895267)*x + double(98.0010754185999661536);
  ans_cpu = ans_cpu*x + double(-56.6762857469070293439);
  ans_cpu = ans_cpu*x + double(13.9312609387279679503);
  ans_cpu = ans_cpu*x + double(-1.23916583867381258016);
  return ans_cpu;
}

int main(int argc, char *argv[]) {
  double ans_cpu, ans_gpu;
  double d = 0.12873647402927101968117540309322066605091094970703;
  ans_gpu = polevl_gpu(d);
  ans_cpu = polevl(d);
  printf("GPU Result: %.50f\n", ans_gpu);
  printf("CPU Result: %.50f\n", ans_cpu);
  return 0;
}

When I execute it, I obtain the following output:

GPU Result: -0.19238382202673767751299749306781450286507606506348
CPU Result: -0.19238382202673776077972433995455503463745117187500

The tests have been done on:

    Tesla C2050 and GeForce GTX 465.

    CUDA compilation tools 3.2.

The failure persists in single precision, and it occurs only with negative numbers.

Does anyone know why this failure happens?

Thanks a lot!

The results differ only in the least significant bits. For all practical purposes, they are equal.

For more information about floating-point arithmetic and rounding, google “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg.
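To quantify “only in the least significant bits”, here is a minimal host-side sketch (mine, not from the thread) that reads the two printed results back in as doubles and counts how many units in the last place (ulps) apart they are:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

/* Reinterpret a double's IEEE-754 bit pattern as an integer. */
static uint64_t bits_of(double x) {
    uint64_t b;
    memcpy(&b, &x, sizeof b);  /* well-defined, unlike pointer casts */
    return b;
}

int main(void) {
    double gpu = -0.19238382202673767751299749306781450286507606506348;
    double cpu = -0.19238382202673776077972433995455503463745117187500;
    uint64_t bg = bits_of(gpu), bc = bits_of(cpu);
    printf("GPU bits: %016" PRIx64 "\n", bg);
    printf("CPU bits: %016" PRIx64 "\n", bc);
    /* For finite doubles of the same sign, the difference between the raw
       bit patterns is exactly the distance in units in the last place. */
    printf("ulp distance: %" PRIu64 "\n", bc > bg ? bc - bg : bg - bc);
    return 0;
}

Since both values are finite and share a sign, the integer difference of the raw bit patterns equals the ulp distance; for the two results above it works out to 3 ulps, which matches the bit-level analysis further down the thread.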

Least significant bits are very important if it is a medical application.

You are, I guess, assuming that the CPU is correct and the GPU is wrong. That might not actually be the case.

When you get to the level of the valid last digit of the significand, a single additional rounding operation can explain the difference. The GPU compiler is probably generating IEEE 754-2008 compliant fused multiply-adds; the CPU compiler is probably not. The only way to be sure is to write the arithmetic using primitives with the same rounding modes in both cases, and then disassemble the resulting code for both devices and compare them.

If the last bits of a double precision result are ‘very important’ in a medical application, then that is a medical application which I don’t want anything to do with, because the calculation is done on noise. For example, if the unit of distance is metres, then the last bit of double precision (coming in at the 10^-14 level) is roughly the radius of a uranium nucleus. Even single precision gets you to less than the thickness of a human hair. I somehow doubt that the input data will justify even the lower level of precision.

The numerical difference between the CPU and the GPU in this code is due to the last expression in polevl. This involves an effective subtraction of the constant from the rounded product ans_cpu*x. This leads to a mild case of subtractive cancellation on the CPU, as the operands are fairly close. The CPU computation thus “loses” 3 bits in this computation, i.e. the last three bits of the final result are zero bits carrying no useful information.

The GPU on the other hand evaluates this expression with a double-precision FMA (fused multiply-add), in which the unrounded product is retained for the effective subtraction, before rounding occurs. Thus the GPU result does not “lose” any bits, i.e. all bits of the final result carry useful information.
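The same effect can be demonstrated the other way around, on the host. A minimal sketch, assuming every multiply-add step of the polynomial gets contracted into an FMA on the GPU: the C99 fma() function from math.h rounds exactly once per step, just like the hardware instruction, so the function below (the name polevl_fma is mine) should reproduce the GPU result on the CPU:

#include <stdio.h>
#include <math.h>

/* Same polynomial as in the question, but each step uses C99 fma(), which,
   like the GPU's fused multiply-add, feeds the unrounded product into the
   addition and rounds only once per step. */
double polevl_fma(double x) {
    double ans = fma(-59.9633501014107895267, x, 98.0010754185999661536);
    ans = fma(ans, x, -56.6762857469070293439);
    ans = fma(ans, x, 13.9312609387279679503);
    ans = fma(ans, x, -1.23916583867381258016);
    return ans;
}

int main(void) {
    double d = 0.12873647402927101968117540309322066605091094970703;
    printf("CPU with fma(): %.50f\n", polevl_fma(d));
    return 0;
}

Link with -lm; even if the CPU lacks a hardware FMA instruction, a correctly rounded libm implementation of fma() should still yield the same bits.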

To simulate the lower-accuracy CPU computation on the GPU, one could explicitly code double-precision multiplies and adds at source level using the appropriate intrinsics provided by CUDA. This prevents the compiler from contracting these operations into a fused multiply-add:

#if 0
  ans2 = ans2*x + double(-1.23916583867381258016);
#else
  ans2 = __dadd_rn (__dmul_rn (ans2, x), double(-1.23916583867381258016));
#endif
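Applied to the whole computation, a minimal sketch of an FMA-free kernel variant (the name polevl_kernel_no_fma is mine) looks like this:

__global__ void polevl_kernel_no_fma(double x, double *ans) {
  // Explicitly rounded intrinsics keep each multiply and add separate,
  // so the compiler cannot contract them into fused multiply-adds.
  double ans2;
  ans2 = __dadd_rn(__dmul_rn(-59.9633501014107895267, x), 98.0010754185999661536);
  ans2 = __dadd_rn(__dmul_rn(ans2, x), -56.6762857469070293439);
  ans2 = __dadd_rn(__dmul_rn(ans2, x), 13.9312609387279679503);
  ans2 = __dadd_rn(__dmul_rn(ans2, x), -1.23916583867381258016);
  *ans = ans2;
}

Launched in place of the original kernel, this variant should reproduce the CPU result bit for bit.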

The following shows the details of the expression evaluation. One can clearly see why the GPU result differs from the CPU result:

CPU
===
//
   4020432ca5391904   8.1311999923896181e+00
*  3fc07a6fd0e10884   1.2873647402927102e-01
--------------------------------------------
=  3ff0bf9e7ff8a44e   1.0467820166470747e+00
//
   3ff0bf9e7ff8a44e   1.0467820166470747e+00
+  bff3d39f8ef6ca7e  -1.2391658386738125e+00
--------------------------------------------
=  bfc8a00877f13180  -1.9238382202673776e-01
//
details of effective subtraction:
//
   1.0BF9E7FF8A44E * 2^0
-  1.3D39F8EF6CA7E * 2^0
------------------------
= -0.314010EFE2630 * 2^0  (normalize by 3-bit left shift)
========================
= -1.8A00877F13180 * 2^-3

GPU with FMA
============
//
   4020432ca5391904   8.1311999923896181e+00
*  3fc07a6fd0e10884   1.2873647402927102e-01
--------------------------------------------
=  3ff0bf9e7ff8a44e_5646d36bd061  (unrounded intermediate product)
//
   3ff0bf9e7ff8a44e_5646d36bd061
+  bff3d39f8ef6ca7e  -1.2391658386738125e+00
--------------------------------------------
=  bfc8a00877f1317d  -1.9238382202673768e-01
//
details of effective subtraction:
//
   1.0BF9E7FF8A44E_5646D36BD061 * 2^0
-  1.3D39F8EF6CA7E_000000000000 * 2^0
-------------------------------------
= -0.314010EFE262F_A9B92C942F9F * 2^0  (normalize by 3-bit left shift)
=====================================
= -1.8A00877F1317D_4DC......... * 2^-3
=====================================
= -1.8A00877F1317D * 2^-3