FMA precision issue

Hi all!

I’m running into trouble with a simple calculation in double precision: I’m not getting the same result as when the same operation is performed on the host.

Here is a simple repro case:

#include <cstdio>
#include <cuda.h>

__global__ void kernel(double a, double b, double c)
{
    printf("a*b      : %.25e \n", a*b);
    printf("a*b - c  : %.25e \n", a*b-c);
    printf("a*b - 1  : %.25e \n", a*b-1.);
    printf("NO FMA   : %.25e \n", __dadd_rn(a*b, -c));

    const double mult = a*b;
    printf("(a*b)-c  : %.25e \n", mult - c);
}

int main()
{
    const double a = 9.9826421385793059926072601e+02;
    const double b = 1.0018032458425163780391109e-03;
    const double c = 1.;

    printf("--- GPU ---\n");
    kernel<<<1,1>>>(a,b,c);
    cudaThreadSynchronize();

    printf("--- CPU ---\n");
    printf("a*b    : %.25e \n", a*b);
    printf("a*b - c: %.25e \n", a*b-c);
    printf("a*b - 1: %.25e \n", a*b-1.);

    return 0;
}

And here is what I get on my computer with my C2050:

nvcc test.cu -arch=sm_20 -o test
./test

--- GPU ---
a*b      : 1.0000643296513027635796789e+00
a*b - c  : 6.4329651302793110635540319e-05
a*b - 1  : 6.4329651302793110635540319e-05
NO FMA   : 6.4329651302763579678867245e-05
(a*b)-c  : 6.4329651302763579678867245e-05
--- CPU ---
a*b    : 1.0000643296513027635796789e+00
a*b - c: 6.4329651302763579678867245e-05
a*b - 1: 6.4329651302763579678867245e-05

As one can see, the result matching the CPU is obtained only when using __dadd_rn or when going through a temporary variable. Is this the correct behaviour?

If I understand the programming guide correctly, the calculation should always use the IEEE-compliant rounding mode, whether FMA is used or not. Correct?

Is there a way to avoid FMA at compile time?

Thanks in advance for any suggestion.

Isn’t that an error of many ulps?

Well, in this case it’s most likely your host that’s not entirely IEEE compliant. The IEEE standard is defined for 64-bit doubles, but if you don’t use SSE on Intel x86 CPUs, all floating-point arithmetic is done in 80-bit registers (the x87 path). You should check whether the extra mantissa bits of that 80-bit format make a difference for your particular numbers. Also, there are compiler flags (definitely for icc, probably for gcc as well) that enforce 64-bit arithmetic. Or you could try SSE intrinsics (supported by pretty much all C compilers today) on the host and check the result again.
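
To make the last suggestion concrete, here is a minimal host-side sketch using SSE2 scalar intrinsics (assuming an x86 host with SSE2 and a compiler that provides <emmintrin.h>; this snippet is illustrative, not from the original post):

#include <cstdio>
#include <emmintrin.h>  // SSE2 scalar double intrinsics

int main()
{
    const double a = 9.9826421385793059926072601e+02;
    const double b = 1.0018032458425163780391109e-03;
    const double c = 1.;

    // Force the multiply and the subtract through 64-bit SSE2 scalar operations,
    // so no 80-bit x87 intermediate precision can enter the computation.
    __m128d prod = _mm_mul_sd(_mm_set_sd(a), _mm_set_sd(b));
    __m128d diff = _mm_sub_sd(prod, _mm_set_sd(c));

    printf("a*b - c (SSE2): %.25e \n", _mm_cvtsd_f64(diff));
    return 0;
}

Alternatively, gcc can be told to keep doubles in SSE registers with -mfpmath=sse -msse2 (icc offers similar switches), which should give the same 64-bit-rounded result without touching the source.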

Sorry if the formatting is off in the following; I am still adjusting to the new forum software.

This is a classical example where FMA (fused multiply-add) delivers a more accurate result than separate, individually rounded, FMUL and FADD operations. The product a * b is close in magnitude to c. If the product is first rounded to double precision (via FMUL) and c is then subtracted, the leading mantissa bits cancel during the subtraction, and during renormalization of the mantissa zero bits are shifted in from the right (least significant) side. These zero bits carry no information.

   x = 9.9826421385793060e+002 (408f321d1c27b7b2)
   y = 1.0018032458425164e-003 (3f5069de0b630ba4)
prod = 1.0000643296513028e+000 (3ff00043745bf9e4)  // double-precision FMUL, round-to-nearest-even
 sum = 6.4329651302763580e-005 (3f10dd16fe790000)  // double-precision FADD, round-to-nearest-even

One can clearly see the trailing zero bits in the hex representation of the final sum. FMA helps with this problem of subtractive cancellation by retaining all mantissa bits of the intermediate product, without any rounding or truncation, and propagating them into the addition. As leading mantissa bits cancel during the effective subtraction, valid result bits (instead of zero bits) are pulled in at the right (least significant) side. Rounding then occurs only once, to the result of the final addition.

   x = 9.9826421385793060e+002 (408f321d1c27b7b2)
   y = 1.0018032458425164e-003 (3f5069de0b630ba4)
 fma = 6.4329651302793111e-005 (3f10dd16fe790883)  // double-precision FMA, round-to-nearest-even

The CPU results above reflect two separate IEEE-754-compliant operations: a double-precision FMUL followed by a double-precision FADD. The GPU results demonstrate both alternatives available on the GPU: the last two lines come from an IEEE-754-compliant double-precision FMUL followed by an IEEE-754-compliant double-precision FADD, while the two lines before them come from a single IEEE-754-compliant double-precision FMA. (Incidentally, the two sums 3f10dd16fe790000 and 3f10dd16fe790883 differ by 0x883 = 2179 units in the last place, which is the “many ulp” difference noted earlier: it is a consequence of the cancellation, not of any non-compliant rounding.) In other words:

a*b - c  : 6.4329651302793110635540319e-05 // uses FMA
a*b - 1  : 6.4329651302793110635540319e-05 // uses FMA
NO FMA   : 6.4329651302763579678867245e-05 // uses FMUL, FADD
(a*b)-c  : 6.4329651302763579678867245e-05 // uses FMUL, FADD

In general, the compiler will try to merge FMUL and FADD into an FMA at full optimization. If one needs finer control as to where this merging occurs and where not, one can use the __fma_rn() device function to force the use of FMA, and the __dmul_rn(), __dadd_rn() device functions to force separate FMUL, FADD instructions. The CUDA math library is an example of software where __fma_rn() is used extensively to control where FMAs are being deployed.
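
As a small illustration (re-using the values from the repro case above; this kernel is a sketch, not part of the original post), the intrinsics make the two code paths explicit and can be dropped into the repro program in place of the original kernel:

#include <cstdio>

__global__ void kernel(double a, double b, double c)
{
    // Single rounding: the full-precision product a*b feeds directly into the FMA.
    double with_fma    = __fma_rn(a, b, -c);
    // Two roundings: the product is rounded by FMUL before the separately rounded FADD.
    double without_fma = __dadd_rn(__dmul_rn(a, b), -c);
    printf("FMA      : %.25e \n", with_fma);
    printf("FMUL+FADD: %.25e \n", without_fma);
}

If the goal is simply to disable the contraction everywhere at compile time, recent nvcc versions also accept --fmad=false (check your toolkit’s documentation for availability), at the cost of both performance and, as shown above, accuracy.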

Thanks for this comprehensive answer.

I guess I should be glad the GPU is more precise than my CPU. :)
