FMA precision issue

Hi all!

I’m running into trouble with a simple calculation in double precision: I’m not getting the same result as when the same operation is performed on the host.

Here is a simple repro case:

#include <cstdio>
#include <cuda.h>

__global__ void kernel(double a, double b, double c)
{
    printf("a*b      : %.25e \n", a*b);
    printf("a*b - c  : %.25e \n", a*b-c);
    printf("a*b - 1  : %.25e \n", a*b-1.);
    printf("NO FMA   : %.25e \n", __dadd_rn(a*b, -c));

    const double mult = a*b;
    printf("(a*b)-c  : %.25e \n", mult - c);
}

int main()
{
    const double a = 9.9826421385793059926072601e+02;
    const double b = 1.0018032458425163780391109e-03;
    const double c = 1.;

    printf("--- GPU ---\n");
    kernel<<<1,1>>>(a,b,c);
    cudaThreadSynchronize();

    printf("--- CPU ---\n");
    printf("a*b    : %.25e \n", a*b);
    printf("a*b - c: %.25e \n", a*b-c);
    printf("a*b - 1: %.25e \n", a*b-1.);

    return 0;
}

And here is what I get on my computer with my C2050:

nvcc test.cu -arch=sm_20 -o test
./test

--- GPU ---
a*b      : 1.0000643296513027635796789e+00
a*b - c  : 6.4329651302793110635540319e-05
a*b - 1  : 6.4329651302793110635540319e-05
NO FMA   : 6.4329651302763579678867245e-05
(a*b)-c  : 6.4329651302763579678867245e-05
--- CPU ---
a*b    : 1.0000643296513027635796789e+00
a*b - c: 6.4329651302763579678867245e-05
a*b - 1: 6.4329651302763579678867245e-05

As one can see, the result matching the CPU is obtained only when using __dadd_rn or when going through a temporary variable. Is this the correct behaviour?

If I understand the programming guide correctly, the calculation should always use the IEEE-compliant rounding mode, whether FMA is used or not. Correct?

Is there a way to avoid FMA at compile time?

Thanks in advance for any suggestion.

Isn’t that an error of many ulps?

Well, in this case it’s most likely your host that’s not entirely IEEE compliant. The IEEE standard is defined for 64-bit doubles, but if you don’t use SSE on Intel x86 CPUs, all floating-point arithmetic is done in 80-bit registers (the x87 path). You should check whether the extra mantissa bits of that 80-bit format make a difference for your particular numbers. Also, there are compiler flags (definitely for icc, probably for gcc as well) that enforce 64-bit arithmetic. Or you could try SSE intrinsics (supported by pretty much all C compilers today) on the host and check the result again.
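
To make the last suggestion concrete, here is a minimal host-side sketch using SSE2 scalar intrinsics (assuming an x86 host with SSE2 and a compiler that provides <emmintrin.h>; this snippet is illustrative, not from the original post):

#include <cstdio>
#include <emmintrin.h>  // SSE2 scalar double intrinsics

int main()
{
    const double a = 9.9826421385793059926072601e+02;
    const double b = 1.0018032458425163780391109e-03;
    const double c = 1.;

    // Force the multiply and the subtract through 64-bit SSE2 scalar operations,
    // so no 80-bit x87 intermediate precision can enter the computation.
    __m128d prod = _mm_mul_sd(_mm_set_sd(a), _mm_set_sd(b));
    __m128d diff = _mm_sub_sd(prod, _mm_set_sd(c));

    printf("a*b - c (SSE2): %.25e \n", _mm_cvtsd_f64(diff));
    return 0;
}

Alternatively, gcc can be told to keep doubles in SSE registers with -mfpmath=sse -msse2 (icc offers similar switches), which should give the same 64-bit-rounded result without touching the source.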

Sorry if the formatting is off in the following; I am still adjusting to the new forum software.

This is a classical example where FMA (fused multiply-add) delivers a more accurate result than separate, individually rounded, FMUL and FADD operations. The product a * b is close in magnitude to c. If the product is first rounded to double precision (via FMUL) and c is then subtracted, the leading mantissa bits cancel during the subtraction, and during renormalization of the mantissa zero bits are shifted in from the right (least significant) side. These zero bits carry no information.

   x = 9.9826421385793060e+002 (408f321d1c27b7b2)
   y = 1.0018032458425164e-003 (3f5069de0b630ba4)
prod = 1.0000643296513028e+000 (3ff00043745bf9e4)  // double-precision FMUL, round-to-nearest-even
 sum = 6.4329651302763580e-005 (3f10dd16fe790000)  // double-precision FADD, round-to-nearest-even

One can clearly see the trailing zero bits in the hex representation of the final sum. FMA helps with this problem of subtractive cancellation by retaining all mantissa bits of the intermediate product, without any rounding or truncation, and propagating them into the addition. As leading mantissa bits cancel during the effective subtraction, valid result bits (instead of zero bits) are pulled in at the right (least significant) side. Rounding then occurs only once, to the result of the final addition.

   x = 9.9826421385793060e+002 (408f321d1c27b7b2)
   y = 1.0018032458425164e-003 (3f5069de0b630ba4)
 fma = 6.4329651302793111e-005 (3f10dd16fe790883)  // double-precision FMA, round-to-nearest-even

The CPU results above reflect two separate IEEE-754-compliant operations: a double-precision FMUL followed by a double-precision FADD. The GPU results demonstrate both alternatives available on the GPU: the last two lines come from an IEEE-754-compliant double-precision FMUL followed by an IEEE-754-compliant double-precision FADD, while the two lines before them come from a single IEEE-754-compliant double-precision FMA. (Incidentally, the two sums 3f10dd16fe790000 and 3f10dd16fe790883 differ by 0x883 = 2179 units in the last place, which is the “many ulp” difference noted earlier: it is a consequence of the cancellation, not of any non-compliant rounding.) In other words:

a*b - c  : 6.4329651302793110635540319e-05 // uses FMA
a*b - 1  : 6.4329651302793110635540319e-05 // uses FMA
NO FMA   : 6.4329651302763579678867245e-05 // uses FMUL, FADD
(a*b)-c  : 6.4329651302763579678867245e-05 // uses FMUL, FADD

In general, the compiler will try to merge FMUL and FADD into an FMA at full optimization. If one needs finer control as to where this merging occurs and where not, one can use the __fma_rn() device function to force the use of FMA, and the __dmul_rn(), __dadd_rn() device functions to force separate FMUL, FADD instructions. The CUDA math library is an example of software where __fma_rn() is used extensively to control where FMAs are being deployed.
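
As a small illustration (re-using the values from the repro case above; this kernel is a sketch, not part of the original post), the intrinsics make the two code paths explicit and can be dropped into the repro program in place of the original kernel:

#include <cstdio>

__global__ void kernel(double a, double b, double c)
{
    // Single rounding: the full-precision product a*b feeds directly into the FMA.
    double with_fma    = __fma_rn(a, b, -c);
    // Two roundings: the product is rounded by FMUL before the separately rounded FADD.
    double without_fma = __dadd_rn(__dmul_rn(a, b), -c);
    printf("FMA      : %.25e \n", with_fma);
    printf("FMUL+FADD: %.25e \n", without_fma);
}

If the goal is simply to disable the contraction everywhere at compile time, recent nvcc versions also accept --fmad=false (check your toolkit’s documentation for availability), at the cost of both performance and, as shown above, accuracy.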

Thanks for this comprehensive answer.

I guess I should be glad the GPU is more precise than my CPU. :)
