Hi all!

I'm running into trouble with a simple double-precision calculation: the GPU does not give the same result as the same operation performed on the host.

Here is a simple repro case:

```
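#include <cstdio>
#include <cuda_runtime.h>

// Prints a*b - c computed in several ways to show where the results diverge.
__global__ void kernel(double a, double b, double c)
{
    printf("a*b : %.25e \n", a*b);
    // Multiply and subtract in one expression; the compiler may contract this into an FMA.
    printf("a*b - c : %.25e \n", a*b-c);
    printf("a*b - 1 : %.25e \n", a*b-1.);
    // Explicit add intrinsic applied to the product.
    printf("NO FMA : %.25e \n", __dadd_rn(a*b , -c));
    // Product stored in a temporary before the subtraction.
    const double mult = a*b;
    printf("(a*b)-c : %.25e \n", mult - c);
}

int main()
{
    const double a = 9.9826421385793059926072601e+02;
    const double b = 1.0018032458425163780391109e-03;
    const double c = 1.;

    printf("--- GPU ---\n");
    kernel<<<1,1>>>(a,b,c);
    // Wait for the kernel to finish so its printf output is flushed.
    cudaDeviceSynchronize();

    printf("--- CPU ---\n");
    printf("a*b : %.25e \n", a*b);
    printf("a*b - c: %.25e \n", a*b-c);
    printf("a*b - 1: %.25e \n", a*b-1.);
    return 0;
}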
```

And here is what I get on my computer with my C2050:

```
nvcc test.cu -arch=sm_20 -o test
./test
--- GPU ---
a*b : 1.0000643296513027635796789e+00
a*b - c : 6.4329651302793110635540319e-05
a*b - 1 : 6.4329651302793110635540319e-05
NO FMA : 6.4329651302763579678867245e-05
(a*b)-c : 6.4329651302763579678867245e-05
--- CPU ---
a*b : 1.0000643296513027635796789e+00
a*b - c: 6.4329651302763579678867245e-05
a*b - 1: 6.4329651302763579678867245e-05
```

As you can see, the GPU result matches the host only when I use `__dadd_rn` or store the product in a temporary variable. Is this the expected behaviour?

If I understand the programming guide correctly, the calculation should always use IEEE-compliant rounding, whether or not FMA is used. Is that right?

Is there a way to disable FMA contraction at compile time?

Thanks in advance for any suggestion.