cuda double multiply error

Problem:
I have run into a very simple but strange error. When I multiply doubles in CUDA, the result differs depending on whether the operands come from registers (test0) or from global memory (test1). But when I additionally multiply by x4 (x4 = 1.0), the results become the same (test2 and test3). And when I multiply by 1.0 directly (test4 and test5), the results differ again!

Code:

#include <stdio.h>

__global__
void compute_on_gpu(double *a){

double x0 = a[1];
double x1 = a[6];
double x2 = a[2];
double x3 = a[5];
double x4 = a[15];

if(threadIdx.x == 0){ 
printf("test0: %.16f\n", x0*x1 - x2*x3);
printf("test1: %.16lf\n",a[1]*a[6]- a[2]*a[5]);

printf("test2: %.16f\n", x0*x1*x4 - x2*x3*x4);
printf("test3: %.16lf\n",a[1]*a[6]*a[15] - a[2]*a[5]*a[15]);

printf("test4: %.16f\n", x0*x1 - x2*x3);
printf("test5: %.16lf\n",a[1]*a[6]*1.0 - a[2]*a[5]*1.0);
}

}

int main(){

double a[16];

double *d_a;

cudaMalloc((void **)&d_a, sizeof(double)*16);


a[0] = 20.9952629726873141;
a[1] = 35187.4264233882131521;
a[2] = -95746.0777556623070268;
a[3] = 0.0;

a[4] = 72.2689035191256579;
a[5] = 120868.6083184806921054;
a[6] = -329572.6309625260764733;
a[7] = 0.0;

a[8] = -834.3825760590481195;
a[9] = 0.0000000000000000;
a[10] = -0.0242135821054153;
a[11] = 0.0;

a[12] = 0.0;
a[13] = 0.0;
a[14] = 0.0;
a[15] = 1.0; 

cudaMemcpy(d_a,a,sizeof(double)*16,cudaMemcpyHostToDevice);

compute_on_gpu<<<1,32>>>(d_a);

cudaDeviceReset();

return 0;
}

Result:

test0: -24117532.8764190673828125
test1: -24117532.8764189481735229

test2: -24117532.8764190673828125
test3: -24117532.8764190673828125

test4: -24117532.8764190673828125
test5: -24117532.8764189481735229

Question:
Can someone tell me what’s wrong in my code?

Compile with "nvcc --fmad=false".

Also, just to be sure, use the same format specifiers in all printf() calls. And use two underscores on each side of "global", not just one (but I am sure you got that right in the code you are actually compiling).

The basic computational building block of a GPU is the fused multiply-add, or FMA for short, which computes a*b+c such that the full, unmodified product (neither truncated nor rounded) enters into the addition, with just a single rounding at the end. Compared to an FMUL/FADD sequence, this provides higher performance, potentially lower power consumption, and improved average accuracy.

By default, the CUDA compiler will try to contract an FADD that depends on an FMUL into an FMA to reap these benefits. The selection of the operations to be contracted is implementation dependent and may differ between compiler versions or with compiler optimization level.
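
To make this concrete, here is a sketch (hypothetical; which form the compiler actually emits is implementation dependent and not something I have checked for this particular program) of the ways the difference of products from the question can be evaluated once contraction comes into play:

// Hypothetical ways the compiler may evaluate x0*x1 - x2*x3 (double precision).
double v1 = fma (x0, x1, -(x2 * x3));  // x2*x3 rounded, x0*x1 enters the FMA exactly
double v2 = fma (-x2, x3, x0 * x1);    // x0*x1 rounded, x2*x3 enters the FMA exactly
double v3 = (x0 * x1) - (x2 * x3);     // no contraction: both products rounded separately

All three are valid double-precision evaluations, but they can differ in the last few bits, which is exactly the kind of difference visible between test0 and test1.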

In most cases, FMA contraction by default is beneficial, but it can sometimes lead to confusing numerical outcomes (as in this case) and very rarely to harmful ones (e.g. producing a small negative argument to a square root whose mathematical result is zero). For these cases, the compiler allows programmers to turn off FMA contraction, as pointed out by tera in #2.
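
When only a few expressions need protection, contraction can also be suppressed locally, without touching the rest of the code, by using the explicitly rounded intrinsics, which the compiler does not contract into FMAs. A minimal sketch, assuming the standard double-precision device intrinsics __dmul_rn() and __dadd_rn():

// Sketch: a[1]*a[6] - a[2]*a[5] with FMA contraction suppressed locally.
// __dmul_rn() / __dadd_rn() round to nearest and are not merged into FMAs by the compiler.
double p = __dmul_rn (a[1], a[6]);
double q = __dmul_rn (a[2], a[5]);
double diff = __dadd_rn (p, -q);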

A fairly common idiom affected by FMA contraction, and one that programmers may want to address locally (as opposed to globally via the compiler switch) so as to retain the maximum benefit from the hardware's FMA capability, is the difference of two products. For this case, the literature offers the following robust and very accurate implementation:

/*
  diff_of_products() computes a*b-c*d with a maximum error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
double diff_of_products (double a, double b, double c, double d)
{
    double w = d * c;
    double e = fma (-d, c, w);
    double f = fma (a, b, -w);
    return f + e;
}
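
In the kernel from the question it would be used along these lines (a sketch; note that the function needs a __device__ or __host__ __device__ qualifier to be callable from device code):

// Sketch: replacing the hand-written difference inside compute_on_gpu()
if (threadIdx.x == 0) {
    printf ("robust: %.16f\n", diff_of_products (x0, x1, x2, x3));
}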

and in case anyone was wondering, the n in njuffa stands for ninja

Legend.