printf affecting result of subtraction

Greetings,

a subtraction that should result in zero instead gives a tiny nonzero value on the order of 1e-18.
The result changes to zero if I put printf statements inside the kernel.
I'm completely at a loss as to why; any ideas?

#include <stdio.h>

//
// Nearly minimal CUDA example.
// Compile with:
//
// nvcc -o example example.cu
//

__global__
void add_wrong(const double *arr1, double *arr2, const int N)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        const double n1 = arr1[0];
        const double n2 = arr1[1];
        const double n3 = arr1[2];

        const double result = n1*n2 - n3*n2;

        if (i == 1)
        {
            const double a = n1*n2;
            const double b = n3*n2;
            printf("GPU wrong:\tid = %d\ta = %.e\t\t\ta - b = %.e\n", i, a, a - b);
        }
        arr2[i] = result;
    }
}

__global__
void add_correct(const double *arr1, double *arr2, const int N)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        const double n1 = arr1[0];
        const double n2 = arr1[1];
        const double n3 = arr1[2];

        const double result = n1*n2 - n3*n2;

        if (i == 1)
        {
            const double a = n1*n2;
            const double b = n3*n2;
            printf("GPU correct:\tid = %d\ta = %.e\tb = %.e\ta - b = %.e\n", i, a, b, a - b);
        }
        arr2[i] = result;
    }
}
int main() {
    cudaSetDevice(0);
    // Number of elements
    const int N = 3;
    //
    // Create double arrays on the CPU.
    // ('h' stands for "host".)
    //
    double ha[N];
    double hb[N];
    //
    // Create corresponding double arrays on the GPU.
    // ('d' stands for "device".)
    //
    double *da;
    double *db;
    cudaMalloc((void **)&da, N*sizeof(double));
    cudaMalloc((void **)&db, N*sizeof(double));

    //
    // Initialise the input data on the CPU.
    //
    ha[0] = 0.1;
    ha[1] = 0.2;
    ha[2] = 0.1;

    //
    // Copy input data to array on GPU.
    //
    cudaMemcpy(da, ha, N*sizeof(double), cudaMemcpyHostToDevice);

    //
    // Launch GPU code with N threads, one per
    // array element.
    //
    add_wrong<<<N, 1>>>(da, db, N);
    cudaDeviceSynchronize();
    //
    // Copy output array from GPU back to CPU.
    //
    cudaMemcpy(hb, db, N*sizeof(double), cudaMemcpyDeviceToHost);

    for (int i = 0; i<N; ++i) {
        printf("CPU:	%d	 %e\n", i, hb[i]);
    }

    printf("\n\n");

    //
    // Launch GPU code with N threads, one per
    // array element.
    //
    add_correct<<<N, 1>>>(da, db, N);
    cudaDeviceSynchronize();
    //
    // Copy output array from GPU back to CPU.
    //
    cudaMemcpy(hb, db, N*sizeof(double), cudaMemcpyDeviceToHost);
    printf("\n\n");
    for (int i = 0; i<N; ++i) {
        printf("CPU:	%d	 %e\n", i, hb[i]);
    }

    //
    // Free up the arrays on the GPU.
    //
    cudaFree(da);
    cudaFree(db);

    return 0;
}

The important difference is that the correct kernel prints both intermediate values a and b, while the incorrect kernel prints only a.

Output:
GPU wrong:	id = 1	a = 2e-02			a - b = 2e-18
CPU:	0	 1.665335e-18
CPU:	1	 1.665335e-18
CPU:	2	 1.665335e-18

GPU correct:	id = 1	a = 2e-02	b = 2e-02	a - b = 0e+00

CPU:	0	 0.000000e+00
CPU:	1	 0.000000e+00
CPU:	2	 0.000000e+00

Edit:
OS: Ubuntu 16.04.3 LTS
Driver Version: 384.90
Cuda compilation tools, release 9.0, V9.0.176

This is one situation where FMA contraction can bite you in the behind. FMA is the fused multiply-add operation, which is the basic building block of modern floating-point hardware; that likely includes your CPU, if it is Haswell or a later architecture. It computes a*b+c (or a*b-c) using the full, unrounded product and a single rounding after the addition.

FMA contraction means that the compiler contracts t = a*b followed by t + c into fma(a, b, c). This transformation is turned on by default in the CUDA toolchain, and you should be able to turn it on in your host toolchain if you have an FMA-capable CPU. Normally, FMA contraction is a good thing, both for performance and for accuracy. However, the computation a*b - c*d is a classical counterexample. When it is transformed into fma(a, b, -c*d), the product a*b enters the subtraction exactly (unrounded), while the product c*d is rounded first, so in general the two terms are no longer exactly equal, even if they are mathematically identical.
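
To make this concrete, here is a small host-side sketch (my addition, not from the original program) that mimics what the contracted device code effectively computes, using fma() from <math.h> and the input values from the question. Compile with host contraction disabled (e.g. gcc -ffp-contract=off demo.c -lm) so the first expression is not itself contracted:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double n1 = 0.1, n2 = 0.2, n3 = 0.1;

    // Without contraction: both products are rounded to double, produce
    // the same bit pattern (n1 == n3), and the difference is exactly zero.
    const double no_fma = (n1 * n2) - (n3 * n2);

    // With contraction: n1*n2 enters the FMA unrounded, while n3*n2 has
    // already been rounded, so the difference is the rounding error of
    // the rounded product -- a tiny nonzero value, as in the GPU output.
    const double with_fma = fma(n1, n2, -(n3 * n2));

    printf("without FMA: %e\n", no_fma);   // expected: 0.000000e+00
    printf("with FMA:    %e\n", with_fma); // expected: on the order of 1e-18
    return 0;
}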

In your example, the use of printf() interferes with the FMA contraction, which is why you observe the expected zero result there.

You can turn off FMA contraction globally by passing the command line switch -fmad=false to the CUDA compiler. Obviously, this will affect all code in the compilation unit, and it is not something one would typically want to do outside of debugging situations… If you want to address the issue with surgical precision, you can prevent FMA contraction by using intrinsics for the multiplication and/or subtraction:

__dsub_rn (__dmul_rn (a, b), __dmul_rn (c, d)) // a*b - c*d
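
Applied to the kernel from your question, that would look roughly like the sketch below (the kernel name add_no_fma is just my placeholder; the intrinsics are documented never to be merged into a multiply-add instruction):

__global__
void add_no_fma(const double *arr1, double *arr2, const int N)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        const double n1 = arr1[0];
        const double n2 = arr1[1];
        const double n3 = arr1[2];

        // Both products are rounded to double before the subtraction,
        // so the result is exactly zero for the inputs in the question.
        arr2[i] = __dsub_rn(__dmul_rn(n1, n2), __dmul_rn(n3, n2));
    }
}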

If you want a robust implementation for this specific computation that takes full advantage of FMA, you could try this (this is the ‘float’ version, the ‘double’ version is analogous):

/*
  Compute a*b-c*d with error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants". 
  Mathematics of Computation, Vol. 82, No. 284, Oct. 2013, pp. 2245-2264
*/
__forceinline__ __device__ float diff_of_products (float a, float b, float c, float d)
{
    float w = d * c;           // rounded product c*d
    float e = fmaf (-d, c, w); // rounding error of w (w - c*d), exact thanks to FMA
    float f = fmaf (a, b, -w); // a*b - w, with the product a*b unrounded
    return f + e;              // add the error term back in
}
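
Since your code works in double precision, a double overload along the same lines would be (my transcription of the float version above, using fma() instead of fmaf()):

__forceinline__ __device__ double diff_of_products (double a, double b, double c, double d)
{
    double w = d * c;          // rounded product c*d
    double e = fma (-d, c, w); // rounding error of w (w - c*d), exact thanks to FMA
    double f = fma (a, b, -w); // a*b - w, with the product a*b unrounded
    return f + e;              // add the error term back in
}

In your kernel you would then compute arr2[i] = diff_of_products (n1, n2, n3, n2);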

@ njuffa,

Thanks for the detailed answer. Your solutions are exactly what I needed.