CPU and GPU floating point calculation results are different

I was working on parallelizing a neural network training algorithm, and during testing I noticed that the results of the original version and of the GPU version were slightly different: the CPU would learn the same training set, using the same algorithm, in fewer iterations than the GPU.

Maybe my GPU implementation of the training algorithm is somehow faulty, but nevertheless I started wondering what makes the difference, so I wrote this little test program that does nothing but a few float calculations and prints the results. (Never mind the excessive use of global memory in this example; its only purpose is to see how the floating point calculations work out.)

Here is the full source code:

#include <stdio.h>

// Float calculation test function for Device
__global__ void floatTestDevice(float *d_A, float *d_B, float *d_C) {
	int tx = threadIdx.x;
	for (int i = 0; i < 10; i++) {
		d_C[tx] += d_A[tx] * d_B[tx];
	}
}

// Float calculation test function for Host
void floatTestHost(float *h_A, float *h_B, float *h_C, int amount) {
	for (int tx = 0; tx < amount; tx++) {
		for (int i = 0; i < 10; i++) {
			h_C[tx] += h_A[tx] * h_B[tx];
		}
	}
}

int main() {
	// Declare variables A, B and C for both Host and Device
	float *h_A, *h_B, *h_C, *h_dC;	// h_dC will contain d_C in Host memory
	float *d_A, *d_B, *d_C;
	int amount = 10;	// Size of arrays and num of iterations for loop

	// Allocate space for Host and Device variables
	h_A = (float*)malloc(sizeof(float)*amount);
	h_B = (float*)malloc(sizeof(float)*amount);
	h_C = (float*)malloc(sizeof(float)*amount);
	h_dC = (float*)malloc(sizeof(float)*amount);
	cudaMalloc((void**) &d_A, sizeof(float)*amount);
	cudaMalloc((void**) &d_B, sizeof(float)*amount);
	cudaMalloc((void**) &d_C, sizeof(float)*amount);

	// Set some float values to calculate (just random)
	for (int i = 0; i < amount; i++) {
		h_A[i] = 5.125436/(i+1);
		h_B[i] = 8.234534/(i+1);
		h_C[i] = 0;
	}

	// Copy initial values from Host to Device
	cudaMemcpy(d_A, h_A, sizeof(float)*amount, cudaMemcpyHostToDevice);
	cudaMemcpy(d_B, h_B, sizeof(float)*amount, cudaMemcpyHostToDevice);
	cudaMemcpy(d_C, h_C, sizeof(float)*amount, cudaMemcpyHostToDevice);

	// Run test on Device and copy result from Device to Host
	floatTestDevice<<< 1, amount >>>(d_A, d_B, d_C);
	cudaMemcpy(h_dC, d_C, sizeof(float)*amount, cudaMemcpyDeviceToHost);

	// Run test on Host
	floatTestHost(h_A, h_B, h_C, amount);

	// Print and compare results
	printf("Results: CPU -- GPU\t[difference]\n");
	for (int i = 0; i < amount; i++) {
		printf("%f %s %f ", h_C[i], ((h_C[i] == h_dC[i])?"==":"!="), h_dC[i]);
		if (h_C[i] != h_dC[i]) {
			printf("\tdiff: %f", (h_C[i]-h_dC[i]));
		}
		printf("\n");
	}

	cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
	free(h_A); free(h_B); free(h_C); free(h_dC);
	return 0;
}

The difference between the CPU and GPU answers is surprisingly big considering the small number of float multiplications and additions performed:

Results: CPU -- GPU	[difference]
422.055756 != 422.055695 	diff: 0.000061
105.513939 != 105.513924 	diff: 0.000015
46.895081 == 46.895081
26.378485 != 26.378481 	diff: 0.000004
16.882231 == 16.882231
11.723770 == 11.723770
8.613382 == 8.613382
6.594621 != 6.594620 	diff: 0.000001
5.210565 == 5.210565
4.220558 == 4.220558

So I figured that with my NN algorithm, which performs hundreds of times more float calculations than this example, the difference in the results gets pretty wild.

Is there something wrong with my code or do we just have to live with the hardware differences?

I think one of the problems may be that the GPU uses a MAD (multiply-add) instruction which does an intermediate rounding to zero. There are ways to disable that “feature” on the GPU and to force the compiler to generate explicit multiply and add instructions. Then the results of CPU and GPU are likely to be closer (if not identical).
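For example, one possible way to do this at build time, assuming a toolkit recent enough to support the flag (check the nvcc documentation for your version; the source file name below is just a placeholder), is to disable multiply-add contraction globally:

[codebox]nvcc --fmad=false -o floattest floattest.cu[/codebox]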

There is another difference you should know about: Extremely small numbers (so-called denormals) are not handled by the GPU. They are flushed to zero. This can make a huge difference for some algorithms where intermediate results can become tiny.

Transcendental functions (sin, cos, sqrt, etc.) can also have implementation differences that could create different results. Your code doesn’t currently use these. The maximum inaccuracy is documented in the nVidia programming guide, given in units in the last place (ULPs).
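As a quick illustration (not from the original code), the standard device math functions can be contrasted with their fast hardware intrinsics; the intrinsics are faster but carry a larger documented ULP error:

[codebox]// Sketch only: slow[] gets the library sinf(), fast[] the __sinf() intrinsic
__global__ void sinCompare(const float *x, float *slow, float *fast) {
	int tx = threadIdx.x;
	slow[tx] = sinf(x[tx]);    // standard single-precision device math function
	fast[tx] = __sinf(x[tx]);  // fast intrinsic, lower accuracy
}[/codebox]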


Please note that double precision has supported denormals from the time of its introduction (i.e. since sm_13), and denormal support was added to single precision on sm_2x devices (Fermi). sm_2x devices give users the choice between denormal support and flush-to-zero behavior; this is exposed via the compiler switch -ftz={true|false}.
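As an illustration (flag spelling per the nvcc documentation of the toolkit in use; the file and output names are just placeholders), the two behaviors on an sm_2x device could be selected like this:

[codebox]nvcc -arch=sm_20 -ftz=false -o test_denormals test.cu   # keep single-precision denormals
nvcc -arch=sm_20 -ftz=true  -o test_ftz       test.cu   # flush denormals to zero[/codebox]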

As cbuchner1 says, there are some differences at both HW and SW level between the CPU and the GPU that may cause the differences observed in this code.

(1) On the CPU (host), intermediate values in the evaluation of an expression are often kept in registers, where they are represented with (at least) double precision. To force the CPU to calculate strictly in single precision and allow for more of an apples-to-apples comparison, one could change the code as follows:

[codebox]for (int i = 0; i < 10; i++) {
	volatile float tmp;
	tmp = h_A[tx] * h_B[tx];
	tmp = h_C[tx] + tmp;
	h_C[tx] = tmp;
}[/codebox]

(2) On the GPU (device), a single-precision multiply followed by a single-precision add is often contracted by the compiler into an FMAD (sm_1x devices) / FMA (sm_2x devices) as a performance optimization. The results of either differ slightly from a sequence of an IEEE-rounded multiplication followed by an IEEE-rounded addition. In the FMAD, the product is truncated to single precision before being passed to the addition, while in the FMA (fused multiply-add) the full, double-width product is passed to the addition. Because of this single rounding, FMA has many desirable numerical properties, which in turn allow, for example, for the efficient implementation of IEEE-rounded division and square root. In order to prevent the compiler from merging multiplication and addition into a single instruction, one can use the device functions for single-precision add/multiply with round-to-nearest-or-even:

[codebox]for (int i = 0; i < 10; i++) {
	d_C[tx] = __fadd_rn(__fmul_rn(d_A[tx], d_B[tx]), d_C[tx]);
}[/codebox]


In all cases of your example where there is a difference between host and device results, it is exactly one ulp. So I would not be worried. The differences can be avoided using Kahan summation.
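For completeness, here is a minimal sketch of what Kahan (compensated) summation could look like for the host-side accumulation loop of the example above; the function and variable names are just illustrative, and an optimizing compiler may need the compensation arithmetic kept strictly in single precision (e.g. via volatile or by avoiding aggressive floating-point optimizations) for the technique to survive:

[codebox]// Illustrative sketch: Kahan summation of the ten a*b terms per element
void floatTestHostKahan(float *h_A, float *h_B, float *h_C, int amount) {
	for (int tx = 0; tx < amount; tx++) {
		float sum  = h_C[tx];
		float comp = 0.0f;                       // running compensation for lost low-order bits
		for (int i = 0; i < 10; i++) {
			float term = h_A[tx] * h_B[tx] - comp;
			float next = sum + term;             // low-order bits of term may be lost here...
			comp = (next - sum) - term;          // ...and are recovered into the compensation
			sum  = next;
		}
		h_C[tx] = sum;
	}
}[/codebox]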
