Floating point operations difference between CPU and GPU

Hello, I have an OpenCL kernel that computes a dot product between two float arrays.
The first is an array of size*n elements and the second is an array of n elements. This is the code:

__kernel
void evaluate_product(__global const float *pFirstArray, const int n,
                      __global const float *pSecondArray, __global float *pOutput)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    // Each work-item computes one output element.
    if (gid >= 0 && gid < size) {
        float output = 0;
        for (int k = 0; k < n; k++)
            output += pFirstArray[gid + k*size] * pSecondArray[k];
        pOutput[gid] = output;
    }
}
If I execute the same operations on the CPU, I get different results, mostly from the 6th or 7th decimal digit onward. Why this strange behaviour? The Khronos OpenCL specification (v1.2) says that x+y and x*y are correctly rounded and IEEE 754 compliant. Any ideas?

On the CPU you use double precision (about 15 significant digits).
On the GPU you use single precision (about 7 significant digits), and the values are converted to double when the CPU prints them.

Use double, not float.

CPUs (x86 code that uses the x87 FPU in particular) tend to do floating point calculations in 80-bit 'extended' mode and keep the results in this intermediate format, so subsequent calculations use the 80-bit value.
On the GPU, single precision is 32 bit and double precision is 64 bit. When you chain many floating point operations you are therefore likely to get small differences, even if you choose to use double precision.

I would suggest reading this (and any of the references it cites):

[url]https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

Hi,
thanks to everybody for your answers. Now I get the same results using double (both on CPU and GPU). Values differ only after the 14th decimal digit.
Now I have another question. In my kernel I need to threshold the computed double output value, and for that I use this line of code:

if (output <= 0) pOutput[gid] = 1;   (pOutput is initialized to all zeros)

But I get 1 for outputs whose double value is positive, such as 0.02192208167984 or 0.00040051234362.
How is that possible?

Don't you have to write this instead?

if (output <= 0.0f) pOutput[gid] = 1.0f;