Floating-point operation differences between CPU and GPU

Hello, I have an OpenCL kernel that implements a dot product between two float arrays.
The first is an array of size*n elements and the second is an array of n elements. This is the code:

__kernel
void evaluate_product(__global const float *pFirstArray, const int n,
                      __global const float *pSecondArray, __global float *pOutput)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size) {
        /* Accumulate the dot product for this work-item. */
        float output = 0;
        for (int k = 0; k < n; k++)
            output += pFirstArray[gid + k*size] * pSecondArray[k];
        pOutput[gid] = output;
    }
}

If I execute the same operations on the CPU, I get different results, mostly after the 6th or 7th decimal digit. Why this strange behaviour? The Khronos OpenCL specification (v1.2) says that x+y and x*y are correctly rounded and IEEE 754 compliant. Any ideas?

Hi!

I think this is because the CPU and GPU may order the operations differently. Since floating-point addition is not associative, the results can differ.
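
To make the non-associativity concrete, here is a minimal C sketch that adds the same three floats in two different orders. The function names `sum_forward` and `sum_backward` are made up for this illustration, and the `volatile` accumulator is only there so that every partial sum is rounded to float.

```c
#include <assert.h>
#include <stddef.h>

/* Add the elements from first to last. */
float sum_forward(const float *v, size_t n)
{
    assert(v != NULL);
    volatile float acc = 0.0f;  /* volatile: round to float after every add */
    for (size_t i = 0; i < n; i++)
        acc = acc + v[i];
    return acc;
}

/* Add the same elements from last to first. */
float sum_backward(const float *v, size_t n)
{
    assert(v != NULL);
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; i--)
        acc = acc + v[i - 1];
    return acc;
}
```

With the input {1.0f, 1e8f, -1e8f}, the forward sum returns 0.0f because the 1 is lost when it is added to 1e8, while the backward sum cancels the two large values first and returns 1.0f.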

I hope this answers your question.

Check the rounding when a + operation follows a * operation. The CPU rounds twice (once after the * and again after the +), but the GPU rounds only once because it fuses the two into a single multiply-add (FMA).
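
The effect of one rounding versus two can be sketched in plain C. This is an illustration under assumptions, not what any particular device does: the function names are hypothetical, and the single-rounding version uses a double intermediate as a stand-in for a hardware FMA (the real fused operation in C99 is `fmaf` from `<math.h>`). The stand-in agrees with a true FMA for the values used here, but double rounding can make it differ in rare cases.

```c
#include <assert.h>

/* Two roundings: the product is rounded to float, then the sum is. */
float mul_add_separate(float a, float b, float c)
{
    volatile float product = a * b;   /* volatile: force rounding to float */
    volatile float sum = product + c;
    return sum;
}

/* One rounding (approximation): the product of two floats is exact in
   double, so only the final conversion back to float discards bits here. */
float mul_add_single_rounding(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}

/* Worked example: a*a = 1 + 2^-11 + 2^-24, which is not a float. */
void check_worked_example(void)
{
    const float a = 1.0f + 1.0f / 4096.0f;      /* 1 + 2^-12 */
    const float c = -(1.0f + 1.0f / 2048.0f);   /* -(1 + 2^-11) */
    /* Rounding the product first drops the 2^-24 term, so the sum is 0. */
    assert(mul_add_separate(a, a, c) == 0.0f);
    /* Keeping the exact product leaves 2^-24 after the cancellation. */
    assert(mul_add_single_rounding(a, a, c) == 1.0f / 16777216.0f);
}
```

The difference is about 6e-8 on values close to 1, which is the same order as the discrepancy described in the question.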

If you want the same behaviour in your code you have to use

output = _dadd( output , pFirstArray[gid + k*size]*pSecondArray[k]);
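
Independently of the intrinsic above, OpenCL C also defines `#pragma OPENCL FP_CONTRACT OFF`, which, if I read the specification correctly, asks the compiler not to contract a*b+c into a fused operation. To find out which rounding the GPU actually used, one option is a CPU reference that mirrors the kernel's indexing and can be run in both modes. The sketch below is only that: `dot_reference` and its `fused` flag are hypothetical names, and the fused mode again uses a double intermediate as an approximation of FMA rather than a real fused instruction.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for one work-item of the kernel:
   output = sum over k of first[gid + k*size] * second[k].
   fused != 0: one rounding per step (double intermediate, approximates FMA).
   fused == 0: product rounded to float, then the sum rounded to float. */
float dot_reference(const float *first, const float *second,
                    size_t gid, size_t size, size_t n, int fused)
{
    assert(first != NULL && second != NULL && gid < size);
    float output = 0.0f;
    for (size_t k = 0; k < n; k++) {
        const float a = first[gid + k * size];
        const float b = second[k];
        if (fused) {
            output = (float)((double)a * (double)b + (double)output);
        } else {
            volatile float product = a * b;        /* first rounding */
            volatile float sum = product + output; /* second rounding */
            output = sum;
        }
    }
    return output;
}
```

Comparing the GPU output against both modes for a few work-items should show whether the remaining difference comes from fusing or from something else, such as a different summation order.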

Good luck