Why does OpenCL on Fermi give different results from OpenCL on an AMD CPU?

Hi, I have an optical flow code in OpenCL. I modified the code to run on a Fermi GPU and on an AMD CPU. However, when I ran the code on these two different platforms, the results were different.

Could anyone give me some hints as to what might be causing this?

Could it be because I used texture memory (2D images)?


How different are they? Could it be a difference in numerical error? Numerical error can be introduced by many different things.

AMD CPUs are fully IEEE compliant; Fermi GPUs are not, so computation errors and error propagation could lead to different results depending on your computations.

The code takes 100 ms to run on the Fermi card and 26627 ms on an AMD six-core CPU. The sequential code on the CPU takes only 800 ms. And I am only using single precision, not double, so I don't think there is a huge difference in numerical error.

?? AFAIK Fermi is IEEE 754-2008 compliant…

See for instance the whitepaper: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

I think the difference is because all x86 CPUs use 80-bit precision internally unless they use SSE instructions. This makes each operation slightly more accurate than on a Fermi, which can only use up to 64-bit precision.

Can you share the code for testing? I have an ATI 6870 and a Fermi card.

The time difference between AMD CPU OpenCL and the sequential version suggests that you are doing something wrong, i.e. causing too much work on the OpenCL side of things.

As other people said, how big is the difference and how is it expressed? Also, it may not only be precision and IEEE compliance, but also compiler optimizations causing code reordering (although the amount of reordering allowed is very limited starting with ANSI C).

The way code is split into actual threads is very different between NVIDIA's implementation and AMD's CPU implementation, and likewise between AMD on the CPU and on the GPU. Internal differences that can cause big errors are mostly scheduling related (i.e. bugs). Are you depending on warp-level synchronization somewhere?

The math functions are different for sure. I see a difference of 0.000000238419 with a kernel like this:

__kernel void foo(__global float *v1, __global float *v2, __global float *v3)
{
	int i = get_global_id(0);

	v3[i] = cos(v1[i]) + cos(v2[i]);
}

Absolute or relative? A 32-bit float has a relative accuracy of about 1e-7, so if that is compared to an average value of 0.5 or so, it's a bit large for an error, but understandable.

Note that cos is a library function and not an intrinsic function, so there are probably implementation differences between platforms.

float maxdiff = 0;

for (long i = 0; i < VECTOR_SIZE; i++)
{
	float delta = fabsf(resCPU[i] - resGPU[i]);

	if (delta > maxdiff)
		maxdiff = delta;
}

resCPU and resGPU are v3 from the CPU and GPU runs. maxdiff gets up to 0.000000238419.

The question is what the values inside the vector are, not how you compute the difference. Assuming that the results get close to one and that we are summing two errors, I'd say we are talking about a relative error somewhere between 1e-7 and 5e-7, that is, I think, 2 least significant bits (although I need to verify that). I would have preferred to see 1 least significant bit.

Another interesting test would be to compare both results to ones computed on the CPU in double precision (make sure to cast up to double precision for the comparison, not down to float). I wonder which is closer to the truth.