The code takes 100ms to run on a Fermi card and 26627ms on an AMD six-core CPU. The sequential code on the CPU takes only 800ms. And I am only using single precision, not double, so I don’t think there should be a huge difference in numerical error.
I think the difference is because x86 CPUs compute with 80-bit extended precision internally on the x87 FPU, unless SSE instructions are used. This makes each operation slightly more accurate than on a Fermi, which uses at most 64-bit precision.
The time difference between AMD CPU OpenCL and the sequential version suggests that you are doing something wrong, i.e. creating too much overhead on the OpenCL side of things.
As others have said, how big is the difference and how is it expressed? Also, it may not only be precision and IEEE compliance, but also compiler optimizations causing code reordering (although the amount of reordering allowed has been very limited since ANSI C).
The way code is split into actual threads is very different between NVIDIA and AMD on the CPU, and again between AMD on the CPU and on the GPU. Internal differences that can cause big errors are mostly scheduling related (i.e. bugs). Are you depending on warp-level synchronization somewhere?
The question is what the values inside the vector are, not how you compute the difference. Assuming the result is close to one and we are summing two errors, I’d guess we are talking about a relative error somewhere between 1e-7 and 5e-7, which is, I think, about 2 least significant bits (although I’d need to verify that). I would have preferred to see 1 least significant bit.
Another interesting test would be to compare both results against a CPU run in double precision (make sure to cast up to double precision for the comparison, not down to float). I wonder which is closer to the truth.