Luobin found a potential problem with floating-point precision: when summing a collection of floating-point numbers, the result obtained on the GPU differs from the result obtained on the CPU.
It probably will differ unless you know exactly what you're doing in both cases. If you use the default x87 floating-point arithmetic on the CPU, intermediate results are actually kept in 80-bit extended precision. Also, if you didn't compile with -arch sm_13 and use a GT200-class card to run your tests, you won't get double precision on the GPU at all.
Basically, the results can't be expected to match exactly unless you work under very specific constraints.
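One concrete reason: floating-point addition is not associative, so merely changing the order of the additions (and a GPU's parallel reduction adds in a different order than a sequential CPU loop) changes the sum. Here is a minimal host-only sketch with made-up values, comparing a sequential sum against a pairwise, tree-style sum of the same array:

```cpp
// Host-only sketch (no CUDA required): the same single-precision data summed
// in two different orders gives two different results. Values are hypothetical.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // One large value followed by roughly a million small ones.
    std::vector<float> v(1 << 20, 0.001f);
    v[0] = 1.0e8f;

    // Sequential sum, the order a simple CPU loop would use.
    float seq = 0.0f;
    for (float x : v) seq += x;

    // Pairwise (tree) sum, closer to the order of a parallel reduction.
    // The size is a power of two, so n stays even at every level.
    std::vector<float> tmp = v;
    for (size_t n = tmp.size(); n > 1; n /= 2)
        for (size_t i = 0; i < n / 2; ++i)
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];

    printf("sequential sum: %.1f\n", seq);     // the 0.001 terms are absorbed by 1e8
    printf("pairwise sum:   %.1f\n", tmp[0]);  // most of their contribution survives
    return 0;
}
```

A CUDA reduction kernel behaves, in effect, like the pairwise version, which is why it disagrees with a straightforward CPU loop even when both use 32-bit floats.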
What are the actual magnitudes of the differences? If the relative error abs(calculated - expected) / abs(expected) is at most 1e-5, then there is nothing wrong.
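In code, that test is just a relative-error comparison against a trusted reference; the helper name, tolerance, and sample values below are illustrative:

```cpp
// Sketch of the relative-error check described above. The tolerance of 1e-5
// and the sample values are illustrative only.
#include <cmath>
#include <cstdio>

bool close_enough(float calculated, float expected, float rel_tol = 1e-5f) {
    return std::fabs(calculated - expected) / std::fabs(expected) <= rel_tol;
}

int main() {
    printf("%d\n", close_enough(1.000001f, 1.0f));  // prints 1: relative error ~1e-6
    printf("%d\n", close_enough(1.001f,    1.0f));  // prints 0: relative error ~1e-3
    return 0;
}
```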
If it is higher, then you are probably adding values of vastly different magnitudes, e.g. 10^12 + 1 = 10^12 in single-precision floating point. This resource has everything you need to know about why that happens (and more): http://docs.sun.com/source/806-3568/ncg_goldberg.html
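A quick single-precision demonstration of that example:

```cpp
// Sketch of the magnitude problem: near 1e12, adjacent single-precision
// values are 65536 apart, so adding 1 cannot change the result.
#include <cstdio>

int main() {
    float big = 1.0e12f;
    float sum = big + 1.0f;
    printf("1e12f + 1.0f == 1e12f ? %s\n", sum == big ? "yes" : "no");  // prints "yes"
    return 0;
}
```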