Luobin had found a potential problem with floating-point precision: when summing up a collection of floating-point numbers, the result obtained on the GPU differs from the result obtained on the CPU.
It probably does, unless you know what you're doing in both cases. Standard floating-point arithmetic on the CPU (the x87 path) actually carries intermediate results in 80-bit extended precision. Also, if you didn't compile with -arch sm_13 and use a GT200 to run your tests, you won't get double precision on the GPU at all.
Basically, the two sums can't be expected to match exactly unless you work under very specific constraints.
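To make that concrete, here is a small host-only C sketch (not from the original thread; the array contents and function names are made up for illustration). It shows that the same array of 32-bit floats can sum to visibly different values depending on the accumulator precision and the order of the additions: a GPU reduction typically adds the values pairwise in 32-bit floats, an x87 CPU build may accumulate sequentially in 80-bit extended precision, and floating-point addition is not associative, so the rounding differs.

/* Illustrative only: the data and helper names are invented for this demo.
 * Compares a sequential 32-bit sum, a pairwise (tree-style) 32-bit sum
 * roughly mimicking a GPU reduction, and a sequential sum with a wider
 * accumulator, similar in spirit to x87 extended-precision intermediates. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Plain sequential sum with a 32-bit accumulator. */
static float sum_sequential(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Recursive pairwise sum, roughly the order a GPU tree reduction uses. */
static float sum_pairwise(const float *a, int n)
{
    if (n == 1)
        return a[0];
    int half = n / 2;
    return sum_pairwise(a, half) + sum_pairwise(a + half, n - half);
}

/* Sequential sum with a wider accumulator, so individual additions
 * lose far less precision before the final result is rounded. */
static double sum_wide_acc(const float *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    float *a = malloc(N * sizeof *a);
    if (!a)
        return 1;

    /* Alternating signs and growing magnitudes make the rounding
     * differences easy to see in the printed digits. */
    for (int i = 0; i < N; ++i)
        a[i] = (i % 2 ? 1.0f : -1.0f) * (1.0f + (float)i / 7.0f);

    printf("sequential, float accumulator : %.8f\n", sum_sequential(a, N));
    printf("pairwise,   float accumulator : %.8f\n", sum_pairwise(a, N));
    printf("sequential, double accumulator: %.8f\n", sum_wide_acc(a, N));

    free(a);
    return 0;
}

Built with an ordinary C compiler, the three printed sums generally disagree in the low-order digits: the sequential 32-bit sum drifts furthest, while the pairwise sum and the wide-accumulator sum land much closer to each other. Which answer you get depends on both the accumulator width and the order of the additions, which is exactly why a CPU sum and a GPU sum of the same data rarely match bit for bit.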