You may need to learn a few things about floats, but please excuse me if I mention something that is already obvious to you.

(a + b) + c and a + (b + c) usually give different results (float operations are non-associative), so when you compute something in parallel using a parallel reduction or a scan, the results *naturally* differ from doing the same in a sequential loop. a + b and b + a do give identical results though: float addition is commutative, just not associative.

There are tricks like Kahan summation to reduce the error when summing many elements; you may want to look into this.

Generally, when adding values of small magnitude to values of large magnitude, you incur a significant loss of precision. With large arrays, when you add the per-block results of the scan to the running total, at some point the intermediate result becomes large while the individual block results remain relatively small. This is where your large error may come from.

You will need a compute capability 1.3 device (GTX 260 or better) in order to do double precision on the GPU. On older compute 1.x devices you could try emulating double precision with two floats (also known as double-single precision). There is some code on these forums to do just that. It's quite slow, however.

And finally, a float is only precise to about 7 decimal digits, so when you have 5 significant digits before the decimal point, you can expect only about 2 digits after the decimal point to be accurate.