You may need to learn a few things about floats, but please excuse me if I mention something that is already obvious to you.

(a + b) + c and a + (b + c) usually give different results (float operations are non-associative), so when you compute something in parallel using a parallel reduction or a scan, the results *naturally* differ from doing the same in a sequential loop. a + b and b + a do give identical results though: float addition is commutative, just not associative.

There are tricks like Kahan summation to reduce the error when summing many elements; you may want to look into this.

Generally, when adding values of small magnitude to values of large magnitude, you incur a significant loss of precision. With large arrays, when you add the per-block results of the scan to the running total, at some point the intermediate result becomes large while the individual block results remain relatively small. This is where your large error may come from.

You will need a compute capability 1.3 device (GTX 260 or better) in order to do double precision on the GPU. On older compute 1.x devices you could try emulating double precision with two floats (also known as double-single precision). There is some code on these forums to do just that. It's quite slow, however.

And finally, a float is only precise to about 7 decimal digits, so when you have 5 significant digits before the decimal point, you can expect only about 2 digits after the decimal point to be accurate.