I extended the sample SDK reduction code to use floats rather than ints. I am seeing some interesting behavior that I’m hoping someone here can clarify.
I tested the GPU results against a CPU version using both float and double. For large arrays (~2^21, i.e. over two million elements) the GPU result agrees more closely with the CPU double-precision result than with the single-precision float result.
This behavior threw me for a loop, since I thought the 8800 only supported single-precision floating point?
Gfx card: 8800 GTX
OS: Windows XP
IDE: VisualStudio 2005
CPU “single” precision and GPU single precision are not exactly the same. The x87 floating-point registers on the CPU are 80 bits wide (unless you’re using SSE), so if the variable you’re accumulating into stays in a register and never gets written out to memory until all the adds are done, the intermediate sums are effectively carried in extended precision. I can see that causing a difference.
GPU:
32 bits + 32 bits = 32 bits
CPU:
80 bits + 32 bits = 80 bits
And the ordering is surely different too, and floating-point addition isn’t associative, so changing the order of the adds changes the rounding.
Probably you’re also just getting lucky that it happens to land closer to the double-precision result (although the tree-shaped order a GPU reduction uses does tend to accumulate less rounding error than one long sequential chain of adds).
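To illustrate, here’s a minimal CPU-side sketch (not from the original post; the data values and function names are made up) that sums the same array three ways: a plain sequential float accumulator, a pairwise (tree-ordered) float sum roughly like a GPU reduction, and a double accumulator as a reference. The point is just that ordering and accumulator width both move the single-precision result:

#include <cstddef>
#include <cstdio>
#include <vector>

// Plain sequential sum into a 32-bit accumulator: one long chain of adds.
float sum_sequential(const float* v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += v[i];
    return acc;
}

// Pairwise (tree-ordered) sum, roughly the order a GPU reduction uses.
float sum_pairwise(const float* v, size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return v[0];
    size_t half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}

// Double-precision accumulator as the reference.
double sum_double(const float* v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) acc += v[i];
    return acc;
}

int main() {
    std::vector<float> data(1 << 21, 0.1f);   // ~2 million elements, as in the post
    std::printf("sequential float: %.3f\n", sum_sequential(&data[0], data.size()));
    std::printf("pairwise   float: %.3f\n", sum_pairwise(&data[0], data.size()));
    std::printf("double reference: %.3f\n", sum_double(&data[0], data.size()));
    return 0;
}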
If you want to add up that many numbers accurately in single precision, you should use something like the Kahan summation algorithm (see the Wikipedia article on Kahan summation) or split the sum across multiple partial accumulators (buckets).
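A minimal sketch of Kahan (compensated) summation for a plain float array; the function name and signature here are just for illustration:

#include <cstddef>

float kahan_sum(const float* v, size_t n) {
    float sum = 0.0f;
    float c   = 0.0f;                 // running compensation for lost low-order bits
    for (size_t i = 0; i < n; ++i) {
        float y = v[i] - c;           // apply the correction carried from the last step
        float t = sum + y;            // big + small: low-order bits of y get rounded off here
        c = (t - sum) - y;            // recover the rounded-off part (algebraically zero)
        sum = t;
    }
    return sum;
}

One caveat: with aggressive floating-point optimization (fast-math style flags) the compiler may simplify the compensation term away, so it’s worth checking the build settings before concluding it doesn’t help.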
Post your CPU single-precision implementation. If you’re just adding the floats into a single-precision accumulator in a loop, you will not get an accurate result.