CUDA kernel converging to CPU double result, though G80 is single precision?

Hi -

I extended the sample SDK reduction code to use floats rather than ints. I am seeing some interesting behavior that I’m hoping someone here can clarify.

I tested the GPU results against CPU versions using float and double. For large arrays (~2^21 elements, i.e. over two million) the GPU result agrees more closely with the CPU double precision result than with the CPU single precision one.

This behavior threw me for a loop, since I thought the 8800 only supported single precision floating point.

Gfx card: 8800 GTX
OS: Windows XP
IDE: VisualStudio 2005

The CPU “single” precision and GPU single precision are not exactly the same. The CPU’s x87 floating point registers are 80 bits wide (unless you’re using SSE), so if the variable you’re accumulating into stays in a register and never gets written out to memory until you’re done adding all the numbers, the running sum is carried at more than 32 bits. I can see this causing a difference.

GPU:

32 bits + 32 bits = 32 bits

CPU:

80 bits + 32 bits = 80 bits
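To make the two lines above concrete, here is a minimal C++ sketch of a single accumulation step in each style. The function names are made up for illustration, and `long double` is only an assumption standing in for the 80-bit register (on some compilers it is just a 64-bit double, which is still wide enough for this example).

```cpp
#include <cstdio>

// "GPU-style" step: the accumulator is rounded back to a 32-bit float
// after the add. volatile forces the store, so no wider register is kept.
float add_narrow(float acc, float x) {
    volatile float r = acc + x;
    return r;
}

// "x87-style" step: the accumulator stays in a wider type.
// long double is an assumption standing in for the 80-bit register.
long double add_wide(long double acc, float x) {
    return acc + x;
}

// Hypothetical driver: add 1 to 2^24 in both styles and print the results.
void show_one_step() {
    const float big = 16777216.0f;  // 2^24, where float spacing becomes 2
    std::printf("narrow: %.1f\n", add_narrow(big, 1.0f));
    std::printf("wide:   %.1Lf\n", add_wide(big, 1.0f));
}
```

With IEEE 754 round-to-nearest, the narrow step leaves the accumulator at 16777216 (the added 1 is rounded away), while the wide step reaches 16777217. Over millions of elements, those dropped low-order bits are what separates the two sums.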

Also, the order of the additions is surely different, and floating point addition isn’t associative, so summing the same numbers in a different order can round to a different result.
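A small C++ sketch of the ordering effect, assuming IEEE 754 single precision with round-to-nearest. The function names are hypothetical; the `volatile` temporaries are only there to force each intermediate to be rounded to a 32-bit float.

```cpp
// Computes (a + b) + c with every intermediate rounded to float.
float sum_left_first(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

// Computes a + (b + c) with every intermediate rounded to float.
float sum_right_first(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

For a = 1e8f, b = -1e8f, c = 1.0f, the first grouping gives 1 and the second gives 0: floats near 1e8 are spaced 8 apart, so -1e8f + 1.0f rounds back to -1e8f. Swapping the two operands of a single add, on the other hand, never changes the result.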

Probably you’re just getting lucky that it happens to be closer to the double precision result.

If you want to add up that many numbers accurately in single precision, you should use something like the Kahan summation algorithm (http://en.wikipedia.org/wiki/Kahan_summation_algorithm) or multiple buckets (several partial sums that are combined at the end).

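Post your CPU single precision implementation. If you’re just adding the floats to a single precision accumulator in a loop, you will not get an accurate result for that many elements: once the running sum is much larger than the values being added, their low-order bits get rounded away.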

Hi Jim,

Yup, that’s exactly what I’m doing:

    float *p = new float[ nSize ];
    double dSumCPU = 0;
    float fSumCPU = 0;
    for( int i=0;i<nSize;i++ )
    {
        p[i] = i;
        fSumCPU += p[i];
        dSumCPU += p[i];
    }

What is the correct code?

Thanks very much!

Daren

Use the double precision accumulator. Or use the Kahan algorithm eelsen mentioned.
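For reference, a minimal sketch of the Kahan option in C++. This is one common way to write it, not necessarily what anyone in this thread used; `kahan_sum` and `make_ramp` are hypothetical names, and `make_ramp` just reproduces the p[i] = i data from the loop above.

```cpp
#include <cstddef>
#include <vector>

// Compensated (Kahan) summation with a single precision accumulator.
float kahan_sum(const std::vector<float>& x) {
    float sum = 0.0f;  // running total
    float c = 0.0f;    // compensation: rounding error carried from the last add
    for (std::size_t i = 0; i < x.size(); ++i) {
        float y = x[i] - c;  // correct the next term by the carried error
        float t = sum + y;   // low-order bits of y may be lost here
        c = (t - sum) - y;   // recover what was lost (zero only in exact math)
        sum = t;
    }
    return sum;
}

// Hypothetical helper: fills an array with 0, 1, 2, ..., n-1 as floats.
std::vector<float> make_ramp(std::size_t n) {
    std::vector<float> p(n);
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = static_cast<float>(i);
    }
    return p;
}
```

Calling `kahan_sum(make_ramp(1 << 21))` should land within about one float rounding step of the exact total n(n-1)/2, as long as the compiler is not allowed to reassociate floating point expressions.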

Thanks Jim and eelsen! I used the Kahan method and now the results agree to 1e-7.

Just wondering: are you sure the compiler didn’t optimize the Kahan sum back to an ordinary sum?
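One cheap way to check is to feed the Kahan loop an input where an ordinary float sum is known to fail. The sketch below (hypothetical function names, assuming IEEE 754 floats with round-to-nearest and SSE-style strict float evaluation) adds 1.0f 2^25 times. A plain float accumulator stalls at 2^24 = 16777216, because 16777216 + 1 rounds back to 16777216, whereas an intact Kahan loop reaches 2^25 exactly.

```cpp
// Kahan sum of n copies of 1.0f, written without any volatile guards so that
// it is exposed to the same optimizations as ordinary code.
float kahan_sum_of_ones(long n) {
    float sum = 0.0f;
    float c = 0.0f;  // carried rounding error
    for (long i = 0; i < n; ++i) {
        float y = 1.0f - c;
        float t = sum + y;
        c = (t - sum) - y;  // a reassociating compiler may fold this to zero
        sum = t;
    }
    return sum;
}

// Returns true if the compensation survived compilation.
bool kahan_compensation_intact() {
    return kahan_sum_of_ones(1L << 25) == 33554432.0f;  // 2^25
}
```

If this returns false, the usual suspects are options that let the compiler reassociate floating point math, such as `-ffast-math` on GCC/Clang or `/fp:fast` on Visual C++.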