Well, the general issue is that floating point numbers are not real numbers, and so the order of operations matters in cases where they don’t for real numbers. That and the “same” number number can sometimes not be the same if calculated in a different context.

Given that you are subtracting 636389.79492192005 from itself and getting an error of -0.0175781250, that is a fractional error of 2.7e-8, which is basically the precision of a float. You should not expect agreement better than this in general.

I am a little curious why the two cases result in a different instruction sequence, but we would need to see the generated PTX to figure that out. (I suspect the details will boil down to the default rounding conventions for positive vs. negative numbers.)

OK, that makes sense. I’m trying to track down differences in CPU vs GPU code and the first point I’ve found where they diverge is at this operation. The CPU code gives zero and the GPU gives non-zero, both using floats and the same piece of code. I’m not sure at this point if it is the root cause of bigger problems later on, and if I need to go to doubles for this particular numerical scheme.

But if the rounding/ordering of these operations is different then I can see why the CPU and GPU (and the other GPU variation) would give different answers in the 8th SF.

The most likely cause for the difference between the two instruction sequences is MAD/FMA contraction performed by the compiler as an optimization. You can check the generated machine code with cuobjdump to verify / falsify this hypothesis. As both Open64 and PTXAS can perform this optimization it is insufficient to just look at the intermediate PTX.

Basically, I suspect that

mfl = vlplUAlr + vrmlUAcl

is translated into

mfl = {mad|fma} (vlpl, UAlr, vrml*UAcl)

If so, for both MAD and FMA the product vlplUAlr is formed in a different way than the product vrmlUAcl, which explains why their sum is not zero. For MAD [only applies to single precision] the product vlplUAlr would be truncated to single precision prior to the addition. With FMA [single precision on Fermi, and double precision] the product vlplUAlr is computed to twice the native precision, and enters the addition in that format. By contrast vrml*UAcl is rounded to native precision prior to the addition.

You can use intrinsics to locally disable MAD/FMA contraction, by re-writing the code as follows:

In general, comparing CPU and GPU results directly is not a good way of establishing whether the GPU results are correct (by some definition of correct). I would recommend comparison with a higher precision reference to make that call. I use that approach extensively in my own work. I find that often, the GPU results are more accurate, while different from the CPU results.

NVIDIA has a new whitepaper out that addresses some of the discrepancies in floating-point computation between CPU and GPU which you may find useful. The author welcomes feedback. Get it here:

To give programmers more control over FMA / FMAD merging, the CUDA 4.1 compiler implements a new flag -fmad={true|false}, which allows control at compilation unit granularity, which can be more convenient than the per-operation control using intrinsics, that I described above.

With -fmad=true the compiler is allowed to merge a floating-point add with a following floating-point multiply into a MAD (sm_1x single-precision) or FMA (sm_13 and sm_2x double precision, sm_2x single precision) operation. For reasons of backwards compatibility, this is also the default.

With -fmad=false the compiler is prohibited from merging a floating-point add with a following floating-point multiply into a MAD (sm_1x single-precision) or FMA (sm_13 and sm_2x double precision, sm_2x single precision) operation. This switch setting does not interfere with any FMAs that have been coded explicitly either via the fma() and fmaf() math functions, or the __fma_r{n,z,u,d} and __fmaf_r{n,z,u,d} intrinsics.