The most likely cause for the difference between the two instruction sequences is MAD/FMA contraction, performed by the compiler as an optimization. You can check the generated machine code with cuobjdump to confirm or refute this hypothesis. Since both Open64 and PTXAS can perform this contraction, it is not sufficient to look only at the intermediate PTX.
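For example, you could disassemble the embedded machine code and look for FFMA (single precision, Fermi) or DFMA (double precision) instructions where the source has a separate multiply and add [my_app is a placeholder for your executable; exact option spelling may vary between toolkit versions]:

cuobjdump -sass my_app | grep -i fma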
Basically, I suspect that
mfl = vlpl * UAlr + vrml * UAcl
is translated into
mfl = {mad|fma} (vlpl, UAlr, vrml*UAcl)
If so, the product vlpl*UAlr is formed differently than the product vrml*UAcl, for both MAD and FMA, which explains why their sum is not exactly zero. With MAD [single precision only], the product vlpl*UAlr is truncated to single precision prior to the addition. With FMA [single precision on Fermi, and double precision], the product vlpl*UAlr is computed to twice the native precision and enters the addition unrounded in that format. By contrast, vrml*UAcl is rounded to native precision prior to the addition in either case.
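To see the effect in isolation, here is a minimal sketch [operand values are hypothetical; single-precision FMA requires a Fermi-class GPU] that uses the FMA intrinsic to recover the rounding error of a separately rounded product, which is exactly the residue left behind by the contraction:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void residual(float a, float b, float *r)
{
    // __fmul_rn(a, b) rounds the product a*b to single precision;
    // __fmaf_rn(a, b, c) computes a*b + c with a single rounding at the
    // end, so the result below is the rounding error of the multiply.
    *r = __fmaf_rn(a, b, -__fmul_rn(a, b));
}

int main(void)
{
    float h_r, *d_r;
    cudaMalloc(&d_r, sizeof(float));
    residual<<<1, 1>>>(1.1f, 3.3f, d_r);
    cudaMemcpy(&h_r, d_r, sizeof(float), cudaMemcpyDeviceToHost);
    printf("rounding error of a*b: %.9e\n", h_r); // generally nonzero
    cudaFree(d_r);
    return 0;
}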
You can use intrinsics to locally disable MAD/FMA contraction by rewriting the code as follows:
mfl = __fmul_rn(vlpl, UAlr) + __fmul_rn(vrml, UAcl); // single precision
mfl = __dmul_rn(vlpl, UAlr) + __dmul_rn(vrml, UAcl); // double precision
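As a minimal sketch [the kernel signature is hypothetical; only the expressions come from your code], the two formulations would compare like this:

__global__ void flux(float vlpl, float UAlr, float vrml, float UAcl,
                     float *contracted, float *uncontracted)
{
    // The compiler is free to contract this into a single MAD/FMA:
    *contracted   = vlpl * UAlr + vrml * UAcl;
    // The intrinsics force two separately rounded multiplies followed
    // by an ordinary add, matching the usual CPU evaluation:
    *uncontracted = __fmul_rn(vlpl, UAlr) + __fmul_rn(vrml, UAcl);
}

With vrml == -vlpl and UAcl == UAlr, the uncontracted result is exactly zero, while the contracted result is, in general, the small rounding residue discussed above.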
In general, comparing CPU and GPU results directly is not a good way of establishing whether the GPU results are correct (by some definition of correct). I would recommend comparing against a higher-precision reference to make that call. I use this approach extensively in my own work, and I often find that the GPU results, while different from the CPU results, are the more accurate ones.
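As a minimal sketch of that methodology [operand values are hypothetical; fmaf() is used on the host to mimic the contracted GPU evaluation], compute a double-precision reference and measure each single-precision result against it:

#include <math.h>
#include <stdio.h>

// Relative error of a single-precision result against a double reference.
static double rel_error(float result, double reference)
{
    return fabs(((double)result - reference) / reference);
}

int main(void)
{
    float vlpl = 1.1f, UAlr = 3.3f, vrml = 2.2f, UAcl = 4.4f;
    // Products of floats are exact in double, so this is a good reference.
    double ref = (double)vlpl * UAlr + (double)vrml * UAcl;
    // Typical CPU evaluation: two rounded multiplies, then a rounded add.
    float cpu = vlpl * UAlr + vrml * UAcl;
    // Mimics the contracted GPU evaluation: one rounded multiply, then
    // fmaf() performs the remaining multiply-add with a single rounding.
    float gpu = fmaf(vlpl, UAlr, vrml * UAcl);
    printf("cpu error = %.3e   gpu error = %.3e\n",
           rel_error(cpu, ref), rel_error(gpu, ref));
    return 0;
}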
NVIDIA has a new whitepaper out that addresses some of the discrepancies in floating-point computation between CPUs and GPUs, which you may find useful. The author welcomes feedback. Get it here:
http://developer.nvidia.com/content/everything-you-ever-wanted-know-about-floating-point-were-afraid-ask