Same code, same input, different results

I am hunting down some numerical issue with my application between two commits. I narrow it down to the following function:

__device__ __forceinline__ Vector2f
rotate(const Vector2f& v1, const Vector2f& v2) {
  Vector2f out {v2.x() * v1.x() + v2.y() * v1.y(), -v2.y() * v1.x() + v2.x() * v1.y()};
  return out;

where the input/output types are Eigen 2D vector.

To debug the issue, I added printf("...%14.6a...") in this function to print all the in/out values and found:

At commit 1:

# (v1.x(), v1.y(); v2.x(), v2.y()) -> (out.x(), out.y())
( 0x1.02aa4ep+2,-0x1.589246p-9; 0x1.ffffdep-1,0x1.46509cp-10) -> ( 0x1.02aa2ep+2,-0x1.f5ff6ap-8)

At commit 2:

# (v1.x(), v1.y(); v2.x(), v2.y()) -> (out.x(), out.y())
( 0x1.02aa4ep+2,-0x1.589246p-9; 0x1.ffffdep-1,0x1.46509cp-10) -> ( 0x1.02aa30p+2,-0x1.f5ff6ap-8)

The four input floats are bitwise identical, but one of the output is different by the least significant bit. There is not change to this function (inlined, in header file). --fmad is not specified anywhere in nvcc options.

What might possibly be the underlying issue here? How can code changes else where affects the behavior of this function?

Note that not all call to this function resulted in different result between two commits, another call that was logged had:

( 0x1.e7f416p+2, 0x1.8df0bcp-3; 0x1.ffffd8p-1,-0x1.5ea958p-10) -> ( 0x1.e7efaep+2, 0x1.a2d3b0p-3)

on both commits.

Can you confirm that both instances

(1) Use exactly the same compiler version, and do not rely on JIT compilation.
(2) Target the exact same GPU architecture.
(3) Pass exactly the same flags to nvcc.

Since no --fmad flag is specified, the compiler defaults to FMA contraction, i.e. -fmad=true. FMA contraction combines an FMUL with a dependent FADD into FMA as the compiler sees fit. In particular a * b + c * d may be contracted as fma (a, b, c*d) or fma (c, d, a*b). Changes in the way contraction takes place could result in numerical differences. Since the function is inlined, the expression tree the compiler looks at when applying the FMA contraction may look differently depending on changes to the calling context(s). The way to isolate the computation from compiler magic is to hardcode fma() calls in the code.

Specifically for the stable and accurate evaluation of sum or difference of two products I would recommend the approach I described on Stackoverflow (if performance permits):

Accurate floating-point computation of the sum and difference of two products

By using fma() directly, it should eliminate interference by the compiler’s FMA contraction optimization.

I used the data from the OP to examine how the two FMA contraction variants would affect the result. Here is the output of my little test program:

fmaf (v2x, v1x, v2y * v1y) =   0x1.02aa30p+2
fmaf (v2y, v1y, v2x * v1x) =   0x1.02aa2ep+2

fmaf (-v2y, v1x, v2x * v1y) =  -0x1.f5ff6ap-8
fmaf (v2x, v1y, -v2y * v1x) =  -0x1.f5ff6ap-8

So are we saying that expression like

a * b + c * d

nvcc, with --fmad=true, is free to choose which side to contract and so the result is numerically unstable?

What about maybe not explicit expression like this, but some chain of inlined * and + that is not confined to a single function?

Yes, that is my understanding. How the compiler re-associates a flaoting-point expression will likely depend on the internal representation of the expression tree. Because functions often get inlined, the expression tree can comprise operations from more than one function, and because the CUDA compiler inlines functions aggressively, this will often be the case.

Things you can do:

(1) Turn off FMA-contraction by specifying --fmad=false. The negative impact of this is a likely loss of performance, and a likely loss of overall accuracy.

(2) Explicitly specify the operations you want. (a) You can use the C++ standard math functions fmaf() and fma() in the way most beneficial to overall accuracy. In non-trivial expression trees it is often not obvious how to combine all operands with FMAs for best accuracy, some numerical reasoning and/or experimentation may be necessary. For example, when computing ab+cd, if |a*b| is greater than |c*d|, we want fma (a, b, c*d) for best accuracy. (b) With --fmad=true, CUDA lets you opt out of FMA contraction locally by using device function intrinsics e.g. __fmul_rn(), __fadd_rn(). The compiler will not consider such multiplications and additions contractable.

In my own work, I find that option (1) is generally unattractive, that option (2a) is frequently beneficial and that option (2b) is rarely needed, but can be crucial to preserve specific numerical properties.

Generally speaking, any programmer involved with floating-point computation on any modern high-performance processing platform should become intimately familiar with FMA so they can maximally exploit the properties of this operation. Think of it as another species of basic arithmetic.

I would not look at it this way. Numerically unstable to me is a property inherent in a particular computations, or sequence of computations.

Various modern compilers allow re-association of floating-point operations by default, and since floating-point arithmetic is not associative, this can lead to numerical result different from following the serial computation as written. For example, parallelizing a sum through the use of multiple accumulators stored in a single SIMD register will often do that. On average, this is great for performance and even beneficial to accuracy, but it will generally not match the result from serial computation bitwise.

The details can even vary from compiler version to compiler version, and depending on compiler switches (e.g. compiling for AVX2 vs AVX512) and one cannot know (short of perusing the compiler source code) the details and amount of re-association happening. Most compilers also have switches to turn-off re-association of floating-point operations. Sometimes this is a single switch, sometimes it is a whole assortment of switches (gcc is in the latter category for example, the Intel compiler icc is in the former category).

When it comes to floating-point computation the CUDA compiler is more conservative than most compilers in order to preserve programmer sanity. As far as I am aware, the only re-association it will perform is FMA contraction (because that is so important to performance), and that can be turned off.