“A man with one watch always knows what time it is. A man with two watches can never be sure.”
In other words, given two sets of results with small differences, it is impossible to know which results are more correct (say, accuracy compared to a mathematical result with infinite precision) without detailed analysis. Generally speaking, one should not expect results of floating-point computation to match bit-wise across any two platforms (where platform comprises hardware and software elements).
The properties of basic floating-point operations are mandated by the IEEE-754 standard, which is adhered to by both common CPU architectures and NVIDIA GPUs. Beyond that, much depends on libraries, language bindings, and compiler optimizations.
For example, the Intel Fortran compiler comes with a math library that lets users choose the accuracy of transcendental functions. It has three profiles: HA = high accuracy (errors < 1 ulp), LA = low accuracy (errors < 4 ulp), EP = enhanced performance (up to half the bits may be wrong). CUDA offers just one math library, with accuracy generally between the HA and LA profiles of Intel's library, depending on the function. There may also be differences based on which functions each language offers. E.g. the cbrt() function in C++ (and thus CUDA) is pretty much always more accurate than the Fortran-style computation x**(1D0/3D0).
Fortran allows much more aggressive re-association of floating-point arithmetic than is performed by the CUDA toolchain. The Fortran rule is basically that all transformations are allowed that are mathematically equivalent. The problem with that is that floating-point arithmetic is not math: basic arithmetic is not associative, for example. CUDA is quite conservative and other than for FMA contraction (which can be disabled, as you noted), pretty much evaluates floating-point expressions as written.
The Fortran compiler may autovectorize code using SIMD instructions, leading to numerical differences through re-association of operations, for example in a reduction such as a dot product. Your code may be using a parallelized reduction explicitly in CUDA; we don't know.
Setting the Fortran compiler to the strictest possible floating-point settings (e.g. -fp-model=strict) and turning off FMA contraction (-no-fma), while using --fmad=false with CUDA, may minimize differences. The reason for turning off FMA contraction is that it can be applied differently to the same expression. Consider a * b + c * d: should it become fma (a, b, c * d) or fma (c, d, a * b)?
But that doesn’t tell us which platform delivers more accurate results, or whether there is a difference in overall accuracy at all. If one platform deviates +1% from the “true” result and the other -1% from the same, there will be observable differences but the accuracy of both is identical.
You would have to measure that in a way appropriate to the use case, e.g. by computing norms or residuals, or by comparison with higher-precision references. If tiny local differences in computation lead to large errors in the final result, the problem might be ill-conditioned (check the condition number), or a numerically sub-optimal algorithm may be in use.
In most circumstances FMA contraction incrementally improves accuracy, due to the reduced number of roundings and the limited protection it provides against subtractive cancellation.
Not sure what this means. If the output of your CUDA implementation differs from run to run, that may indicate non-determinism. Possible causes: (1) a race condition; (2) uninitialized data or out-of-bounds accesses; (3) use of atomic floating-point operations, whose accumulation order varies between runs. You can use Compute Sanitizer to check for (1) and (2).