Is CUDA's implementation of 64-bit floating-point precision in practice subpar compared to that of Fortran?

Hello all,

I have been developing a numerical model in CUDA that was originally developed in Fortran. I am using the Fortran version to verify the CUDA version by comparing their numerical outputs. The outputs should be the same, but when I compare them there are small differences.

These differences may appear negligible at first, on the order of 10^-14, but the outputs of the CUDA version diverge significantly from those of the Fortran version after 10 to 100 iterations of the model.

I have compared the CUDA and Fortran versions line by line, and in both versions I am performing the same computations using the same values. However, I am getting small differences in the outputs of the computations. The outputs of the Fortran version are exact and stay exact, but the outputs of the CUDA version fluctuate.

I would really appreciate any advice on how to make the outputs of the CUDA version match those of the Fortran version. Should I be careful about the order of operations (add, subtract, multiply, sqrt, pow, etc.)? Should I compile with certain flags? I have already tried both --fmad=false and --fmad=true.

I am obviously using double in my CUDA code, and the Fortran code uses double precision.

I am running CUDA 11.5 on a Windows machine with a GTX 1050.

Lastly, I am compiling the Fortran code using the Intel Fortran compiler.

Many thanks, Alovya

“A man with one watch always knows what time it is. A man with two watches can never be sure.”

In other words, given two sets of results with small differences, it is impossible to know which results are more correct (say, accuracy compared to a mathematical result with infinite precision) without detailed analysis. Generally speaking, one should not expect results of floating-point computation to match bit-wise across any two platforms (where platform comprises hardware and software elements).

The properties of basic floating-point operations are mandated by the IEEE-754 standard, which is adhered to by both common CPU architectures and NVIDIA GPUs. Beyond that, much depends on libraries, language bindings, and compiler optimizations.

For example, the Intel Fortran compiler comes with a math library that lets users choose the accuracy of transcendental functions. It has three profiles: HA = high accuracy (errors < 1 ulp), LA = low accuracy (errors < 4 ulp), EP = enhanced performance (up to half the bits may be wrong). CUDA offers just one math library, with accuracy generally between the HA and LA profiles of Intel’s library, depending on function. There may also be differences based on what functions are offered by each language. E.g. the cbrt() function in C++ (and thus CUDA) is pretty much always more accurate than using Fortran-style computation x**(1D0/3D0).
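To illustrate the cbrt() point (a minimal host-side C++ sketch, not code from either library): pow(x, 1.0/3.0) first rounds 1/3 to the nearest double, so the exponent is already slightly wrong before the power function even runs, while cbrt() computes the cube root directly.

```
#include <cmath>
#include <cstdio>

int main() {
    double x = 1000.0;
    // cbrt() computes the cube root of x directly; for this input the
    // mathematically exact answer, 10.0, is representable, and a good
    // implementation typically returns it.
    double via_cbrt = std::cbrt(x);
    // The Fortran-style x**(1D0/3D0) corresponds to pow(x, 1.0/3.0):
    // 1.0/3.0 is already rounded, so the result is often off in the
    // last bits (e.g. 9.999999999999998 on typical implementations).
    double via_pow = std::pow(x, 1.0 / 3.0);
    printf("cbrt(x)       = %.17g\n", via_cbrt);
    printf("pow(x, 1./3.) = %.17g\n", via_pow);
    return 0;
}
```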

Fortran allows much more aggressive re-association of floating-point arithmetic than is performed by the CUDA toolchain. The Fortran rule is basically that all transformations are allowed that are mathematically equivalent. The problem with that is that floating-point arithmetic is not math: basic arithmetic is not associative, for example. CUDA is quite conservative and other than for FMA contraction (which can be disabled, as you noted), pretty much evaluates floating-point expressions as written.
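A minimal C++ illustration of that non-associativity, with values chosen only to make the effect obvious:

```
#include <cstdio>

int main() {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    // Mathematically both expressions equal 1, but in double precision
    // the result depends on how the additions are associated.
    double left  = (a + b) + c;  // (1e16 - 1e16) + 1  ->  1.0
    double right = a + (b + c);  // -1e16 + 1 rounds back to -1e16,
                                 // so the final sum is 0.0
    printf("(a+b)+c = %g\n", left);
    printf("a+(b+c) = %g\n", right);
    return 0;
}
```

A compiler that is allowed to re-associate can legally turn one form into the other.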

The Fortran compiler may also autovectorize code using SIMD instructions, leading to numerical differences through re-association of operations, for example in a reduction such as a dot product. Your CUDA code may likewise be using an explicitly parallelized reduction; we don't know.
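As a rough sketch of that effect (plain C++ with made-up data, standing in for whatever your reduction actually computes), compare a sequential sum with four interleaved partial sums of the kind a 4-wide SIMD loop or a parallel reduction produces:

```
#include <cstdio>

int main() {
    const int n = 1 << 20;          // divisible by 4
    static double x[n];
    for (int i = 0; i < n; i++) x[i] = 1.0 / (i + 1.0);  // made-up data

    // Sequential left-to-right summation, as a scalar loop would do it.
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];

    // Four interleaved partial sums, mimicking a 4-wide vectorized loop
    // or a parallel reduction; the additions are re-associated.
    double p[4] = {0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; j++) p[j] += x[i + j];
    double t = (p[0] + p[1]) + (p[2] + p[3]);

    // The two results typically differ in the last bits.
    printf("sequential: %.17g\npartial   : %.17g\ndiff      : %g\n",
           s, t, s - t);
    return 0;
}
```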

Setting the Fortran compiler to its strictest floating-point mode (e.g. -fp-model=strict) and turning off FMA contraction (-no-fma), while using --fmad=false with CUDA, may minimize differences. The reason for turning off FMA contraction is that it can be applied differently to the same expression. Consider a * b + c * d: should it become fma(a, b, c * d) or fma(c, d, a * b)?
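Here is a small C++ sketch of that ambiguity; the inputs are chosen purely so that the two contractions visibly disagree, and it should be built with FMA contraction disabled (e.g. -ffp-contract=off) so that the uncontracted line is evaluated as written:

```
#include <cmath>
#include <cstdio>

int main() {
    // Chosen so that a*b and c*d cancel exactly in real arithmetic,
    // which makes the effect of each contraction choice visible.
    double a = 1.0 + 1.0 / (1 << 27);   // 1 + 2^-27
    double b = a;
    double c = -a;
    double d = a;

    // No contraction: both products are rounded, then added.
    double plain = a * b + c * d;
    // Contraction 1: c*d is rounded, a*b is exact inside the FMA.
    double fma1 = std::fma(a, b, c * d);
    // Contraction 2: a*b is rounded, c*d is exact inside the FMA.
    double fma2 = std::fma(c, d, a * b);

    // Expected: 0, +2^-54 (~5.55e-17), and -2^-54 -- three different
    // answers for the same source expression a*b + c*d.
    printf("plain: %g\nfma1 : %g\nfma2 : %g\n", plain, fma1, fma2);
    return 0;
}
```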

But that doesn’t tell us which platform delivers more accurate results, or whether there is a difference in overall accuracy at all. If one platform deviates +1% from the “true” result and the other -1% from the same, there will be observable differences but the accuracy of both is identical.

You would have to measure that in a way appropriate to the use case, e.g. by computing norms or residuals, or by comparing against higher-precision references. If tiny local differences in the computation lead to large errors in the final result, the problem might be ill-conditioned (check the condition number), or a numerically sub-optimal algorithm may be in use.
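As a sketch of the kind of measurement I mean (the arrays below are hypothetical placeholders, not your model's output), one could compare each implementation against a trusted or higher-precision reference using a relative L2 error:

```
#include <cmath>
#include <cstdio>
#include <vector>

// Relative L2 error of 'test' against a trusted/higher-precision
// 'reference': ||test - reference|| / ||reference||.
double rel_l2_error(const std::vector<double>& test,
                    const std::vector<double>& reference) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < test.size(); i++) {
        double d = test[i] - reference[i];
        num += d * d;
        den += reference[i] * reference[i];
    }
    return std::sqrt(num / den);
}

int main() {
    // Placeholder values for illustration only.
    std::vector<double> ref  = {1.0, 2.0, 3.0};
    std::vector<double> cuda = {1.0 + 1e-14, 2.0, 3.0 - 2e-14};
    std::vector<double> fort = {1.0, 2.0 - 1e-14, 3.0};
    printf("CUDA    vs reference: %g\n", rel_l2_error(cuda, ref));
    printf("Fortran vs reference: %g\n", rel_l2_error(fort, ref));
    return 0;
}
```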

In most circumstances FMA contraction improves accuracy incrementally, due to the reduced number of roundings and some limited protection against subtractive cancellation.

Not sure what this means. If the output of your CUDA implementation differs from run to run, that may indicate non-determinism. Possible causes: (1) a race condition, (2) uninitialized data or an out-of-bounds access, (3) use of atomic floating-point operations. You can use Compute Sanitizer to check for (1) and (2).

Maybe you could try using MPFR to compute an "accurate enough" (maybe 1000-bit) reference result for your code and check whether the Fortran or the CUDA version gives the more accurate result.

MPFR (https://www.mpfr.org/) is often regarded as the golden reference for many floating-point operations.
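A minimal sketch of how that could look (assuming MPFR and GMP are installed; link with -lmpfr -lgmp; exp(2) here merely stands in for whatever expression your model actually evaluates):

```
#include <cmath>
#include <cstdio>
#include <mpfr.h>

int main() {
    // Compute exp(2) with 1000 bits of precision as a reference value.
    mpfr_t x, r;
    mpfr_init2(x, 1000);
    mpfr_init2(r, 1000);
    mpfr_set_d(x, 2.0, MPFR_RNDN);
    mpfr_exp(r, x, MPFR_RNDN);

    // Compare the ordinary double-precision result against it.
    double d   = std::exp(2.0);
    double ref = mpfr_get_d(r, MPFR_RNDN);  // reference rounded to double
    mpfr_printf("1000-bit reference: %.30Rg\n", r);
    printf("double exp(2.0)   : %.17g\n", d);
    printf("difference        : %g\n", d - ref);

    mpfr_clear(x);
    mpfr_clear(r);
    return 0;
}
```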