Help Needed: Precision Mismatch between GPU and CPU Calculations of a Van Leer Limiter (AAD)

Hello everyone,

I’m encountering a precision mismatch issue when performing calculations using CUDA, and I would appreciate any help or insights.

I am implementing the Van Leer limiter, and the expression I am evaluating is as follows (a minimal runnable sketch appears after the list below):

    AAD(1,II,IY) = (SIGN(AA, DFD1) + SIGN(AA, DFD2)) * &
                   (ABS(DFD1) * ABS(DFD2)) / (ABS(DFD1) + ABS(DFD2) + EP) * DX
Where:

  • DFD1 and DFD2 are differences of the initial values at two neighboring grid points.
  • EP is a small constant, EP = 1e-7.
  • AA is a constant, AA = 1.0.
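
For reference, here is a minimal, self-contained sketch of that expression as a scalar computation (the variable values are made up for illustration; the real code operates on arrays):

    program limiter_demo
      implicit none
      real, parameter :: EP = 1.0e-7, AA = 1.0
      real :: DFD1, DFD2, DX, AAD

      DFD1 = 0.3       ! difference of initial values at one grid-point pair
      DFD2 = -0.7      ! difference at the neighboring pair
      DX   = 0.01      ! grid spacing

      ! Van Leer limiter: harmonic-mean-style average that vanishes when
      ! DFD1 and DFD2 have opposite signs (the SIGN terms cancel).
      AAD = (SIGN(AA, DFD1) + SIGN(AA, DFD2)) * &
            (ABS(DFD1) * ABS(DFD2)) / (ABS(DFD1) + ABS(DFD2) + EP) * DX

      print *, 'AAD = ', AAD
    end program limiter_demo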

I have noticed that when performing the same calculations on CPU and GPU, the results are not consistent in terms of precision. Specifically, the GPU result has some precision errors compared to the CPU result.

I’ve double-checked the code and confirmed that the same input data is being used on both CPU and GPU, but the results differ. I suspect the issue could be related to floating-point precision or computation order on the GPU, but I am not sure of the exact cause.

I have a few questions:

  1. What could be causing the precision mismatch?
  2. Is there a way to ensure the GPU calculations match the CPU precision?
  3. Any advice or optimizations to avoid this issue?

Thank you in advance for your help!



Frame challenge: the CPU result has some precision errors compared to the GPU result. Don’t believe me? Prove me wrong!

Hint: In the majority of cases where I investigated reports of numerical mismatches between CPU and GPU results, the GPU results were in fact more accurate.

Without careful checking, we have no way of knowing which set of results is more accurate. One way of determining this is by comparing with reference computations performed at higher precision. Often, double-precision arithmetic is sufficient to check single-precision computation, and quadruple precision is sufficient to check double-precision computation.
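
As a sketch of that idea, the snippet below evaluates the limiter expression from the question once in single precision and once in double precision as a reference (the input values are illustrative, not the actual data):

    program precision_check
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real,     parameter :: EP  = 1.0e-7, AA    = 1.0
      real(dp), parameter :: EPD = 1.0d-7, AA_DP = 1.0_dp
      real     :: DFD1 = 0.3, DFD2 = 0.2, DX = 0.01, r32
      real(dp) :: d1, d2, r64

      ! Single-precision evaluation (same expression as in the question)
      r32 = (SIGN(AA, DFD1) + SIGN(AA, DFD2)) * &
            (ABS(DFD1) * ABS(DFD2)) / (ABS(DFD1) + ABS(DFD2) + EP) * DX

      ! Double-precision reference computed from the same single-precision inputs
      d1 = real(DFD1, dp)
      d2 = real(DFD2, dp)
      r64 = (SIGN(AA_DP, d1) + SIGN(AA_DP, d2)) * &
            (ABS(d1) * ABS(d2)) / (ABS(d1) + ABS(d2) + EPD) * real(DX, dp)

      ! Whichever of the CPU and GPU results lies closer to r64 is the
      ! more accurate one.
      print *, 'single:', r32, ' double ref:', r64, ' abs err:', abs(real(r32, dp) - r64)
    end program precision_check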

Since floating-point arithmetic is not associative, re-arranging floating-point expressions (also called re-association) into mathematically equivalent variants can cause numerical differences in final results. This means results can differ when changing compilers or changing optimization settings. Many host compilers provide compiler switches that enforce strict adherence to IEEE-754 semantics. For example, on Linux, clang and the Intel compiler use -ffp-model=strict for this; the Intel compiler also accepts -fp-model=strict for backwards compatibility. Use these switches after the command line switches specifying the optimization level.
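
A tiny example of that non-associativity, runnable with any Fortran compiler (the values are chosen to make the effect obvious):

    program reassoc_demo
      implicit none
      real :: a = 1.0e8, b = -1.0e8, c = 1.0
      ! Mathematically identical sums; in single precision the second
      ! one absorbs the 1.0 into -1.0e8 and yields 0.0 instead of 1.0.
      print *, '(a + b) + c = ', (a + b) + c
      print *, 'a + (b + c) = ', a + (b + c)
    end program reassoc_demo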

The CUDA C++ compiler at default settings provides strict adherence to IEEE-754 semantics, with one exception: To enhance performance and average accuracy, it allows the contraction of FMUL plus dependent FADD into FMA (fused multiply-add). This can be turned off by specifying -fmad=false on the nvcc command line, so you might want to try this. Again, this may have a negative impact on performance and accuracy. Strictly avoid any use of -use_fast_math or its constituent flags, such as -prec-div=false.
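
The effect of contraction can be illustrated without GPU hardware: the sketch below uses double precision to stand in for the exact intermediate product that an FMA keeps, with values contrived so the single-rounding difference is visible:

    program fma_demo
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real     :: a, b, c, sep
      real(dp) :: fused

      a = 1.0 + 2.0**(-12)   ! chosen so that a*b is inexact in single precision
      b = a
      c = -1.0

      ! Separate multiply and add: the product is rounded to single precision
      ! before the addition. Compile without host FMA contraction (e.g. with
      ! a strict fp model) so this line behaves as written.
      sep = a * b + c

      ! An FMA keeps the full product and rounds only once; double precision
      ! is wide enough here to stand in for that exact intermediate.
      fused = real(a, dp) * real(b, dp) + real(c, dp)

      print *, 'mul+add  :', sep
      print *, 'fma-like :', real(fused)
    end program fma_demo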

Your code snippet appears to be Fortran and not C++, though. If so, please ask in the CUDA Fortran subforum for equivalent advice in a Fortran context. To my limited knowledge, Fortran, even with the IEEE bindings, gives compilers more leeway than C++ to re-arrange floating-point computation. I have always found this lack of programmer control surprising for a language targeted at numerical computations.

From developing across a number of different platforms for four decades, I can share that achieving exactly (bit-wise) matching results between any two platforms pretty much never happens for non-trivial computations. For a while, many programmers forgot this basic fact of life, since the computing world was an x86 monoculture. For regression testing it is therefore essential to rely on some sort of “third-party” reference or arbiter to establish whether relevant error bounds are being maintained. This may involve higher-precision computation or the use of algorithms known to provide more accurate results, albeit at a performance level unsuitable for production software.
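
A minimal sketch of such a regression check, with a made-up tolerance and placeholder results, might look like this:

    program regression_check
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real(dp), parameter :: rel_tol = 1.0d-6   ! assumed acceptable error bound
      real     :: cpu_result, gpu_result
      real(dp) :: ref

      cpu_result = 4.8828125e-4    ! placeholder values
      gpu_result = 4.8834085e-4
      ref        = 4.8834085d-4    ! e.g. from a higher-precision reference run

      print *, 'CPU within bound: ', within(cpu_result)
      print *, 'GPU within bound: ', within(gpu_result)

    contains

      logical function within(x)
        real, intent(in) :: x
        ! Relative error against the reference, guarded against ref == 0
        within = abs(real(x, dp) - ref) <= rel_tol * max(abs(ref), tiny(1.0_dp))
      end function within

    end program regression_check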


You are probably seeing the use of FMA instructions on the GPU (which, as Norbert already said, are usually more accurate than FMUL plus FADD). Depending on the generation of your CPU and where the floating-point operations are performed, you may get FMA operations on the CPU as well (but that does not seem to be the case here).
If you print the values in hex format, the difference will probably be in the last bit.
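
For example, in Fortran the bit patterns can be dumped with a Z edit descriptor (placeholder values; this assumes the usual 32-bit default integer and real):

    program hex_compare
      implicit none
      real :: cpu_val, gpu_val
      cpu_val = 4.8828125e-4    ! placeholder CPU result
      gpu_val = 4.8834085e-4    ! placeholder GPU result
      ! TRANSFER reinterprets the 32-bit real as a 32-bit integer so the
      ! raw bits can be printed in hexadecimal.
      write (*, '(A, Z8.8)') 'CPU bits: ', transfer(cpu_val, 0)
      write (*, '(A, Z8.8)') 'GPU bits: ', transfer(gpu_val, 0)
    end program hex_compare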

In CUDA Fortran, it is possible to disable generating GPU FMA instructions using the compiler option -gpu=nofma.

There is also a flag to disable FMA on the CPU, -Mnofma. It is important to note that this is a global compiler flag; to keep generating FMA instructions on the GPU, we would need to explicitly specify -gpu=fma.
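
A usage sketch, assuming a source file named app.cuf (the file name is a placeholder):

    nvfortran -O2 -gpu=nofma app.cuf
        (disables FMA generation on the GPU)
    nvfortran -O2 -Mnofma -gpu=fma app.cuf
        (disables FMA on the CPU while keeping it enabled on the GPU)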

