Frame challenge: the CPU result has some precision errors compared to the GPU result. Don’t believe me? Prove me wrong!
Hint: In the majority of cases where I investigated reports of numerical mismatches between CPU and GPU results, the GPU results were in fact more accurate.
Without careful checking, we have no way of knowing which set of results is more accurate. One way of determining this is by comparing with reference computations performed at higher precision. Often, double-precision arithmetic is sufficient to check single-precision computation, and quadruple precision is sufficient to check double-precision computation.
Since floating-point arithmetic is not associative, re-arranging floating-point expressions (also called re-association) into mathematically equivalent variants can cause numerical differences in the final results. This means results can differ when changing compilers or changing optimization settings. Many host compilers provide switches that enforce strict adherence to IEEE-754 semantics. For example, on Linux, clang and the Intel compiler use `-ffp-model=strict` for this; the Intel compiler also accepts `-fp-model=strict` for backwards compatibility. Place these switches after the command-line switches specifying the optimization level.
The CUDA C++ compiler at default settings provides strict adherence to IEEE-754 semantics, with one exception: to enhance performance and average accuracy, it allows the contraction of an FMUL plus a dependent FADD into an FMA (fused multiply-add). This can be turned off by specifying `-fmad=false` on the `nvcc` command line, so you might want to try that. Note that turning off contraction may have a negative impact on both performance and accuracy. Strictly avoid any use of `-use_fast_math` or its constituent flags, such as `-prec-div=false`.
Your code snippet appears to be Fortran rather than C++, though. If so, please ask in the CUDA Fortran subforum for equivalent advice in a Fortran context. To my limited knowledge, Fortran, even with its IEEE bindings, gives compilers more leeway than C++ to re-arrange floating-point computations. I have always found this lack of programmer control surprising for a language targeted at numerical computing.
From developing across a number of different platforms over four decades, I can share that achieving exactly (bit-wise) matching results between any two platforms pretty much never happens for non-trivial computations. For a while, many programmers forgot this basic fact of life, since the computing world was an x86 monoculture. For regression testing it is therefore essential to rely on some sort of "third-party" reference or arbiter to establish whether relevant error bounds are being maintained. This may be a higher-precision computation, or the use of algorithms known to deliver more accurate results at a performance too low for production software.