Floating point results mismatch: CPU vs OpenACC


The CPU results and the OpenACC GPU results do not match. Both executables are compiled with the PGI 15.4 Fortran compiler and the -Kieee option. The vimdiff output of the two runs is copied below:

L, CFL[xyz]max: 1 0.5683299913571619 | L, CFL[xyz]max: 1 0.5683299913571619
0.1273464758709**242** 0.000000000000000 | 0.1273464758709**578** 0.000000000000000
L, CFL[xyz]max: 2 0.6531875907627893 | L, CFL[xyz]max: 2 0.6531875907627893
0.1595491883002**556** 9.1393459670539370E-002 | 0.1595491883002**926** 9.1393459670539370E-002
L, CFL[xyz]max: 3 0.6136218439579227 | L, CFL[xyz]max: 3 0.6136218439579227
0.13731536255**29625** 0.1561959406772873 | 0.13731536255**30031** 0.1561959406772873
L, CFL[xyz]max: 4 0.5379380549130**823** | L, CFL[xyz]max: 4 0.5379380549130**339**
0.1341207044526441 0.1989663002452234 | 0.1341207044526441 0.1989663002452234
L, CFL[xyz]max: 5 0.57548155648**40166** | L, CFL[xyz]max: 5 0.57548155648**39401**

The code with the OpenACC directives applied is pasted below:

DO j = 1, numgbr
    inoutf = cldon*(j-1)
!$acc parallel firstprivate(inoutf)
!$acc loop
    DO i = 1, cldon
      zebfrcre(i) = frcre(inoutf+i)
!     zebfrcre(i) = 1.     !!!!!!  test MPL 19052010
      zerm0(i) = rm0(inoutf+i)
      PHODI(i,1) = alumin1(inoutf+i)
      PHODI(i,2) = alumin2(inoutf+i)
      PHODI_NEW(i,1) = alumin1(inoutf+i)   !!!!! TO REVIEW (MPL): PHODI_NEW as a function of the SW bands
      DO kk = 2, NSW
        PHODI_NEW(i,kk) = alumin2(inoutf+i)
      ENDDO
      PBLPRE(i,1) = alumin1(inoutf+i)
      PBLPRE(i,2) = alumin2(inoutf+i)
      PBLPRE_NEW(i,1) = alumin1(inoutf+i)     !!!!! TO REVIEW (MPL): PBLPRE_NEW as a function of the SW bands
      DO kk = 2, NSW
        PBLPRE_NEW(i,kk) = alumin2(inoutf+i)
      ENDDO
      PASSIM(i) = 1.0    !!!!! TO REVIEW (MPL)
      PRLVW(i) = 1.66
      PPSOL(i) = PAHALE(inoutf+i,1)
      zeroxa1 = (PAHALE(inoutf+i,1)-pplay(inoutf+i,2))/(pplay(inoutf+i,1)-pplay(inoutf+i,2))
      zeroxa2 = 1.0 - zeroxa1
      PRLTAI(i,1) = t(inoutf+i,1) * zeroxa1 + t(inoutf+i,2) * zeroxa2
      PRLTAI(i,KLEV+1) = t(inoutf+i,KLEV)
      PDT0(i) = tsol(inoutf+i) - PRLTAI(i,1)
    ENDDO
!$acc end loop
!$acc end parallel

I’ve checked the PGI User’s Guide (PGIUG-15.4) for further options that would produce the same floating-point results on both CPU and GPU, but didn’t find anything relevant. Can you please advise how to get matching results?

Hi SanBc,

This isn’t unexpected. Try adding “-ta=tesla:nofma” (for the GPU code) and/or “-Mnofma” (for the CPU code), since fused multiply-add (FMA) operations are often the cause of differing results.
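To see why FMA alone can change the low-order digits, here is a minimal sketch in Python (chosen only because it runs anywhere; it is not how the PGI compiler or the GPU implements FMA). The fused path is emulated with exact rational arithmetic followed by a single rounding. The function names and input values are made up for illustration.

```python
from fractions import Fraction


def mul_add_unfused(a, b, c):
    """a*b is rounded to double, then the sum is rounded again (two roundings)."""
    product = a * b          # first rounding
    return product + c       # second rounding


def mul_add_fused(a, b, c):
    """Emulated FMA: a*b + c is formed exactly and rounded to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single, correctly rounded conversion


# Illustrative inputs: the exact product is 1 - 2**-60, which is not
# representable as a double and rounds to 1.0 in the unfused path.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("unfused:", mul_add_unfused(a, b, c))
print("fused  :", mul_add_fused(a, b, c))
```

With these inputs the unfused path returns exactly 0.0, while the fused path keeps the small term -2^-60. In a long computation, differences like this accumulate into the trailing digits of the output.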


Hi mkcolg,

That solved the issue. Thanks for the help.

I’ve continued to apply directives to other parts of the code. There I again see accuracy differences, at roughly the 12th decimal place. This part of the code involves arithmetic operations along with functions such as sqrt() and exp(). I think we need some other options in addition to -ta=tesla:nofma. Can you please suggest which options to use?

Hi SanBc,

You can also add the flag “-ta=tesla:noflushz” to disable flush-to-zero mode for denormals. I’d also recommend adding “-Kieee” to ensure that your CPU results are adhering to IEEE 754.
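As a rough illustration of what flush-to-zero means, the sketch below emulates it on the CPU in double precision with Python. This is only an emulation: which precisions and operations the GPU flag actually affects is not covered in this thread, and the helper names are hypothetical.

```python
import math
import sys

# Smallest positive normal double; anything smaller in magnitude is denormal.
SMALLEST_NORMAL = sys.float_info.min


def flush_to_zero(x):
    """Emulate flush-to-zero: replace a denormal value with a signed zero."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return math.copysign(0.0, x)
    return x


def subtract(a, b, ftz=False):
    """Compute a - b, optionally flushing a denormal result to zero."""
    d = a - b
    return flush_to_zero(d) if ftz else d


# Two distinct normal numbers whose difference is denormal.
a = 1.5 * SMALLEST_NORMAL
b = SMALLEST_NORMAL

print("gradual underflow:", subtract(a, b))
print("flush-to-zero    :", subtract(a, b, ftz=True))
```

With gradual underflow the difference of two distinct numbers is never zero. With flush-to-zero it can be, which matters if the result is later used as a divisor or tested against zero.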

Another thing to look for is summations, where the order of operations affects the result because of rounding error. This can cause differences between parallel and sequential code.
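A small Python sketch of the summation effect, assuming a parallel reduction that forms one partial sum per chunk and then combines them (the chunking scheme and function names are illustrative, not what the OpenACC runtime necessarily does):

```python
import math


def sum_sequential(values):
    """Left-to-right accumulation, as in a serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_chunked(values, nchunks):
    """Partial sum per chunk, then a sum of the partials (reduction-like)."""
    size = (len(values) + nchunks - 1) // nchunks
    partials = [sum_sequential(values[i:i + size])
                for i in range(0, len(values), size)]
    return sum_sequential(partials)


# One large term followed by many terms too small to change it individually.
values = [1.0] + [1e-16] * 1000

print("sequential:", sum_sequential(values))
print("chunked   :", sum_chunked(values, 2))
print("fsum      :", math.fsum(values))  # correctly rounded reference
```

Here the sequential sum stays at exactly 1.0 because each 1e-16 is lost when added to 1.0, while the chunked sum accumulates the small terms separately and ends up slightly above 1.0. Neither ordering is a bug; they are different roundings of the same mathematical sum.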

Note that FMA operations are actually more accurate, because the product and the sum are rounded once instead of twice. You’d also see these differences if you were to compare an FMA-enabled x86 system such as a Haswell with a non-FMA x86 system.
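Since both results can be valid roundings, a bitwise diff of the output is a stricter check than the arithmetic supports. A minimal sketch of a tolerance-based comparison in Python, using the first differing pair of values from the vimdiff output at the top of the thread (the 1e-12 tolerance and the function names are assumptions for illustration; pick a tolerance that suits your application):

```python
import math


def relative_difference(a, b):
    """|a - b| scaled by the larger magnitude; 0.0 if both are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


def results_agree(a, b, rel_tol=1e-12):
    """True if a and b match to within the given relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


# First differing pair from the diff in the original post.
left = 0.1273464758709242
right = 0.1273464758709578

print("bitwise equal :", left == right)
print("relative diff :", relative_difference(left, right))
print("agree @ 1e-12 :", results_agree(left, right))
```

For this pair the relative difference is roughly 2.6e-13, so the values fail a bitwise comparison but agree at a relative tolerance of 1e-12.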

You might find this paper on IEEE 754 conformance on NVIDIA GPUs useful: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

  • Mat