Floating point results mismatch: CPU vs OpenACC

Hi,

The results from the CPU run and the OpenACC GPU run do not match. Both executables were compiled with the PGI 15.4 Fortran compiler and the -Kieee option. The vimdiff of the two outputs is copied below (the differing digits are marked with **):


L, CFL[xyz]max: 1 0.5683299913571619 | L, CFL[xyz]max: 1 0.5683299913571619
0.1273464758709**242** 0.000000000000000 | 0.1273464758709**578** 0.000000000000000
L, CFL[xyz]max: 2 0.6531875907627893 | L, CFL[xyz]max: 2 0.6531875907627893
0.1595491883002**556** 9.1393459670539370E-002 | 0.1595491883002**926** 9.1393459670539370E-002
L, CFL[xyz]max: 3 0.6136218439579227 | L, CFL[xyz]max: 3 0.6136218439579227
0.13731536255**29625** 0.1561959406772873 | 0.13731536255**30031** 0.1561959406772873
L, CFL[xyz]max: 4 0.5379380549130**823** | L, CFL[xyz]max: 4 0.5379380549130**339**
0.1341207044526441 0.1989663002452234 | 0.1341207044526441 0.1989663002452234
L, CFL[xyz]max: 5 0.57548155648**40166** | L, CFL[xyz]max: 5 0.5754815564839401


The loop with the OpenACC directives applied is pasted below:

DO j = 1, numgbr
    inoutf = cldon*(j-1)
! one device kernel is launched per outer iteration j; the inner i loop runs in parallel
!$acc parallel firstprivate(inoutf)
!$acc loop
    DO i = 1, cldon
      zebfrcre(i) = frcre(inoutf+i)
!     zebfrcre(i) = 1.     !!!!!! MPL test 19052010
      zerm0(i) = rm0(inoutf+i)
      PHODI(i,1) = alumin1(inoutf+i)
      PHODI(i,2) = alumin2(inoutf+i)
!
      PHODI_NEW(i,1) = alumin1(inoutf+i)    !!!!! TO REVIEW (MPL): PHODI_NEW as a function of the SW bands
      do kk=2,NSW
        PHODI_NEW(i,kk) = alumin2(inoutf+i)
      enddo
      PBLPRE(i,1) = alumin1(inoutf+i)
      PBLPRE(i,2) = alumin2(inoutf+i)
!
      PBLPRE_NEW(i,1) = alumin1(inoutf+i)   !!!!! TO REVIEW (MPL): PBLPRE_NEW as a function of the SW bands
      do kk=2,NSW
        PBLPRE_NEW(i,kk) = alumin2(inoutf+i)
      enddo
      PASSIM(i) = 1.0    !!!!! TO REVIEW (MPL)
      PRLVW(i) = 1.66
      PPSOL(i) = PAHALE(inoutf+i,1)
      zeroxa1 = (PAHALE(inoutf+i,1)-pplay(inoutf+i,2))/(pplay(inoutf+i,1)-pplay(inoutf+i,2))
      zeroxa2 = 1.0 - zeroxa1
      PRLTAI(i,1) = t(inoutf+i,1) * zeroxa1 + t(inoutf+i,2) * zeroxa2
      PRLTAI(i,KLEV+1) = t(inoutf+i,KLEV)
      PDT0(i) = tsol(inoutf+i) - PRLTAI(i,1)
    ENDDO
!$acc end loop
!$acc end parallel
ENDDO

I’ve checked the PGI 15.4 User’s Guide for additional options to produce the same floating-point results on both the CPU and the GPU, but didn’t find anything relevant. Can you please guide me further on how to get matching results?

Hi SanBc,

This isn’t unexpected. Try adding “-ta=tesla:nofma” and/or “-Mnofma”, since FMA operations are often the cause of the differing results.
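
For example, a sketch of the build lines (assuming pgfortran is the compiler driver; “mycode.F90” is just a placeholder file name):

    GPU build, FMA generation disabled:
      pgfortran -acc -ta=tesla:nofma -Kieee -Minfo=accel mycode.F90
    CPU build, FMA generation disabled:
      pgfortran -Mnofma -Kieee mycode.F90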

-Mat

Hi mkcolg,

That solved the issue. Thanks for the help.

I’ve continued to apply the directives to other parts of the code. There the results again differ, at approximately the 12th decimal place. This part of the code involves arithmetic operations along with sqrt(), exp(), etc. I think we have to use some other options along with -ta=tesla:nofma. Can you please suggest which options to use?

Hi SanBc,

You can also add the flag “-ta=tesla:noflushz” to disable flush-to-zero mode for denormals. I’d also recommend adding “-Kieee” to ensure that your CPU results adhere to IEEE 754.
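
A sketch of the combined GPU flags (sub-options to -ta=tesla can be comma-separated; the file name is again a placeholder):

    pgfortran -acc -ta=tesla:nofma,noflushz -Kieee mycode.F90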

Another thing to look for is summations, where the order of operations affects accuracy given rounding error. This causes differences between parallel and sequential code.
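
For example, this small standalone program (a sketch, not taken from your code) sums the same values in two different orders and typically prints results that differ in the last digits:

    program sum_order
      implicit none
      integer, parameter :: n = 100000
      real :: a(n), s_fwd, s_bwd
      integer :: i
      do i = 1, n
         a(i) = 1.0 / real(i)      ! values spanning several orders of magnitude
      end do
      s_fwd = 0.0
      do i = 1, n                  ! sequential (CPU-like) accumulation order
         s_fwd = s_fwd + a(i)
      end do
      s_bwd = 0.0
      do i = n, 1, -1              ! a different order, as in a parallel reduction tree
         s_bwd = s_bwd + a(i)
      end do
      print *, 'forward sum : ', s_fwd
      print *, 'backward sum: ', s_bwd
    end program sum_order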

Note that FMA operations are actually more accurate, since there is less rounding. You’d also see these differences if you compared an x86 FMA-enabled system, such as Haswell, with a non-FMA x86 system.
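
To see why, here is a small standalone sketch (values chosen purely for illustration, not from your code) that emulates a single-precision FMA by doing the multiply-add in double precision, so the product is rounded only once:

    program fma_rounding
      implicit none
      real(kind=4) :: a, b, c, two_roundings
      real(kind=8) :: ad, bd, cd
      a =  1.0000001
      b =  1.0000001
      c = -1.0000002
      ! Unfused: the product a*b is rounded to single precision, then the add is
      ! rounded again (unless the compiler contracts this expression into an FMA,
      ! in which case both results will match).
      two_roundings = a*b + c
      ! FMA-like: the product of two singles is exact in double precision,
      ! so only one rounding back to single occurs at the end.
      ad = real(a, kind=8)
      bd = real(b, kind=8)
      cd = real(c, kind=8)
      print *, 'two roundings (mul then add):', two_roundings
      print *, 'one rounding  (FMA-like)    :', real(ad*bd + cd, kind=4)
    end program fma_rounding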

You might find this paper on IEEE 754 conformance on NVIDIA GPUs useful: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

  • Mat