# Differences between GPU and CPU results with -nofma

Hi,

To validate code generated with PGI acc directives, we are systematically comparing GPU and CPU outputs. To estimate what differences one may expect from rounding-error propagation, we first run the CPU code with a relative perturbation of about 10^-15 applied to the input fields and compare with the original results.
We then run the GPU code and evaluate differences with respect to our previous estimates.
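
To make the procedure concrete, here is a minimal sketch of how such a perturbation baseline could be computed on the CPU. It is not our actual test harness: the program name, the `kernel` function (a simple sigma*T**4 stand-in) and the random input field are hypothetical and only serve to illustrate the idea.

```fortran
program perturbation_baseline
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  ! Relative perturbation size quoted in the text above.
  real(dp), parameter :: rel_pert = 1.0e-15_dp
  real(dp) :: t(n), t_pert(n), ref(n), pert(n), noise(n)
  real(dp) :: max_rel_diff

  ! Hypothetical input field: temperatures spread around 273 K.
  call random_number(noise)
  t = 273.0_dp + 10.0_dp * (noise - 0.5_dp)

  ! Random relative perturbation with magnitude up to rel_pert.
  call random_number(noise)
  t_pert = t * (1.0_dp + rel_pert * (2.0_dp * noise - 1.0_dp))

  ! Run the same kernel on the original and the perturbed input.
  ref  = kernel(t)
  pert = kernel(t_pert)

  ! Largest relative difference: the baseline to compare GPU results against.
  max_rel_diff = maxval(abs(pert - ref) / abs(ref))
  print *, 'Max rel. diff from input perturbation:', max_rel_diff

contains

  ! Stand-in for the real computation (assumption: sigma * T**4 only).
  elemental function kernel(x) result(y)
    real(dp), intent(in) :: x
    real(dp) :: y
    y = 5.6697e-8_dp * (x**2)**2
  end function kernel

end program perturbation_baseline
```

The GPU/CPU difference is then judged against `max_rel_diff` rather than against an absolute threshold.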

This is clearly not a perfect method, but so far it has been working fine provided that the GPU code is compiled with

-ta=nvidia,nofma.

In one of our test codes, however, the GPU results are off the expected values. I have been able to reproduce this behaviour here:

```fortran
program main
  implicit none
  integer*4 :: N,nlev,ip,k
  real*8, allocatable :: pbbr(:,:), pbbr_ref(:,:), pti(:,:)
  real*8 :: psig, planck(3)

  N=1E3
  nlev=4

  allocate(pbbr(N,nlev),pbbr_ref(N,nlev))
  allocate(pti(N,nlev))

  !----------------------------------
  !init
  pti(:,:)=2.730661634310878D+02
  planck(1)=1.57656D0
  planck(2)=-7.114856D-3
  planck(3)=9.0822D-6
  psig=5.6697D-8

  !----------------------------------
  !1: compute on cpu

  DO k = 1, nlev
    DO ip = 1, N
      pbbr_ref(ip,k)= ( planck(1) + pti(ip,k)                 &
           * ( planck(2) + pti(ip,k)*planck(3) ) ) &
           * psig * (pti(ip,k)**2)**2
    ENDDO
  ENDDO
  !----------------------------------
  !2: compute on gpu

  !$acc region do seq, copyin(pti), copyout(pbbr), copyin(planck)
  DO k = 1, nlev
    !$acc do parallel vector(256)
    DO ip = 1, N
      pbbr(ip,k)= ( planck(1) + pti(ip,k)                 &
           * ( planck(2) + pti(ip,k)*planck(3) ) ) &
           * psig * (pti(ip,k)**2)**2
    ENDDO
  ENDDO

  print*, 'pbbr_cpu',  pbbr_ref(1,1)
  print*, 'pbbr',  pbbr(1,1)
  print*, 'Rel. diff CPU/GPU:', (pbbr_ref(1,1)-pbbr(1,1))/pbbr_ref(1,1)

end program main
```
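
For reference, the expression evaluated in both loops can be written as follows, where $T$, $p_i$ and $\sigma$ are shorthand I am introducing here for `pti(ip,k)`, `planck(i)` and `psig`:

$$
B(T) = \bigl(p_1 + T\,(p_2 + T\,p_3)\bigr)\,\sigma\,\bigl(T^2\bigr)^2
$$

The two nested sums, $p_2 + T\,p_3$ and $p_1 + T\,(\cdot)$, are multiply-add patterns, so these are the places where a compiler may contract the operations into a fused multiply-add. An FMA rounds once where the separate multiply and add round twice:

$$
\mathrm{fma}(a,b,c) = \mathrm{round}(a\,b + c)
\qquad\text{versus}\qquad
\mathrm{round}\bigl(\mathrm{round}(a\,b) + c\bigr)
$$

In double precision I would expect either variant to change the result only at the level of a few units in the last place, which is the range we see in our other kernels.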

If I now compile the code with:
pgf90 -r8 -O2 -Kieee -ta=nvidia,nofma -o test_pbbr test_pbbr.f90

I get:
./test_pbbr
pbbr_cpu 98.02137331998863
pbbr 98.02138173727505
Rel. diff CPU/GPU: -8.5871949538446817E-008

Note that the difference is not huge, but it is still puzzling. I also don’t think it is related to the hardware, since I implemented the same code in plain CUDA and the relative difference between GPU and CPU was of the order of 10^-16.
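
To put the size of this difference in context, the small sketch below expresses it in units of machine epsilon, using the two values from the output quoted above. The program name is hypothetical, and this is only a way of looking at the magnitude, not a diagnosis of the cause.

```fortran
program diff_scale
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Values copied from the program output quoted above.
  real(dp), parameter :: pbbr_cpu = 98.02137331998863_dp
  real(dp), parameter :: pbbr_gpu = 98.02138173727505_dp
  real(dp) :: rel_diff

  rel_diff = abs(pbbr_cpu - pbbr_gpu) / abs(pbbr_cpu)

  ! Compare the difference against double and single precision epsilon.
  print *, 'Rel. diff              :', rel_diff
  print *, 'In units of double eps :', rel_diff / epsilon(1.0_dp)
  print *, 'In units of single eps :', rel_diff / real(epsilon(1.0), dp)
end program diff_scale
```

A relative difference of about 8.6E-8 is many orders of magnitude above double-precision epsilon and of the same order as single-precision epsilon, which is part of why it looks suspicious to me.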

One remarkable thing is that if I now compile without nofma:

pgf90 -r8 -O2 -Kieee -ta=nvidia -o test_pbbr test_pbbr.f90
./test_pbbr
pbbr_cpu 98.02137331998863
pbbr 98.02137331998860
Rel. diff CPU/GPU: 2.8995420557536935E-016

The difference is then in the expected range. (This is not a satisfactory solution, however, as other kernels show larger differences without the nofma option.)

Any idea what is going on here? Also, if you have some general suggestions on which compile options I should use for both the CPU and GPU code for validation purposes, that would be great.