Differences between GPU and CPU results with -nofma

Hi,

To validate code generated with the PGI acc directives, we are systematically comparing GPU and CPU outputs. To estimate what differences one may expect from rounding-error propagation, we first run the CPU code with a relative perturbation of about 10^-15 applied to the input fields and compare with the original results.
We then run the GPU code and evaluate its differences with respect to these estimates.

This is clearly not a perfect method, but so far it has been working fine provided that the GPU code is compiled with

-ta=nvidia,nofma.
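For reference, the perturbation estimate described above can be sketched roughly as follows. This is only an illustration, not our actual test harness: the program name, the helper function bbr, and the choice of a uniform random perturbation in [-1E-15, 1E-15] are assumptions made for this sketch. The formula and constants are the ones from the test case further down.

```fortran
program perturbation_estimate
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: n = 1000, nlev = 4
  ! Relative size of the input perturbation (about 10^-15, as described above)
  real(dp), parameter :: eps_rel = 1.0e-15_dp
  real(dp), allocatable :: pti(:,:), pti_pert(:,:), noise(:,:)
  real(dp) :: planck(3), psig, est

  allocate(pti(n,nlev), pti_pert(n,nlev), noise(n,nlev))

  pti    = 2.730661634310878e+02_dp
  planck = (/ 1.57656e0_dp, -7.114856e-3_dp, 9.0822e-6_dp /)
  psig   = 5.6697e-8_dp

  ! Assumption: uniform random relative perturbation in [-eps_rel, eps_rel]
  call random_number(noise)
  pti_pert = pti * (1.0_dp + eps_rel*(2.0_dp*noise - 1.0_dp))

  ! Largest relative change of the CPU result caused by the perturbation
  est = maxval(abs(bbr(pti_pert) - bbr(pti)) / abs(bbr(pti)))
  print *, 'Estimated rel. diff from input perturbation:', est

contains

  ! Same expression as in the test case; planck and psig come from the host
  elemental function bbr(t) result(b)
    real(dp), intent(in) :: t
    real(dp) :: b
    b = ( planck(1) + t*( planck(2) + t*planck(3) ) ) * psig * (t**2)**2
  end function bbr

end program perturbation_estimate
```

The GPU/CPU relative difference is then judged against this estimate, with some safety factor that remains a judgment call.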

In one of our test codes, however, the GPU results are off the expected values. I have been able to reproduce this behaviour here:

program main
  implicit none
  integer*4 :: N,nlev,ip,k
  real*8, allocatable :: pbbr(:,:), pbbr_ref(:,:), pti(:,:)
  real*8 :: psig, planck(3)

  N=1E3
  nlev=4
  
allocate(pbbr(N,nlev),pbbr_ref(N,nlev))
allocate(pti(N,nlev))


!----------------------------------
!init
pti(:,:)=2.730661634310878D+02
planck(1)=1.57656D0
planck(2)=-7.114856D-3
planck(3)=9.0822D-6
psig=5.6697D-8

!----------------------------------
!1: compute on cpu

    DO k = 1, nlev
       DO ip = 1, N
          pbbr_ref(ip,k)= ( planck(1) + pti(ip,k)                 &
               * ( planck(2) + pti(ip,k)*planck(3) ) ) &
               * psig * (pti(ip,k)**2)**2
       ENDDO
    ENDDO
!----------------------------------
!2: compute on gpu

    !$acc region do seq, copyin(pti), copyout(pbbr), copyin(planck)
    DO k = 1, nlev
       !$acc do parallel vector(256)
       DO ip = 1, N
          pbbr(ip,k)= ( planck(1) + pti(ip,k)                 &
               * ( planck(2) + pti(ip,k)*planck(3) ) ) &
               * psig * (pti(ip,k)**2)**2
       ENDDO
    ENDDO

print*, 'pbbr_cpu',  pbbr_ref(1,1) 
print*, 'pbbr',  pbbr(1,1) 
print*, 'Rel. diff CPU/GPU:', (pbbr_ref(1,1)-pbbr(1,1))/pbbr_ref(1,1)



end program main

If I now compile the code with:
pgf90 -r8 -O2 -Kieee -ta=nvidia,nofma -o test_pbbr test_pbbr.f90

I get:
./test_pbbr
pbbr_cpu 98.02137331998863
pbbr 98.02138173727505
Rel. diff CPU/GPU: -8.5871949538446817E-008


Note that the difference is not huge, but it is still puzzling. I also don’t think it is related to the hardware, since I implemented the same computation in plain CUDA and there the relative difference between GPU and CPU was of the order of 10^-16.


One remarkable thing is that if I now compile without the nofma option:

pgf90 -r8 -O2 -Kieee -ta=nvidia -o test_pbbr test_pbbr.f90
./test_pbbr
pbbr_cpu 98.02137331998863
pbbr 98.02137331998860
Rel. diff CPU/GPU: 2.8995420557536935E-016

The difference is in the expected range. (This is however not a satisfying solution as other kernels show larger differences without the nofma option).

Any idea what is going on here? Also, if you have general suggestions on which compile options I should use for both the CPU and GPU code for validation purposes, that would be great.

Thanks for your help,

Xavier

Hi Xavier,

It looks like a problem with the compiler’s code generation. For some reason the compiler is using a float to store the result of the pow operation when “nofma” is used, which causes rounding errors. Without “nofma” a double is used.
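One way to check this explanation on the CPU is to emulate the single-precision intermediate by hand. The sketch below is an assumption about which value gets truncated (here pti**2, before the outer square); it is not taken from the generated GPU code, and the program and variable names are made up for illustration.

```fortran
program float_intermediate
  implicit none
  ! Explicit kinds so that -r8 does not promote the single-precision variable
  integer, parameter :: sp = selected_real_kind(6, 37)
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: pti, psig, planck(3), poly, ref, emu
  real(sp) :: t2_sp

  pti    = 2.730661634310878e+02_dp
  planck = (/ 1.57656e0_dp, -7.114856e-3_dp, 9.0822e-6_dp /)
  psig   = 5.6697e-8_dp

  poly = planck(1) + pti*( planck(2) + pti*planck(3) )

  ! Reference: everything in double precision
  ref = poly * psig * (pti**2)**2

  ! Assumption: the inner power pti**2 is rounded to single precision,
  ! then the rest of the expression continues in double precision
  t2_sp = real(pti**2, sp)
  emu   = poly * psig * real(t2_sp, dp)**2

  print *, 'pbbr_ref      ', ref
  print *, 'pbbr_emulated ', emu
  print *, 'Rel. diff:    ', (ref - emu)/ref
end program float_intermediate
```

Rounding pti**2 to single precision introduces a relative error of up to about 6E-008, and squaring the result doubles it. On an IEEE-754 machine this sketch should therefore give a relative difference of about -8.6E-008, the same size and sign as the one reported with nofma.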

I have filed this problem as TPR#18262 and sent it off to our engineers. Most likely it will be easy to fix.

Thanks for letting us know!
Mat