Precision of the Fortran intrinsic EXP()

Hi,
I’m porting a scientific application written in Fortran to GPUs.
After porting one subroutine to a device subroutine, I found that its results are not identical to those produced on the CPU.
After some experiments, I found that the GPU and the CPU produce different results when evaluating the Fortran intrinsic EXP().

Here is the code:

REAL , INTENT(IN) :: DT        ! DT is a single-precision float
REAL , INTENT(IN) :: HSCALE    ! HSCALE is a single-precision float
REAL , INTENT(INOUT) :: RAUTO  ! RAUTO holds the result; also a single-precision float

RAUTO=EXP(-DT/HSCALE) ! The input DT=60.0, HSCALE=10800.0

This code simply computes RAUTO = e^(-DT/HSCALE), with DT=60.0 and HSCALE=10800.0.

When compiling with the PGI Fortran compiler for cc60 (Pascal) (command line: pgfortran -c -O3 -Mfree -Mcuda -Mcuda=cc60 filename), the result RAUTO is 0.994459807873 (I use a FORMAT to print more digits than the default for a single-precision number).
When compiling with the same compiler for the x64 CPU using the emulation flag (command line: pgfortran -c -O3 -Mfree -Mcuda -Mcuda=emu filename), the result is 0.994459867477.

Also, if I compile the whole program with the emulation flag (that is, running the CUDA threads on the CPU), the output is identical to that of the original CPU program. Together with the finding above, I believe the differences are due to how the GPU and the CPU compute EXP.

So, is it true that the CPU and the GPU compute the EXP intrinsic differently? To my knowledge, the GPU has dedicated hardware called “special function units” that computes such intrinsics, while the CPU relies on the compiler’s software implementation.

By the way, is there any way to make the CPU and the GPU produce identical results for these intrinsics?

Thank you!

Yes. Separate math libraries are used for the host toolchain and for the device toolchain; exp() is computed by software subroutines in either case. Keep in mind that floating-point literals like those shown above are single precision in Fortran but double precision in C++ (CUDA), which forces the computation to be single precision vs. double precision, so the results will obviously differ.

0.99445984800489675087954734 // reference
0.994459807873 // CUDA
0.994459867477 // Fortran

Both reported results appear to be accurate to single precision (<= 7 decimal digits). Note that, contrary to the basic arithmetic operations, which are standardized by the IEEE-754 floating-point standard, transcendental functions like exp() do not typically return correctly rounded results, so different results from different math libraries are common.
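As an aside regarding the special function units: on the GPU they back only the fast, reduced-accuracy intrinsics such as __expf(); the regular exp()/expf() is a software routine in the device math library. A minimal sketch (hypothetical kernel, for illustration only) contrasting the two on the values from the question:

#include <stdio.h>
#include <math.h>

__global__ void compare_exp (float x)
{
    float accurate = expf (x);    // standard device math library routine (software)
    float fast     = __expf (x);  // fast hardware-based approximation (lower accuracy)
    printf ("expf   = %.16f\n", accurate);
    printf ("__expf = %.16f\n", fast);
}

int main (void)
{
    compare_exp<<<1,1>>>(-60.0f / 10800.0f);
    cudaDeviceSynchronize();
    return 0;
}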

This comparison of single-precision and double-precision computation may be of interest. First, the single-precision case:

Fortran

PROGRAM TEST
      IMPLICIT NONE
      REAL*4 RAUTO, DT, HSCALE
      DT = 60.0
      HSCALE = 10800.0
      RAUTO = EXP(-DT/HSCALE)
      WRITE (*, '("RAUTO=", (F19.16))'), RAUTO
      STOP
      END

CUDA

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (void)
{
    float rauto, dt, hscale;
    dt = 60.0f;
    hscale = 10800.0f;
    rauto = exp(-dt/hscale);
    printf ("rauto= %.16f\n", rauto);
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

Output (note: accurate to single precision, but more digits are printed):

RAUTO= 0.9944598674774170    // Fortran
rauto= 0.9944598078727722    // CUDA

Now, the same in double precision:

Fortran

PROGRAM TEST
      IMPLICIT NONE
      REAL*8 RAUTO, DT, HSCALE
      DT = 60.0d0
      HSCALE = 10800.0d0
      RAUTO = EXP(-DT/HSCALE)
      WRITE (*, '("RAUTO=", (F19.16))'), RAUTO
      STOP
      END

CUDA

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (void)
{
    double rauto, dt, hscale;
    dt = 60.0;
    hscale = 10800.0;
    rauto = exp(-dt/hscale);
    printf ("rauto= %.16f\n", rauto);
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

Output:

RAUTO= 0.9944598480048967  // Fortran
rauto= 0.9944598480048967  // CUDA

Thank you for the detailed explanation!
Now I understand that the different results are due to the different behavior of the GPU and CPU libraries, as the GPU library forces exp() to be computed in double precision and then converts the result back to single precision, while the CPU library does not do that. The non-standard rounding will also make a difference in the final single-precision result. (I also printed RAUTO in hexadecimal format: the CPU output is “3F7E94EC” (0.9944599) and the GPU output is “3F7E94EB” (0.9944598); this may be due to different rounding of the different results.)

Since the original program uses single-precision variables, I think I will have to live with these differences. Despite the difference in the final output of the CPU and GPU versions of the program, can I still show that my port is correct if I compile the CUDA Fortran program with the -Mcuda=emu flag and the final result is identical to that of the original program? I believe that runs the same CUDA program, but on the host processor with the host libraries (setting aside the precision issue).

Thank you!

Sorry, I must have explained it poorly. C++ (and therefore CUDA) as well as modern Fortran use generic function names, where the type of the argument determines whether the single-precision or the double-precision version of the function is invoked. Also keep in mind that, for example, 1.0 in Fortran is a single-precision floating-point literal constant, whereas 1.0 in C++ is a double-precision floating-point literal. So:

EXP(1.0) ! Fortran: single-precision argument causes the single-precision exponential to be invoked
exp(1.0) // C++/CUDA: double-precision argument causes the double-precision exponential to be invoked
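To make that concrete, here is a minimal runnable sketch (my own example, not code from your application) that prints both variants from a device kernel; the double literal selects the double-precision exp(), the float literal the single-precision overload:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (void)
{
    double d = exp (1.0);   // 1.0 is a double literal, so double-precision exp() is invoked
    float  f = exp (1.0f);  // 1.0f is a float literal, so the single-precision overload is invoked
    printf ("exp(1.0)  = %.16f\n", d);
    printf ("exp(1.0f) = %.16f\n", f);
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}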

In your code it looks like you are using single-precision variables, so both the CPU and the GPU perform the computation in single precision here.

Your approach of looking at the hexadecimal encodings of the two results shows that they differ by only 1 in the least significant bit (a “1 ulp” difference, for those more deeply involved with floating-point computation; see the Wikipedia entry for ‘ulp’).
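For completeness, a small host-only sketch (hypothetical helper name, using the two values from above) that reinterprets the bit patterns and reports the ulp distance:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer */
static uint32_t float_to_bits (float f)
{
    uint32_t u;
    memcpy (&u, &f, sizeof u);
    return u;
}

int main (void)
{
    float cpu = 0.9944598674774170f;  /* host result, prints as 3F7E94EC */
    float gpu = 0.9944598078727722f;  /* device result, prints as 3F7E94EB */
    uint32_t a = float_to_bits (cpu);
    uint32_t b = float_to_bits (gpu);
    /* for finite floats of the same sign, the difference of the bit patterns
       is the distance in ulps */
    printf ("cpu=%08X gpu=%08X ulp difference=%u\n",
            (unsigned)a, (unsigned)b, (unsigned)(a - b));
    return 0;
}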

Such differences between different math libraries are common: something one should be aware of, but nothing to be concerned about. Other than for well-defined integer computations, the results of non-trivial floating-point computations often do not match bit-for-bit across computational platforms.

In establishing the correctness of a computation, one cannot directly compare two implementations computing at the same precision. If there is a discrepancy, which one is more correct? We cannot tell. Either use a higher-precision reference implementation, or use a self-checking metric such as the residue from plugging the computed solution back into the original equation(s). This latter technique is often used for solvers.
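As a tiny illustration of the residual idea (a made-up 2x2 system; the names and values are mine, not from your application): instead of comparing a GPU solution bit-for-bit against a CPU solution, plug the computed solution back into the original system and check that the residual is small relative to the right-hand side:

#include <stdio.h>
#include <math.h>

int main (void)
{
    /* hypothetical 2x2 system A*x = b with exact solution x = (1, 2) */
    float A[2][2] = {{3.0f, 1.0f}, {1.0f, 2.0f}};
    float b[2]    = {5.0f, 5.0f};

    /* pretend this solution came back from the GPU solver (slightly perturbed) */
    float x[2]    = {0.9999999f, 2.0000001f};

    /* residual r = b - A*x; a small relative residual indicates a good solution,
       regardless of which platform produced x */
    float r0  = b[0] - (A[0][0]*x[0] + A[0][1]*x[1]);
    float r1  = b[1] - (A[1][0]*x[0] + A[1][1]*x[1]);
    float rel = sqrtf (r0*r0 + r1*r1) / sqrtf (b[0]*b[0] + b[1]*b[1]);
    printf ("relative residual = %.3e\n", rel);
    return 0;
}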