I’m porting a scientific application written in Fortran to GPUs.
After porting a subroutine to a device subroutine, I found that the results are not identical to those produced by the CPU.
After some experiments, I found that when executing the Fortran intrinsic EXP(), the GPU and the CPU produce different results.
Here is the code:
REAL, INTENT(IN)    :: DT     ! DT is a single-precision float
REAL, INTENT(IN)    :: HSCALE ! HSCALE is a single-precision float
REAL, INTENT(INOUT) :: RAUTO  ! RAUTO stores the result; also single precision
RAUTO = EXP(-DT/HSCALE)       ! The inputs are DT=60.0, HSCALE=10800.0
This simply computes RAUTO = e^(-DT/HSCALE), with DT=60.0 and HSCALE=10800.0.
When compiling with the PGI Fortran compiler for cc60 (Pascal) (command line: pgfortran -c -O3 -Mfree -Mcuda -Mcuda=cc60 filename), the result RAUTO is 0.994459807873 (I use a FORMAT statement to output more digits than the single-precision default).
When compiling with the same compiler for the x64 CPU using the emulation flag (command line: pgfortran -c -O3 -Mfree -Mcuda -Mcuda=emu filename), the result is 0.994459867477.
And if I compile the whole program with the emulation flag (that is, running the CUDA threads on the CPU), the output is identical to the original CPU program. Together with the finding above, this makes me believe the differences come from how the GPU and the CPU compute EXP.
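As a sanity check on how large the discrepancy actually is (this script is my own illustration, not part of the original program), the two printed values appear to differ by exactly one single-precision ULP, i.e. only the last bit of the float32 result differs:

```python
import math
import struct

def f32_bits(x):
    """Round x to IEEE-754 single precision and return its bit pattern."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

gpu = 0.994459807873   # result from the cc60 (GPU) build
cpu = 0.994459867477   # result from the emulation (CPU) build

# Distance between the two values in units in the last place (ULPs)
print(f32_bits(cpu) - f32_bits(gpu))  # → 1

# Double-precision reference value for comparison
print(math.exp(-60.0 / 10800.0))
```

So both results are very close to the true value; they merely round the last bit differently.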
So, is it true that the CPU and the GPU compute the EXP intrinsic differently? To my knowledge, the GPU has dedicated hardware called "Special Function Units" to evaluate such intrinsics, while the CPU relies on the compiler's software implementation.
By the way, is there any way to make the CPU and the GPU produce identical results for these intrinsics?