Double Precision errors

Hi, I am Arun. I am working with CUDA Fortran. In one of my codes I am using double precision variables of type real(8), but unfortunately the results computed by the CPU and the GPU are not exactly the same. Why does double precision computation differ on GPUs compared to CPUs? Is there any solution to this?

Hi Arun,

There could be several reasons. For example, running in parallel can cause ordering differences, which in turn cause rounding differences.
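As a minimal illustration (plain Fortran, not taken from your code), summing the same values in two different orders can already change the last few bits of a double precision result, which is what happens when a parallel reduction reorders the additions:

program sum_order
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: x(n), s_fwd, s_bwd
  integer :: i

  ! values spanning several orders of magnitude
  do i = 1, n
     x(i) = 1.0d0 / real(i,8)**2
  end do

  ! same data, summed forward...
  s_fwd = 0.0d0
  do i = 1, n
     s_fwd = s_fwd + x(i)
  end do

  ! ...and summed backward
  s_bwd = 0.0d0
  do i = n, 1, -1
     s_bwd = s_bwd + x(i)
  end do

  print *, 'forward  sum =', s_fwd
  print *, 'backward sum =', s_bwd
  print *, 'difference   =', s_fwd - s_bwd
end program sum_order

Typically the two sums agree to about 15 significant digits but differ in the last bits, even though mathematically they are identical.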

Also, what CPU are you using? The GPU uses FMAs (fused multiply-adds) by default, and an FMA computes a*b+c with a single rounding, so if your CPU is not using FMAs this can cause differences in the last bits. Try adding “-Mcuda=nofma” to your compile flags to disable FMA on the GPU.

Also try adding “-Kieee” to have the compiler enforce IEEE 754 arithmetic. Though, precision differences due to parallel operations and FMA won't be affected by this flag.
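For reference, the two flags can be combined on one compile line, something like this (the file name is just a placeholder):

% pgf90 -Mcuda=nofma -Kieee mycode.cuf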

-Mat

Hi Mat.
Regarding the CPU, my laptop has an Intel Core i5 processor:
Intel® Core™ i5-4210U CPU @ 1.70GHz.

The GPU installed is a GeForce 820M with compute capability 2.1.

Regarding the computation difference with double precision: for example, a function does some computations and writes its results to four different 2D arrays. In one of the arrays the maximum error between the CPU and GPU arrays is of the order of 10^(-14), but the rest of the arrays show exactly zero error.
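Just to be clear about the error measure: it is the maximum absolute difference between the host result and the device result copied back to the host, along these lines (a simplified sketch with placeholder names and sizes, not the real code):

program compare_arrays
  use cudafor
  implicit none
  integer, parameter :: nx = 128, ny = 128
  real(8)         :: a_cpu(nx,ny), a_gpu(nx,ny)
  real(8), device :: a_dev(nx,ny)
  real(8) :: errf

  a_cpu = 1.0d0     ! stand-in for the CPU result
  a_dev = 1.0d0     ! stand-in for the GPU result

  a_gpu = a_dev                       ! copy the device result back to the host
  errf  = maxval(abs(a_cpu - a_gpu))  ! maximum absolute difference
  print *, 'max abs error =', errf
end program compare_arrays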

I think my CPU does use FMAs, because when I disabled FMAs using -Mcuda=nofma the results were worse: all the arrays had some finite error of the order of 10^(-8).

Adding -Kieee also worsened the results. I checked the function's algorithm and it has no issues; it is correct. What could be the problem? Any other suggestions?

Intel® Core™ i5-4210U CPU @ 1.70GHz.

This is a Haswell architecture, so it does have FMA.

Adding Kieee also worsened the results

This changed your CPU results? -Kieee disables optimizations that would cause precision differences, so the results should be more accurate. Hence my guess is that your algorithm is numerically sensitive. Are you doing any accumulation, such as a summation? Are there any uninitialized variables?
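If it does turn out to be a summation, one standard way to make an accumulation less sensitive to evaluation order is compensated (Kahan) summation. A general sketch, not specific to your code (it relies on -Kieee or similar so the compiler does not optimize the correction terms away):

function kahan_sum(x, n) result(s)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: x(n)
  real(8) :: s, c, y, t
  integer :: i

  s = 0.0d0
  c = 0.0d0              ! running compensation for lost low-order bits
  do i = 1, n
     y = x(i) - c        ! apply the stored correction to the next term
     t = s + y           ! add it to the running sum
     c = (t - s) - y     ! recover the bits that the addition dropped
     s = t
  end do
end function kahan_sum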

Can you post or send to PGI customer service (trs@pgroup.com) the code so I can take a look?

-Mat

Hi. I have mailed you the code at the given email address; please find it attached to the mail. The code is a shortened version of a very lengthy program and contains two functions, a serial one and a parallel one, and their results are compared.

I have explained as much as possible about the program in the mail and comments in the code.

You can see that on execution, even using FMAs, there is still a loss of accuracy.

Also, adding -Kieee in this part does not affect the solution, as you stated. I made a mistake in concluding that -Kieee worsened the solution; I am sorry for that.

Hi Arun,

Thanks for the example. It looks like it is an FMA issue, but just the opposite of what I guessed: here FMA is being generated for the host code but not for the device. Hence, you can add “-Mnofma” to disable all FMA code generation:

% pgf90 -Mcuda precision.cuf general.cuf -Mnofma ; a.out
precision.cuf:
general.cuf:
Device name:Tesla K80
Compute capability : 3.7
 errf in zm(surfzmgradz_cudaf)     =     0.000000000000000
 errf in surf(surfzmgradz_cudaf)   =     0.000000000000000
 errf in gradz(surfzmgradz_cudaf)  =     0.000000000000000
 errf in gradz2(surfzmgradz_cudaf) =     0.000000000000000

-Mat