Float precision error in matrix multiplication application.

zhou0915 · February 26, 2014, 10:20am

When I run the matrix multiplication application in Linux system, I found a float precision error of 5e-1 between GPU and CPU results. My hardware is: Tesle M2075, CUDA 5.5, and regular Intel CPU. Is there someone confronting the same issue? Is ther any solution to avoid this error? Thank you.

zhou0915 · February 26, 2014, 10:24am

Besides, compute capability is of SM 2.0; when the same code running on “Quadro 600” with compute capability 2.1 of Windows system, no error is found between GPU and CPU results.

pasoleatis · February 26, 2014, 11:21am

Cpu does internally everything in 80 bit precision, while gpus are limited to float or double. Differences come in problems where you have summations like Fourier transforms and matrix vector multiplicaton where one can end with errors when the number added are very different from each other. The error will become smaller if you sue double precision o gpu.

zhou0915 · February 27, 2014, 2:22am

But I am wondering why this error only happen in linux + Tesle M2075, but not in windows + Quadro 600; Is it because of the difference in OS or GPU hardware?

Robert_Crovella · February 27, 2014, 2:25am

What is “the matrix multiplication application” ? Is it something you wrote? Can you give us the code that shows how you generate the results, and how you compare to the CPU results?

SPWorley · February 27, 2014, 2:54am

Most CPUs these days are x86-64, which use 64 bit IEEE doubles by default (using the SSE FPU on the scalar values). The GPU’s doubles are also 64 bit IEEE. x87 style 80-bit extended floating point is still available on x86/x86-64 CPUs with most C compilers by using “long double”, which uses a much much slower x87 execution path.

zhou0915 · February 27, 2014, 3:39am

the matrix multiplication application contains only regular operations of multiplication and addition. Results are the same as in: linux + CPU, windows + CPU and windows + Quadro 600. However, when in linux + Tesle M2075, the maximum error reach 5e-1; also in this case, GPU result is always greater, as mentioned in the earlier nVidia official document about the former SM1.0.

SPWorley · February 27, 2014, 4:04am

Are you sure the CUDA code is using floating point everywhere, not doubles, even once?
Even inline constants must be written with the “f” suffix to make sure intermediate answers are not computed in doubles.

The Quadro 600 is a very old card and doesn’t support double precision. The M2075 does. If you accidentally have any double promotions in your CUDA code, the M2075 will happily compute at the higher precision you’ve requested and therefore the results will not match the Quadro 600s which will have automatically demoted all math to floats. The fact that the Quadro does match the CPU implies that the CPU code is indeed using floats and not doubles.

It’s just a hypothesis, but it would match your clues.

zhou0915 · February 27, 2014, 5:21am

Yes, I have modified the CPU code to perform the double precision operation, but it still does not match the result from M2075.

zhou0915 · February 27, 2014, 5:57am

This issue has been solved, after replacing the * operation with the built-in function __fmul_rn; so the default round mode of the * operation is round up. Thanks to all.

Robert_Crovella · February 27, 2014, 6:11am

The default rounding mode should be rn.

[url]Floating Point and IEEE 754 :: CUDA Toolkit Documentation

njuffa · February 27, 2014, 8:03am

Based on message #10 above you are likely observing the effects of contraction of floating-point multiply and dependent floating-point add into an FMAD instruction (on sm_1x architecture) or FFMA instruction (on sm_20 and later architectures). This is done for performance reasons, and in the case of FFMA, for reasons of improved average accuracy. You can turn off FMA/FMAD-merging by passing the compiler flag -fmad=false.

Note that FFMA (single-precision fused multiply-add) will in general provide better accuracy than separate FMUL and FADD instructions, and that only the latest x86 CPUs include a fused multiply-add instruction. This means that single-precision GPU computations using FFMA are often more accurate than equivalent single-precision CPU computations, and results will not match between CPU and GPU. It is also possible that CPU computations may perform some intermediate computation in higher than single precision. You may want to read NVIDIA’s white paper: [url]http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

The FMAD instruction of sm_1x GPUs delivers different results from the FFMA instructions in later architectures, so results will typically differ between GPUs with sm_1x and >= sm_20. While the FMAD instruction is a fused operation like FFMA, it differs from the latter in that the intermediate product is truncated towards zero before the final addition that is rounded to nearest-or-even.

By contrast the FFMA instruction retains the full (unrounded) intermediate product prior to the addition, then applies a single rounding to the sum, where the rounding mode can be any of the four IEEE-754 rounding modes; it defaults to round-to-nearest-even. This means single-precision computation on sm_1x GPUs is often less accurate than on later GPU architectures, and may be less accurate than single-precision CPU computation.

To correctly assess the accuracy of GPU and CPU computations relative to each other, you would need to compare to a higher-precision reference. An analogy would be trying to assess the accuracy of two ordinary wrist watches. This is not possible by comparing the time from the two wrist watches to each other, instead both need to be compared to the time from a higher precision reference such as an atomic clock.

zhou0915 · February 27, 2014, 10:07am

In my test environment, the default round mode of the * operation in M2075 is round up, while the x86 CPUs are of round near; is it a bug or do the other type GPUs have not this issue?

zhou0915 · February 27, 2014, 10:10am

Also, if the default setting is using -fmad=true, and FFMA are often more accurate than equivalent single-precision CPU, the test results should not always be greater.

njuffa · February 27, 2014, 10:55am

The default rounding mode for FMUL, FADD, and FFMA is round-to-nearest-even. You can easily see this if you inspect the PTX and the SASS (machine code). To get round-to-positive infinity (“up”), you would have to code using intrinsics: __fadd_ru(), __fmul_ru(), __fmaf_ru().

Depending on your Linux platform and your host compiler settings, the floating-point computations in Linux may be performed in x87 registers using extended precision (paseolatis already mentioned this). Windows on the other hand configures the x87 FPU for double precision. In other words the host results would differ between the two platforms. The results from the Quadro 600 (an sm_21 device) and the M0275 (an sm_20 device), should match bit-for-bit as long as the same code, the same tool chain, and the same compiler settings are used (and there are no bugs in the GPU code).

Topic		Replies	Views
FMA precision issue CUDA Programming and Performance	9	19308	November 21, 2010
Is there a difference between GPU double precision and CPU double precision? CUDA Programming and Performance	14	10450	November 26, 2009
precision CUDA Programming and Performance	3	2614	December 16, 2008
Floating Point Accuracy CUDA Programming and Performance	11	30409	April 6, 2013
Question regarding Precision Issues in BLAS CUDA Programming and Performance	9	8504	March 4, 2010
Precision of floats does CUDA use half precision instead of single precision for floats? CUDA Programming and Performance	5	2272	March 15, 2010
Floating-point precision problems CUDA Programming and Performance	14	4344	January 7, 2011
Why accuracy CPU and GPU not equal? CUDA Programming and Performance	6	10876	October 28, 2014
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8201	September 9, 2011
More information about double precision in Guide? CUDA Programming and Performance	4	4036	May 30, 2008

Float precision error in matrix multiplication application.

Related topics