When I run the matrix multiplication application in Linux system, I found a float precision error of 5e-1 between GPU and CPU results. My hardware is: Tesle M2075, CUDA 5.5, and regular Intel CPU. Is there someone confronting the same issue? Is ther any solution to avoid this error? Thank you.
Besides, compute capability is of SM 2.0; when the same code running on “Quadro 600” with compute capability 2.1 of Windows system, no error is found between GPU and CPU results.
Cpu does internally everything in 80 bit precision, while gpus are limited to float or double. Differences come in problems where you have summations like Fourier transforms and matrix vector multiplicaton where one can end with errors when the number added are very different from each other. The error will become smaller if you sue double precision o gpu.
But I am wondering why this error only happen in linux + Tesle M2075, but not in windows + Quadro 600; Is it because of the difference in OS or GPU hardware?
What is “the matrix multiplication application” ? Is it something you wrote? Can you give us the code that shows how you generate the results, and how you compare to the CPU results?
Most CPUs these days are x86-64, which use 64 bit IEEE doubles by default (using the SSE FPU on the scalar values). The GPU’s doubles are also 64 bit IEEE. x87 style 80-bit extended floating point is still available on x86/x86-64 CPUs with most C compilers by using “long double”, which uses a much much slower x87 execution path.
the matrix multiplication application contains only regular operations of multiplication and addition. Results are the same as in: linux + CPU, windows + CPU and windows + Quadro 600. However, when in linux + Tesle M2075, the maximum error reach 5e-1; also in this case, GPU result is always greater, as mentioned in the earlier nVidia official document about the former SM1.0.
Are you sure the CUDA code is using floating point everywhere, not doubles, even once?
Even inline constants must be written with the “f” suffix to make sure intermediate answers are not computed in doubles.
The Quadro 600 is a very old card and doesn’t support double precision. The M2075 does. If you accidentally have any double promotions in your CUDA code, the M2075 will happily compute at the higher precision you’ve requested and therefore the results will not match the Quadro 600s which will have automatically demoted all math to floats. The fact that the Quadro does match the CPU implies that the CPU code is indeed using floats and not doubles.
It’s just a hypothesis, but it would match your clues.
Yes, I have modified the CPU code to perform the double precision operation, but it still does not match the result from M2075.
This issue has been solved, after replacing the * operation with the built-in function __fmul_rn; so the default round mode of the * operation is round up. Thanks to all.
The default rounding mode should be rn.
Based on message #10 above you are likely observing the effects of contraction of floating-point multiply and dependent floating-point add into an FMAD instruction (on sm_1x architecture) or FFMA instruction (on sm_20 and later architectures). This is done for performance reasons, and in the case of FFMA, for reasons of improved average accuracy. You can turn off FMA/FMAD-merging by passing the compiler flag -fmad=false.
Note that FFMA (single-precision fused multiply-add) will in general provide better accuracy than separate FMUL and FADD instructions, and that only the latest x86 CPUs include a fused multiply-add instruction. This means that single-precision GPU computations using FFMA are often more accurate than equivalent single-precision CPU computations, and results will not match between CPU and GPU. It is also possible that CPU computations may perform some intermediate computation in higher than single precision. You may want to read NVIDIA’s white paper: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
The FMAD instruction of sm_1x GPUs delivers different results from the FFMA instructions in later architectures, so results will typically differ between GPUs with sm_1x and >= sm_20. While the FMAD instruction is a fused operation like FFMA, it differs from the latter in that the intermediate product is truncated towards zero before the final addition that is rounded to nearest-or-even.
By contrast the FFMA instruction retains the full (unrounded) intermediate product prior to the addition, then applies a single rounding to the sum, where the rounding mode can be any of the four IEEE-754 rounding modes; it defaults to round-to-nearest-even. This means single-precision computation on sm_1x GPUs is often less accurate than on later GPU architectures, and may be less accurate than single-precision CPU computation.
To correctly assess the accuracy of GPU and CPU computations relative to each other, you would need to compare to a higher-precision reference. An analogy would be trying to assess the accuracy of two ordinary wrist watches. This is not possible by comparing the time from the two wrist watches to each other, instead both need to be compared to the time from a higher precision reference such as an atomic clock.
In my test environment, the default round mode of the * operation in M2075 is round up, while the x86 CPUs are of round near; is it a bug or do the other type GPUs have not this issue?
Also, if the default setting is using -fmad=true, and FFMA are often more accurate than equivalent single-precision CPU, the test results should not always be greater.
The default rounding mode for FMUL, FADD, and FFMA is round-to-nearest-even. You can easily see this if you inspect the PTX and the SASS (machine code). To get round-to-positive infinity (“up”), you would have to code using intrinsics: __fadd_ru(), __fmul_ru(), __fmaf_ru().
Depending on your Linux platform and your host compiler settings, the floating-point computations in Linux may be performed in x87 registers using extended precision (paseolatis already mentioned this). Windows on the other hand configures the x87 FPU for double precision. In other words the host results would differ between the two platforms. The results from the Quadro 600 (an sm_21 device) and the M0275 (an sm_20 device), should match bit-for-bit as long as the same code, the same tool chain, and the same compiler settings are used (and there are no bugs in the GPU code).