Hi, I wrote a small program about four matrix multiplication, using floating point. When I ran the program on a Tesla C1060, the results matched the CPU results. However, when I ran it on a C2050, there were mismatches compared with the CPU results. Why can the same program give different results? Isn't the C2050 supposed to be more accurate than the C1060?
Does anybody have a clue?
There are a few operations that can give different results on different hardware; see Section 7.4, "Relative Error as ULPs", in the specification. These are primarily trigonometric functions, exponentials, and the like. Both addition and multiplication, the workhorses of a matrix multiplication, are listed as "Correctly rounded". However, when dealing with floating-point numbers one should not assume that the same sequence of operations produces the same value on different hardware, or that two mathematically equivalent sequences of operations produce the same value even on the same hardware.
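To illustrate the last point, here is a minimal sketch in plain C (host code, not a GPU kernel). The function names `sum_left` and `sum_right` are made up for this example. Both compute a + b + c, but in a different order, and every individual addition is correctly rounded.

```c
#include <stdio.h>

/* Computes (a + b) + c. The volatile temporary forces the intermediate
   sum to be rounded to single precision before the second addition. */
float sum_left(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

/* Computes a + (b + c): mathematically the same sum, different order. */
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}

/* Prints both results so the difference is visible. */
void show_order_dependence(void) {
    printf("(a + b) + c = %g\n", sum_left(1e8f, -1e8f, 1.0f));
    printf("a + (b + c) = %g\n", sum_right(1e8f, -1e8f, 1.0f));
}
```

With a = 1e8, b = -1e8 and c = 1, the first order gives 1 and the second gives 0, because -1e8 + 1 is not representable in single precision and rounds back to -1e8. A matrix multiply that accumulates its dot products in a different order (for example a parallel GPU reduction versus a sequential CPU loop) can differ for the same reason.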
How much do the results differ? For 32-bit floating point, if the difference shows up only in about the seventh significant digit of the value printed in scientific notation (#.######E##), then I would say that it is close enough.
If the matrix multiplication in question uses single-precision data, the differences are likely due to improvements made in Fermi-class GPUs to the single-precision multiply-add capability. Pre-Fermi hardware used a multiply-add (FMAD) that truncated the intermediate product to single precision before the final add. Fermi hardware uses a fused multiply-add (FMA) as specified by IEEE-754 (2008). The intermediate product is neither rounded nor truncated; instead, all bits of the product enter into the final addition, so the two floating-point operations are performed with a single rounding.
On average, the new single-precision FMA operation will improve the accuracy of computations compared with the previously used FMAD. You may find the following whitepaper useful; it describes this and other issues affecting floating-point computations on GPUs: