We are doing double-precision matrix-matrix multiplication on a Tesla C2050 using CUBLAS 3.2,
and the same computation on the CPU (a dual-socket, quad-core Intel Xeon X5450 system) using
Intel MKL 10.2, to compare the GPU results against the CPU. But we are seeing divergence
between the GPU and CPU results.

The results agree only to about ten decimal places; after that they deviate.

Could you please help us decide whether this divergence is acceptable?

We are using the drand48() function to generate the input matrices A and B.

So you are getting absolute errors of about 10^-11? What are the relative errors?

Suppose you compute C_gpu := \alpha * A * B + \beta * C,
where A, B, and C are all n x n.

then define Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

where |A| means absolute value of each component of A

Then compute
D = |C_gpu - C_cpu|./Y
where ./ is componentwise division.

check if D(i,j) < 2*eps for all i,j

where eps is the unit roundoff, 2^-53, about 1.1e-16 in double precision.
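The check above can be sketched in a few lines of NumPy. This is my own illustration, not code from the thread: the CUBLAS and MKL results are simulated here with a tiny perturbation so the example is self-contained.

```python
# Componentwise acceptance test for DGEMM results, following the
# formula Y := 1.01*(n+1)*|alpha|*|A|*|B| + |beta|*|C| and the
# criterion D = |C_gpu - C_cpu| ./ Y < 2*eps for every entry.
import numpy as np

def gemm_error_check(A, B, C, C_gpu, C_cpu, alpha=1.0, beta=1.0):
    """Return True if every entry of |C_gpu - C_cpu| ./ Y is below 2*eps."""
    n = A.shape[0]
    Y = 1.01 * (n + 1) * abs(alpha) * (np.abs(A) @ np.abs(B)) + abs(beta) * np.abs(C)
    D = np.abs(C_gpu - C_cpu) / Y          # componentwise division
    eps = np.finfo(np.float64).eps         # 2^-52, about 2.22e-16
    return bool(np.all(D < 2 * eps))

n = 64
rng = np.random.default_rng(0)
A, B, C = rng.random((n, n)), rng.random((n, n)), rng.random((n, n))
C_cpu = A @ B + C                          # stand-in for the MKL result
C_gpu = C_cpu * (1 + 1e-16)                # stand-in for the CUBLAS result
print(gemm_error_check(A, B, C, C_gpu, C_cpu))
```

In a real test, C_cpu and C_gpu would come from MKL's and CUBLAS's dgemm on the same inputs.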

I am getting a relative error of about 10^-15 …


Thanks for the reply … I’ll try with this…

That is getting pretty close to the double-precision machine epsilon value, which is 2^-52, or about 2.22e-16…

I need your suggestion … I would like to fix one threshold value for concluding that

the computation done on the GPU is correct, i.e., that the GPU DGEMM test has passed.

What threshold would you suggest?


I would like to add a word of caution about establishing the correctness of GPU results simply by comparing to CPU results computed at the same precision. Obviously there will be a certain amount of error in CPU results as well, and a simple comparison of the GPU and CPU results does not establish how much of the total difference is to be attributed to the error of each platform.

I have handled several reports of “incorrect” GPU results where it turned out that the fairly large differences between CPU and GPU were due to accumulated error on the CPU side, which was larger than the error on the GPU side. I found that most such scenarios could be traced back to two mechanisms:

(1) The use of FMA (fused multiply-add) on the GPU. This reduces overall rounding error and can mitigate effects of subtractive cancellation.
(2) The use of summation via tree-like reduction on the GPU, which tends to add quantities of similar magnitude in each step.
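Mechanism (2) is easy to demonstrate. The following is a minimal sketch of my own (not from the thread): a recursive pairwise sum mimics a GPU-style tree reduction, and math.fsum provides a correctly rounded reference to measure both variants against.

```python
# Tree-like (pairwise) reduction vs. left-to-right sequential summation.
# Pairwise summation adds quantities of similar magnitude at each level,
# so its rounding error grows much more slowly than the sequential sum's.
import math
import random

def pairwise_sum(xs):
    """Sum by recursive halving, mimicking a GPU-style tree reduction."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

random.seed(42)
xs = [random.uniform(0.0, 1.0) for _ in range(2**16)]

exact = math.fsum(xs)                      # correctly rounded reference
seq_err = abs(sum(xs) - exact)             # sequential (CPU-style) sum
tree_err = abs(pairwise_sum(xs) - exact)   # tree (GPU-style) reduction

print("sequential error:", seq_err, " tree error:", tree_err)
```

On typical inputs the tree reduction's error is one to two orders of magnitude smaller than the sequential sum's.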

I consider the comparison with a high-precision reference (all intermediate computation is performed in double-double, or with a multiple precision library) the final arbiter as to which set of results is the more accurate one, and for establishing the actual error for a given platform.

I have to agree. Albert Einstein apparently once said that “a man with one watch always knows the correct time, but a man with two watches is never sure”, and it applies here too. The underlying assumption that something like MKL should be the final arbiter of whether a GPU result is good or not is generally not a great idea. As Norbert points out, GPU operations like GEMM can actually wind up being more accurate than the equivalent calculations done on the CPU, because of fused multiply-adds and the structure of the algorithms, which tend to suffer less from truncation error during summation.

I would suggest using something like the 106-bit “double-double” gemm routine in XBLAS to compute reference solutions, and computing the relative errors of both your MKL and CUBLAS solutions using the formula Lung Sheng Chien suggested. That will give a more useful indication of the comparative accuracy of CUBLAS and MKL.
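To illustrate the high-precision-reference idea without pulling in XBLAS, here is a sketch of my own using Python's exact rational Fraction type as a stand-in for the double-double arithmetic. The exact dot product is rounded only once at the end, so it can serve as the arbiter against which both results are measured.

```python
# High-precision reference for a dot product: compute exactly with
# rational arithmetic, round once at the end, then measure how far a
# naive double-precision evaluation (one rounding per operation) drifts.
from fractions import Fraction
import random

def exact_dot(x, y):
    """Dot product in exact rational arithmetic, rounded once to double."""
    return float(sum(Fraction(a) * Fraction(b) for a, b in zip(x, y)))

random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(10000)]
y = [random.uniform(-1.0, 1.0) for _ in range(10000)]

ref = exact_dot(x, y)                      # high-precision reference
naive = sum(a * b for a, b in zip(x, y))   # rounds after every operation
print("error vs. exact reference:", abs(naive - ref))
```

A real XBLAS-based test would do the same for every entry of the GEMM result, with the double-double routine playing the role of exact_dot.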

Thanks to both of you … I strongly agree with both of you. I’ll use XBLAS and then compute the comparative accuracy…

Thanks for the above code segment … Will you please

let me know where I can find an explanation of

how the formula below is derived…

Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

Hi …

I have a query … I was using MKL 10.3, which is compliant with the IEEE 754-2008 standard;

devices with compute capability 2.x are also IEEE compliant, and

nvcc 3.2 generates IEEE-compliant code by default.

If I calculate the divergence between the CPU and GPU results using Lung Sheng Chien’s

suggestion, should it then be very small (< 1e-16)? And in that case, can I consider

the test passed?

Let me know your views

You can easily write two programs that run on the same (IEEE 754 compliant) CPU yet give different results, as rounding of results may depend on the order of operations.
Since parallel execution almost by definition involves reordering of operations (unless you write the sequential code to exactly match the parallel version), you really can’t expect the same results.
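Floating-point addition is not associative, so even a trivial reordering changes the rounded result:

```python
# Reordering a three-term sum changes the result in double precision.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # → False
print((a + b) + c, a + (b + c))     # the two orderings differ in the last bit
```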

And yes, every single rounded operation results in a relative error of at most about 1e-16 (unless denormals are involved). However, these errors can accumulate when operating on larger matrices.

The design of floating point is such that fractional (relative) error grows predictably in multiplication and division (up to the limits tera mentions), but addition and subtraction instead grow the absolute error in small increments. When the accumulated absolute error becomes comparable to your answer, you suddenly find you have an unexpectedly large fractional error.

Here’s a simple example that you can play with in any Python interpreter (which uses double precision for all floating-point operations):

>>> a = 1.0 + 2 * 10**-15
>>> b = 1.0 + 10**-15
>>> c = a - b
>>> print "Fractional error:", 1.0 - c/10**-15
Fractional error: 0.1118215803

So with only a handful of operations, I’ve produced an 11% error by doing the worst thing possible: adding numbers which have very different magnitudes followed by subtracting numbers which are nearly identical.

You can find the accumulation error of the dot product c = <x,y> in any book on numerical analysis.

The constant 1.01 depends on the estimation technique.
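For reference, here is a sketch of where the constant comes from, following the standard forward-error analysis of inner products (see e.g. Higham, "Accuracy and Stability of Numerical Algorithms", ch. 3); this is my summary of the textbook result, not necessarily the exact derivation behind the formula above:

```latex
% Each entry of A*B is an inner product of length n. With unit roundoff u,
% the standard bound on its computed value is
\[
  \bigl|\,\mathrm{fl}(x^T y) - x^T y\,\bigr| \;\le\; \gamma_n\, |x|^T |y|,
  \qquad
  \gamma_n = \frac{n\,u}{1 - n\,u},
\]
% and whenever $n u \le 0.01$ one has $\gamma_n \le 1.01\, n\, u$,
% which is the origin of the 1.01 factor. Folding in the scaling by
% $\alpha$ and the update by $\beta C$ (roughly one extra rounding,
% hence $n+1$) gives, entrywise,
\[
  \bigl|C_{\mathrm{computed}} - C_{\mathrm{exact}}\bigr|
  \;\le\; u \,\Bigl( 1.01\,(n+1)\, |\alpha|\, |A|\,|B| + |\beta|\,|C| \Bigr)
  \;=\; u\, Y .
\]
% Both the CPU and the GPU result satisfy this bound individually, so
% their difference satisfies |C_gpu - C_cpu| <= 2 u Y, i.e. D < 2*eps.
```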