# GPU DGEMM

We are doing double precision Matrix Matrix Multiplication on Tesla C2050 using CUBLAS 3.2
and the same computation on CPU (Intel Xeon X5450 Dual Socket Quad Core System) using
Intel MKL 10.2 for comparing the results of GPU against the CPU. But we are getting the
divergence in the GPU results against the CPU.

We could see only correct results till ten decimal points. and after that there is deviation.

We are using the drand48() function to generate the Input Matrices A,B.

So you are getting absolute errors of about 10^-11? What are the relative errors?

Suppose you do C_gpu := \alpha * A * B + \beta * C
A is nxn and B is nxn and C is nxn

then define Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

where |A| means absolute value of each component of A

Then compute
D = |C_gpu - C_cpu|./Y
where ./ is componentwise division.

check if D(i,j) < 2*eps for all i,j

where eps = machine zero, it is 1.E-16 in double precision.

I am getting relative error 10^-15 …

hi…

Thanks for the reply … I’ll try with this…

That is getting pretty close to the double precision epsilon value, which is 2^-53, or about 2.22e-16…

I need your sugesstion … I would like to keep one threshold value, for coming to the conculsion that

the computation done on GPU is correct . Means the GPU DGEMM test is passed. What is your sugesstion

for that how much precision value I should keep?

Thanks

I would like to add a word of caution about establishing the correctness of GPU results simply by comparing to CPU results computed at the same precision. Obviously there will be a certain amount of error in CPU results as well, and a simple comparison of the GPU and CPU results does not establish how much of the total difference is to be attributed to the error of each platform.

I have handled several reports of “incorrect” GPU results where it turned out that the fairly large differences between CPU and GPU were due to accumulated error on the CPU side, which was larger than the error on the GPU side. I found that most such scenarios could be traced back to two mechanisms:

(1) The use of FMA (fused multiply-add) on the GPU. This reduces overall rounding error and can mitigate effects of subtractive cancellation.
(2) The use of summing via tree-like reduction on the GPU which has a tendency to add quantities of similar magnitude in each step.

I consider the comparison with a high-precision reference (all intermediate computation is performed in double-double, or with a multiple precision library) the final arbiter as to which set of results is the more accurate one, and for establishing the actual error for a given platform.

I have to agree. Albert Einstein apparently once said that “a man with one watch always knows the correct time, but a man with two watches is never sure”, and it applies here too. The underlying assumption that something like MKL should be the final arbiter of whether a GPU result is good or not is generally not a great idea. As Norbert points out, GPU operations like GEMM can actually wind up being more accurate than equivalent calculations done on the CPU, because of fused multiply adds and the structure of the algorithms, which tends to suffer less from truncation error during summation.

I would suggest using something like the 106 bit “double double” gemm routine in XBLAS to compute some reference solutions and compute relative errors of both your MKL and CUBLAS solution using the formula Lung Sheng Chien suggested. That will give a more useful indication of what the comparative accuracy of CUBLAS and MKL is.

Thanks to both of you … I am strongly agree with both of you. I’ll use XBLAS and then compute the camparitive accuracy…

Thanks for the above code segment …Will you please

let me know where can I find the explanation of

how the below formula is derived…

Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

Hi …

I’ve a query … I was using MKL-10.3 which compatible to IEEE-754 2008 standard and

the Device with compute capability 2.x is also compatible to IEEE standard and the

nvcc 3.2 default generate the IEEE compalaint code.

If I calculate the convergence between CPU and GPU resutls using the Lung Sheng Chien

sugesstions , it should be very min < 1E-16 and in this case can I consider the

test is passed …

You can easily write two programs that run on the same (IEEE 754 compliant) CPU yet give different results, as rounding of results may depend on the order of operations.
Since parallel execution almost by definition involves reordering of operations (unless you write the sequential code to exactly match the parallel version), you really can’t expect the same results.

And yes, every single rounded operation should result in a relative error of about 1e-16 max (unless denormals are involved). However, errors can accumulate when operating on larger matrices.

The design of floating point is such that fractional error grows predictably in multiplication and division (up to the limits tera mention), but addition and subtraction instead grow the absolute error in small increments. When the size of the absolute error becomes comparable to your answer, then you suddenly find you have an unexpectedly large fractional error.

Here’s a simple example that you can play with in any Python interpreter (which use double precision for all floating point operations):

>>> a = 1.0 + 2 * 10**-15

>>> b = 1.0 + 10**-15

>>> c = a - b

>>> print "Fractional error:", 1.0 - c/10**-15

Fractional error: 0.1118215803


So with only a handful of operations, I’ve produced an 11% error by doing the worst thing possible: adding numbers which have very different magnitudes followed by subtracting numbers which are nearly identical.

You can find accumulation error of dot product c = <x,y> in any book of numerical analysis.

The constant 1.01 depends on estimation technique.