GPU DGEMM

beta1 · January 31, 2011, 11:52am

We are doing double-precision matrix-matrix multiplication on a Tesla C2050 using CUBLAS 3.2,
and the same computation on the CPU (a dual-socket, quad-core Intel Xeon X5450 system) using
Intel MKL 10.2, in order to compare the GPU results against the CPU results. However, the
GPU results diverge from the CPU results.

The results agree only to about ten decimal places; beyond that they deviate.

Could you please help us decide whether the computation is acceptable given this divergence?

We are using the drand48() function to generate the input matrices A and B.

avidday · January 31, 2011, 1:07pm

So you are getting absolute errors of about 10^-11? What are the relative errors?

LSChien · January 31, 2011, 3:00pm

Suppose you compute C_gpu := \alpha * A * B + \beta * C,
where A, B and C are all n x n.

Then define Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

where |A| means the absolute value of each component of A.

Then compute
D = |C_gpu - C_cpu|./Y
where ./ is componentwise division.

Check whether D(i,j) < 2*eps for all i,j

where eps is the machine epsilon (unit roundoff), about 1.E-16 in double precision.
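
As a rough illustration of the check above, here is a minimal pure-Python sketch. The helper names (`matmul`, `max_scaled_difference`, `gemm_results_agree`) are made up for this example, and `U = 2**-53` is an assumed value for the "eps" of the post, which quotes it only as roughly 1e-16. A real test would run on the arrays returned by CUBLAS and MKL rather than on Python lists.

```python
import math

U = 2.0 ** -53  # assumed "eps": unit roundoff of IEEE double precision


def matmul(A, B):
    """Plain product of two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for row in A]


def max_scaled_difference(alpha, A, B, beta, C, C_gpu, C_cpu):
    """Largest entry of D = |C_gpu - C_cpu| ./ Y, with Y as written in the post."""
    n = len(A)
    abs_A = [[abs(v) for v in row] for row in A]
    abs_B = [[abs(v) for v in row] for row in B]
    abs_AB = matmul(abs_A, abs_B)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            y = (1.01 * (n + 1) * abs(alpha) * abs_AB[i][j]
                 + abs(beta) * abs(C[i][j]))
            diff = abs(C_gpu[i][j] - C_cpu[i][j])
            if y == 0.0:
                # No scale to compare against: only an exact match passes.
                d = 0.0 if diff == 0.0 else math.inf
            else:
                d = diff / y
            worst = max(worst, d)
    return worst


def gemm_results_agree(alpha, A, B, beta, C, C_gpu, C_cpu):
    """True if every D(i,j) is below 2*eps."""
    return max_scaled_difference(alpha, A, B, beta, C, C_gpu, C_cpu) < 2 * U
```

Here `C` is the matrix as it was before the update, and `C_gpu` and `C_cpu` are the two results being compared.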

beta1 · February 1, 2011, 4:52am

I am getting a relative error of about 10^-15 …

beta1 · February 1, 2011, 4:53am

hi…

Thanks for the reply … I’ll try with this…

avidday · February 1, 2011, 6:41am

That is getting pretty close to the double precision epsilon value, which is 2^-52, or about 2.22e-16…
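
The constants involved can be checked from Python's standard library. In this small sketch, `sys.float_info.epsilon` is the gap between 1.0 and the next larger double, and half of it is the unit roundoff, that is, the largest relative error of a single correctly rounded operation. The variable names are just for illustration.

```python
import sys

eps = sys.float_info.epsilon  # gap between 1.0 and the next double (2**-52)
u = eps / 2                   # unit roundoff for round-to-nearest (2**-53)

print("eps == 2**-52:", eps == 2.0 ** -52)
print("u   == 2**-53:", u == 2.0 ** -53)

# Adding eps to 1.0 is visible; adding u is rounded away again.
print("1.0 + eps > 1.0 :", 1.0 + eps > 1.0)
print("1.0 + u  == 1.0 :", 1.0 + u == 1.0)
```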

beta1 · February 1, 2011, 8:17am

I need your suggestion … I would like to set a threshold value for concluding that

the computation done on the GPU is correct, meaning that the GPU DGEMM test has passed. What threshold

would you suggest, and how much precision should I require?

Thanks

njuffa · February 1, 2011, 10:04am

I would like to add a word of caution about establishing the correctness of GPU results simply by comparing to CPU results computed at the same precision. Obviously there will be a certain amount of error in CPU results as well, and a simple comparison of the GPU and CPU results does not establish how much of the total difference is to be attributed to the error of each platform.

I have handled several reports of “incorrect” GPU results where it turned out that the fairly large differences between CPU and GPU were due to accumulated error on the CPU side, which was larger than the error on the GPU side. I found that most such scenarios could be traced back to two mechanisms:

(1) The use of FMA (fused multiply-add) on the GPU. This reduces overall rounding error and can mitigate effects of subtractive cancellation.
(2) The use of summing via tree-like reduction on the GPU which has a tendency to add quantities of similar magnitude in each step.

I consider the comparison with a high-precision reference (all intermediate computation is performed in double-double, or with a multiple precision library) the final arbiter as to which set of results is the more accurate one, and for establishing the actual error for a given platform.
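
The effect of summation order, and the idea of judging both results against a higher-precision reference, can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, `fractions.Fraction` stands in for a multiple-precision library, and the input is chosen so that the tree-like sum happens to be exact (every step adds two equal numbers), which is a deliberately favourable case and not a general guarantee.

```python
from fractions import Fraction


def sequential_sum(values):
    """Left-to-right accumulation, as in a simple CPU loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Tree-like reduction: split in half, sum each half, add the two results."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


def abs_error(computed, values):
    """Exact absolute error of `computed` against a rational-arithmetic sum."""
    exact = sum((Fraction(v) for v in values), Fraction(0))
    return abs(Fraction(computed) - exact)


values = [0.1] * 1024
err_sequential = abs_error(sequential_sum(values), values)
err_pairwise = abs_error(pairwise_sum(values), values)
print("sequential error:", float(err_sequential))
print("pairwise error:  ", float(err_pairwise))
```

Both errors are measured against the same exact reference, so neither floating-point result is treated as the ground truth.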

avidday · February 1, 2011, 12:12pm

I have to agree. Albert Einstein apparently once said that “a man with one watch always knows the correct time, but a man with two watches is never sure”, and it applies here too. The underlying assumption that something like MKL should be the final arbiter of whether a GPU result is good or not is generally not a great idea. As Norbert points out, GPU operations like GEMM can actually wind up being more accurate than equivalent calculations done on the CPU, because of fused multiply adds and the structure of the algorithms, which tends to suffer less from truncation error during summation.

I would suggest using something like the 106 bit “double double” gemm routine in XBLAS to compute some reference solutions and compute relative errors of both your MKL and CUBLAS solution using the formula Lung Sheng Chien suggested. That will give a more useful indication of what the comparative accuracy of CUBLAS and MKL is.
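
To give a feel for what extra-precision accumulation buys, here is a small Python sketch of a compensated dot product built from error-free transformations. It is not the XBLAS routine and makes no claim to match its interface; the function names are hypothetical, and the "exact" statements in the comments assume no overflow or underflow.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def split(a):
    """Veltkamp split of a double into high and low halves, a == hi + lo."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and a * b == p + e exactly."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e


def compensated_dot(x, y):
    """Dot product that carries the rounding errors along in a second double."""
    p, s = two_prod(x[0], y[0])
    for xi, yi in zip(x[1:], y[1:]):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r
    return p + s


def naive_dot(x, y):
    """Ordinary left-to-right dot product in working precision."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc


x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print("naive:      ", naive_dot(x, y))
print("compensated:", compensated_dot(x, y))
```

In this example the true dot product is 1, the naive loop loses the 1.0 when it is added to 1e16, and the compensated version recovers it.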

beta1 · February 2, 2011, 4:47am

Thanks to both of you … I strongly agree with both of you. I’ll use XBLAS and then compute the comparative accuracy…

beta1 · February 3, 2011, 12:59pm

Thanks for the above code segment … Could you please

let me know where I can find an explanation of

how the formula below is derived…

Y := 1.01*(n+1)*|\alpha| * |A| * |B| + |\beta| * |C|

…

beta1 · February 3, 2011, 1:10pm

Hi …

I have a query … I was using MKL 10.3, which is compliant with the IEEE 754-2008 standard.

Devices with compute capability 2.x are also IEEE compliant, and

nvcc 3.2 generates IEEE-compliant code by default.

If I compute the difference between the CPU and GPU results using Lung Sheng Chien’s

suggestion, it should be very small (< 1E-16). In that case, can I consider the

test passed …

Let me know your views.

tera · February 3, 2011, 3:25pm

You can easily write two programs that run on the same (IEEE 754 compliant) CPU yet give different results, as rounding of results may depend on the order of operations.
Since parallel execution almost by definition involves reordering of operations (unless you write the sequential code to exactly match the parallel version), you really can’t expect the same results.

And yes, every single rounded operation should result in a relative error of about 1e-16 max (unless denormals are involved). However, errors can accumulate when operating on larger matrices.
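
A three-term sum is already enough to see the dependence on operation order. This is a minimal Python sketch on an ordinary IEEE 754 double-precision machine; the variable names are only for illustration.

```python
# Same three numbers, same IEEE 754 arithmetic, different grouping.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print("results equal:", left_to_right == right_to_left)
print("difference:   ", left_to_right - right_to_left)
```

The two groupings differ by one unit in the last place, which is the kind of discrepancy a reordered parallel reduction produces in every output element.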

seibert · February 4, 2011, 1:57pm

The design of floating point is such that the fractional error grows predictably in multiplication and division (up to the limits tera mentions), whereas addition and subtraction grow the absolute error in small increments. When the size of the absolute error becomes comparable to your answer, you suddenly find you have an unexpectedly large fractional error.

Here’s a simple example that you can play with in any Python interpreter (which uses double precision for all floating point operations):

>>> a = 1.0 + 2 * 10**-15
>>> b = 1.0 + 10**-15
>>> c = a - b
>>> print("Fractional error: %.10f" % (1.0 - c/10**-15))
Fractional error: 0.1118215803

So with only a handful of operations, I’ve produced an 11% error by doing the worst thing possible: adding numbers which have very different magnitudes followed by subtracting numbers which are nearly identical.

LSChien · February 6, 2011, 1:44pm

You can find the accumulated error of a dot product c = <x,y> in any numerical analysis textbook.

The constant 1.01 depends on the estimation technique.
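
For reference, the textbook bound being alluded to can be sketched as follows, with u denoting the unit roundoff (2^-53 in double precision) and assuming no overflow or underflow. The thread does not spell out how the factor (n+1) and the threshold 2*eps in the earlier formula follow from it, so the closing remark is only a plausible reading.

```latex
% Rounding-error bound for an n-term dot product evaluated in floating point
\[
  \bigl|\operatorname{fl}(x^{T}y) - x^{T}y\bigr|
  \;\le\; \bigl((1+u)^{n} - 1\bigr)\,|x|^{T}|y| .
\]
% If n u <= 0.01, the factor can be simplified:
\[
  (1+u)^{n} - 1 \;\le\; e^{nu} - 1 \;\le\; 1.01\,n u
  \qquad (n u \le 0.01),
\]
% so that
\[
  \bigl|\operatorname{fl}(x^{T}y) - x^{T}y\bigr|
  \;\le\; 1.01\,n u\,|x|^{T}|y| .
\]
% Each entry (AB)_{ij} is the dot product of row i of A with column j of B, hence
\[
  \bigl|\operatorname{fl}\bigl((AB)_{ij}\bigr) - (AB)_{ij}\bigr|
  \;\le\; 1.01\,n u\,\bigl(|A|\,|B|\bigr)_{ij} .
\]
```

This is where the componentwise product |A| * |B| and the constant 1.01 in Y come from; the extra operations for the scaling by \alpha and the addition of \beta * C presumably account for using n+1 rather than n.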
