CUSPARSE(Ax) != CUBLAS(Ax)

cuThomas · September 23, 2010, 8:37pm

Hello,

while evaluating cusparse and some other sparse matrix libraries we encountered different

results for the following operation:

A * x

The following simple example matrix A (2,2) multiplied with the given vector X demonstrates this problem:

Matrix A:

A[0] = 0.939129173755645752;

A[1] = 0.799645721912384033;

A[2] = 0.814138710498809814;

A[3] = 0.594497263431549072;

(We are using a matrix with no zero values here. The failure also appears while using other matrices with zero values)

Vector X:

x[0] = 0.657200932502746582;

x[1] = 0.995299935340881348;

the result of the two operations:

cublasSgemv (...);

cusparseScsrmv(...);

is:

cuSPARSEcuBLAScompare test running..

--CUBLAS-- result of A*x: [1.427508711814880371, 1.117230892181396484]

-CUSPARSE- result of A*x: [1.427508831024169922, 1.117231011390686035]

While testing the cusp library we encountered similar (mabe the same?) wrong results.

We then programmed the sparse matrix multiplication on the CPU and verified that CUBLAS works as expected.

To verify this we wrote an own sparse matrix multiplication kernel which also confirmed that CUBLAS works as expected.

The failure gets worse when using matrices with higher dimensions.

The observed behavior was tested on a GTX 480 (windows + linux) and 8800GTS (linux)

For everyone to test the behavior i attached a zip file with the code + makefile.

To be able to make the program without any issues its important to have the “CUDA Toolkit 3.2 RC” installed

as well as the “GPU Computing SDK code samples”. If you can make these examples without a problem you copy the

source to “GPU Computing SDK code samples”/C/src/ and then just go to the src/cuSPARSEcuBLAScompare/ directory and initiate make.

(This step is necessary because our makefile uses the nvidia samples common-makefile).

The resulting executable is in the same folder as the other sample executables.

tera · September 23, 2010, 8:42pm

How is the result surprising given that you use float variables?

tera · September 23, 2010, 8:42pm

How is the result surprising given that you use float variables?

cuThomas · September 23, 2010, 8:52pm

Its very suprising cause the failure is at 1e^-7 while using a matrix with 4 elements which isnt

a floating point precision failure anymore.

cuThomas · September 23, 2010, 8:52pm

Its very suprising cause the failure is at 1e^-7 while using a matrix with 4 elements which isnt

a floating point precision failure anymore.

tera · September 23, 2010, 9:17pm

What would be a tolerable level of inaccuracy for single precision, then?

tera · September 23, 2010, 9:17pm

What would be a tolerable level of inaccuracy for single precision, then?

cuThomas · September 23, 2010, 10:37pm

I now think too that the root cause of the difference is the floating point accuracy . After analyzing the cusp spmv multiplication kernel
which is also developed by nvidia? and mabe is used in the cusparse library i think the cause for the difference between cusparse and
cublas is the kernel caching algorithm which looks like it splits the results in more floats before accumulating them to the overall result of one dimension.
This leads in many cases to an enhanced inaccuracy.

cuThomas · September 23, 2010, 10:37pm

I now think too that the root cause of the difference is the floating point accuracy . After analyzing the cusp spmv multiplication kernel
which is also developed by nvidia? and mabe is used in the cusparse library i think the cause for the difference between cusparse and
cublas is the kernel caching algorithm which looks like it splits the results in more floats before accumulating them to the overall result of one dimension.
This leads in many cases to an enhanced inaccuracy.

YDD · September 23, 2010, 11:27pm

An enhanced inaccuracy? Suppose I wish to evaluate a+b+c. Is it more accurate to evaluate it as
( a + b ) + c
or as
a + ( b + c )
?

YDD · September 23, 2010, 11:27pm

An enhanced inaccuracy? Suppose I wish to evaluate a+b+c. Is it more accurate to evaluate it as
( a + b ) + c
or as
a + ( b + c )
?

cuThomas · September 24, 2010, 7:44am

I see your point but i think the problem is of the following nature:

you have two floats and you could sum them up in the following way:

float a = 1.0;

float b = 1.0;

result = a + b;

print result;

2.000000000000

or you could do it in the following way:

float a = 1.0;

float b = 1.0;

for (int i = 0; i < 5000000; i++) {

	a = a + 1.1;

}

for (int i = 0; i < 5000000; i++) {

	a = a - 1.1;

}

result = a + b;

print result;

2.171823024749755859

its your decision which one you would choose. i would choose the first one.

my point in this is, that cusparse Scsrmv is more inaccurate than it could be

cuThomas · September 24, 2010, 7:44am

I see your point but i think the problem is of the following nature:

you have two floats and you could sum them up in the following way:

float a = 1.0;

float b = 1.0;

result = a + b;

print result;

2.000000000000

or you could do it in the following way:

float a = 1.0;

float b = 1.0;

for (int i = 0; i < 5000000; i++) {

	a = a + 1.1;

}

for (int i = 0; i < 5000000; i++) {

	a = a - 1.1;

}

result = a + b;

print result;

2.171823024749755859

its your decision which one you would choose. i would choose the first one.

my point in this is, that cusparse Scsrmv is more inaccurate than it could be

avidday · September 24, 2010, 9:26am

Albert Einstein is reported to have once said that

You might find it useful to study the paper at this link.

avidday · September 24, 2010, 9:26am

Albert Einstein is reported to have once said that

You might find it useful to study the paper at this link.

Cliff_Woolley · September 24, 2010, 8:44pm

As others on this thread have correctly stated, this is a floating-point precision problem. Single-precision floating point generally gets you about 6-7 decimal digits of precision at most, so what you’re seeing is exactly what we’d expect.

Here’s why:

Single-precision floating point has 24 significant bits, of which 23 are explicitly stored (plus 8 bits of exponent and 1 sign bit). 2^24 = 16777216 possible representable values (for a particular exponent) at most; i.e., absolutely no more than 8 (but probably more like 6 or 7 after a lot of accumulated arithmetic rounding errors) significant decimal digits.

Let’s look at what that means in terms of one of your pairs of example output numbers in their IEEE-754 single-precision representations:
1.427508711814880371 = 0x3fb6b89b
1.427508831024169922 = 0x3fb6b89c

I.e., these two numbers only differ by one least significant bit.

Specifically:
Both have a sign bit of 0 [positive]
Both have an exponent of 01111111 [i.e., 0 after subtracting the bias of 127, meaning the significand is multiplied by 2^0 = 1]
The significand of the first is B6B89B ~= 1.4275087; the significand of the second is B6B89C ~= 1.4275088

Hope this helps.
–Cliff

Cliff_Woolley · September 24, 2010, 8:44pm

As others on this thread have correctly stated, this is a floating-point precision problem. Single-precision floating point generally gets you about 6-7 decimal digits of precision at most, so what you’re seeing is exactly what we’d expect.

Here’s why:

Single-precision floating point has 24 significant bits, of which 23 are explicitly stored (plus 8 bits of exponent and 1 sign bit). 2^24 = 16777216 possible representable values (for a particular exponent) at most; i.e., absolutely no more than 8 (but probably more like 6 or 7 after a lot of accumulated arithmetic rounding errors) significant decimal digits.

Let’s look at what that means in terms of one of your pairs of example output numbers in their IEEE-754 single-precision representations:
1.427508711814880371 = 0x3fb6b89b
1.427508831024169922 = 0x3fb6b89c

I.e., these two numbers only differ by one least significant bit.

Specifically:
Both have a sign bit of 0 [positive]
Both have an exponent of 01111111 [i.e., 0 after subtracting the bias of 127, meaning the significand is multiplied by 2^0 = 1]
The significand of the first is B6B89B ~= 1.4275087; the significand of the second is B6B89C ~= 1.4275088

Hope this helps.
–Cliff

Cliff_Woolley · September 24, 2010, 9:00pm

You may be surprised to note here that the value of a after this loop is nowhere close to the 5500001.0 that you might expect: it’s actually ~5220503.0. The number 1.1 is not exactly representable in floating point. The result of adding a small (approximated) value to a large one in floating point is that you get very big relative roundoff errors. In your example, those errors are very significant. That’s just the way it is with floating point.

But as I pointed out previously, CUSPARSE hasn’t made this mistake; it’s only 1 LSB of difference for your original comparison.

–Cliff

Cliff_Woolley · September 24, 2010, 9:00pm

You may be surprised to note here that the value of a after this loop is nowhere close to the 5500001.0 that you might expect: it’s actually ~5220503.0. The number 1.1 is not exactly representable in floating point. The result of adding a small (approximated) value to a large one in floating point is that you get very big relative roundoff errors. In your example, those errors are very significant. That’s just the way it is with floating point.

But as I pointed out previously, CUSPARSE hasn’t made this mistake; it’s only 1 LSB of difference for your original comparison.

–Cliff

Topic		Replies	Views
low cublas single precision accuracy cublasDgemv <-> cublasSgemv CUDA Programming and Performance	3	1874	July 23, 2008
CUBLAS floating point precision CUDA Programming and Performance	0	868	June 18, 2009
Question regarding Precision Issues in BLAS CUDA Programming and Performance	9	8562	March 4, 2010
Sparse Matrix-Vector Multiplication on CUDA CUDA Programming and Performance	79	313739	November 22, 2010
cuSPARSE matrix vector multiplication non deterministic output GPU-Accelerated Libraries	1	1047	March 31, 2016
Significant difference in results between MKL-BLAS & CUBLAS different results in Cgemm CUDA Programming and Performance	9	4998	August 31, 2009
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8327	September 9, 2011
CUSPARSE conversion routines not working... cusparseSnnz and cusparseSdense2csr misbehaving... CUDA Programming and Performance	11	4200	February 28, 2011
float asssociative Debugging error CUDA Programming and Performance	10	2263	April 12, 2010
Wrong result sparse matrix-vector multiplication Legacy PGI Compilers	5	3183	August 29, 2018

CUSPARSE(A*x) != CUBLAS(A*x)

Related topics

CUSPARSE(Ax) != CUBLAS(Ax)