Significant difference in Cgemm results between MKL BLAS and CUBLAS

As part of learning Cuda and how to use it in conjunction with Fortran, I have written Fortran code that randomly creates two square single precision complex matrices. I then have five different approaches to multiplying them:
-Naive, brute force row-column multiplication and addition;
-use the Fortran intrinsic MATMUL;
-use MKL gemm;
-use my own Cuda code;
-use CUBLAS Cgemm (sketched below).
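The last of these boils down to a call like the one in this sketch (not my exact code; the wrapper routine, the use of the plain cublas.h interface with char transpose arguments, and the cuComplex alpha/beta passed by value are assumptions for illustration):

/* Sketch of the CUBLAS part of the comparison: copy the two host matrices
   to the device, call Cgemm, and copy the product back. */
#include <cublas.h>
#include <cuComplex.h>

void gpu_cgemm(const cuComplex *a, const cuComplex *b, cuComplex *c, int n)
{
    cuComplex *dA, *dB, *dC;
    cuComplex one  = make_cuComplex(1.0f, 0.0f);
    cuComplex zero = make_cuComplex(0.0f, 0.0f);

    cublasInit();
    cublasAlloc(n * n, sizeof(cuComplex), (void **)&dA);
    cublasAlloc(n * n, sizeof(cuComplex), (void **)&dB);
    cublasAlloc(n * n, sizeof(cuComplex), (void **)&dC);

    /* CUBLAS, like Fortran, expects column-major storage. */
    cublasSetMatrix(n, n, sizeof(cuComplex), a, n, dA, n);
    cublasSetMatrix(n, n, sizeof(cuComplex), b, n, dB, n);

    /* C = 1.0 * op(A) * op(B) + 0.0 * C, with op = 'N' (no transpose). */
    cublasCgemm('N', 'N', n, n, n, one, dA, n, dB, n, zero, dC, n);

    cublasGetMatrix(n, n, sizeof(cuComplex), dC, n, c, n);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
}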

After computing with each of the above methods, I calculate the norm of the matrix. The first three methods, which are native to Fortran, give me identical answers. The last two methods also give identical answers to each other, but significantly different from the first group.
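The norm is taken over the whole result matrix; for instance, a Frobenius-type norm of an n-by-n single complex result can be computed as in this C sketch (the actual code is Fortran, and the function name is just a placeholder):

/* Sketch: Frobenius-type norm of an n-by-n single complex matrix. */
#include <math.h>
#include <cuComplex.h>

float frob_norm(const cuComplex *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n * n; ++i)
        sum += cuCrealf(a[i]) * cuCrealf(a[i])
             + cuCimagf(a[i]) * cuCimagf(a[i]);
    return sqrtf(sum);
}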

When I print out the resulting matrix for each method, the matrices from the first group are markedly different from those of the second group. I have twiddled the 'N' and 'T' options on the CUBLAS Cgemm, but it doesn't make any difference.

The nVidia card is a C1060, and it runs the MATMUL example correctly. Unfortunately, that is only in single precision.

I would be interested in all comments. If you have used the CUBLAS Cgemm successfully, I would appreciate an opportunity to communicate with you.

Thanks

Malcolm

The first and obvious question to ask is whether your CPU versions were computing results in double precision while your GPU versions were using single precision.

Thanks for responding. All code is running in single precision. The differences between the two sets of results are typically in the double digits, percentage-wise.

Malcolm

Keep in mind that unless you are explicitly compiling to use SSE for floating point, you are still using 80-bit floating point internally on the CPU side until the results are actually moved out of the x87 registers.
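A loop like the following sketch illustrates the point (illustration only, not from the original code; the exact outcome depends on compiler and flags): with x87 code generation the accumulator can stay at extended precision across iterations, whereas an SSE build rounds it back to a 32-bit float on every step.

/* Illustration: the same single precision loop can give different results
   depending on whether the accumulator lives in an 80-bit x87 register or
   is rounded to a 32-bit float on every step, e.g. gcc -O2 -mfpmath=387
   versus gcc -O2 -mfpmath=sse -msse2. */
#include <stdio.h>

int main(void)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < 1000000; ++i)
        sum += 0.1f;   /* where this gets rounded depends on where 'sum' lives */

    printf("sum = %.6f\n", sum);
    return 0;
}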

Just for the fun of it: getting 80-bit behaviour out of -O0 code is a challenge. I tried a reasonable cross product of Intel, Pathscale, PGI, SunStudio and gcc (at least two major versions per compiler, e.g. PGI 8 and 9) on 32- and 64-bit Linux boxes, and found no rule of thumb. Adding gcc's -ffloat-store and its equivalents of course solved the problem and made life consistent, which just confirms my claim that not all compilers behave identically with "default unoptimized options". Ah, and I tried this for my own implementation of saxpy, not Cgemm.

Nonetheless, if a true single precision CPU version (which using MKL more or less implies) and a true single precision GPU version only differ down at round-off level, then so what: that's noise due to parallelism, or more precisely, due to the different order of computation.

How exactly do you define "identical results" between your two sets of results? And, as Tim ventured, are you 100% sure that your CPU computations are true 32-bit single precision?
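In other words, rather than comparing only the norms, it would be worth looking at the element-wise relative difference between the two result matrices, roughly along these lines (a sketch; "cpu" and "gpu" are placeholder names for the two n-by-n results):

/* Sketch: maximum relative element-wise difference between the CPU result
   and the GPU result, both n-by-n single complex matrices. */
#include <math.h>
#include <cuComplex.h>

float max_rel_diff(const cuComplex *cpu, const cuComplex *gpu, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n * n; ++i) {
        float dr  = cuCrealf(cpu[i]) - cuCrealf(gpu[i]);
        float di  = cuCimagf(cpu[i]) - cuCimagf(gpu[i]);
        float mag = cuCabsf(cpu[i]);
        if (mag > 0.0f) {
            float rel = sqrtf(dr * dr + di * di) / mag;
            if (rel > worst)
                worst = rel;
        }
    }
    return worst;
}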

Thanks for responding. The differences I am observing are in the double digits, percentage-wise. So I would be very surprised if the additional precision offered by 80-bit arithmetic could account for such a large difference.

I am baffled by this, and I don’t have any ideas for further numerical experimentation.

One minor uncertainty: I am assuming that the size of cuComplex is 8. Is this correct?
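A quick check on the C side would be something along these lines (just a sketch; cuComplex is declared in cuComplex.h as a pair of 32-bit floats, so I would expect this to print 8, matching a Fortran COMPLEX):

/* Sketch: confirm the size of cuComplex on this platform. */
#include <stdio.h>
#include <cuComplex.h>

int main(void)
{
    printf("sizeof(cuComplex) = %zu\n", sizeof(cuComplex));
    return 0;
}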

Thanks

Malcolm

To further clarify the nature of my findings: all calculations are done in single precision complex arithmetic in both the Fortran and C/Cuda sections of the code. I am using Intel Fortran on an x86 platform running Red Hat Linux.

Typical results are:

Norm (Naive):  36.76490
Norm (MATMUL): 36.76490
Norm (MKL):    36.76490
Norm (Cuda):   25.35245
Norm (CUBLAS): 25.35245

Malcolm

Malcolm,

You are NOT alone. I have seen worse differences with my factorization code (google: LDM^T solver site:http://forums.nvidia.com).

Even between two different CPU implementations… (forget the GPU)…

And when I changed to double precision, I started getting better results…

Quite a baffling experience… it was just unbelievable for me… But I think it is simply the sheer number of floating point operations that is amplifying the errors.

Hi Sarnath. Thanks for your observations. I wish I had posted an example of the numbers in my original post; I did, however, post them yesterday. Please take a look at them. I don't think moving to double precision will resolve the discrepancies.

Malcolm

Malcolm,

I am not sure about this… but single precision arithmetic has some deviations on the GPU – though I am not sure whether that affects Cgemm… You should check it with double precision, mainly because the CPU internally uses more precision (that's what I hear).