As part of learning CUDA and how to use it in conjunction with Fortran, I have written Fortran code that randomly generates two square single-precision complex matrices. I then multiply them with five different approaches:
-Naive, brute force row-column multiplication and addition;
-use the Fortran intrinsic MATMUL;
-use MKL gemm;
-use my own CUDA code;
-use CUBLAS Cgemm (a rough sketch of the call is given below).
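For reference, the CUBLAS part boils down to something like the following sketch (legacy C API, column-major data as it comes from Fortran; d_A, d_B, d_C and N are placeholder names rather than my actual variables):

/* Sketch of a single-precision complex GEMM through the legacy CUBLAS API.
   Assumes cublasInit() has already been called and that d_A, d_B, d_C are
   device pointers to N x N matrices stored column-major (Fortran order). */
#include <cublas.h>
#include <cuComplex.h>

void gpu_cgemm(const cuComplex *d_A, const cuComplex *d_B,
               cuComplex *d_C, int N)
{
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    /* 'N','N': CUBLAS, like Fortran, assumes column-major storage, so no
       transposition should be needed for data passed straight from Fortran. */
    cublasCgemm('N', 'N', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}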
After computing the product with each of the above methods, I calculate the norm of the resulting matrix. The first three methods, which are native to Fortran, give me identical answers. The last two methods also give answers identical to each other, but significantly different from those of the first group.
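To quantify "significantly different", I compare the results through the relative norm of their difference, roughly like the sketch below (cpuC and gpuC are placeholder names for host copies of two result matrices, not my actual variables):

#include <math.h>
#include <cuComplex.h>

/* Relative Frobenius-norm difference between two N x N single-complex results.
   Accumulation is done in double so the comparison itself adds little noise. */
float rel_diff(const cuComplex *cpuC, const cuComplex *gpuC, int N)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < N * N; ++i) {
        double dr = cuCrealf(cpuC[i]) - cuCrealf(gpuC[i]);
        double di = cuCimagf(cpuC[i]) - cuCimagf(gpuC[i]);
        num += dr * dr + di * di;
        den += cuCrealf(cpuC[i]) * cuCrealf(cpuC[i])
             + cuCimagf(cpuC[i]) * cuCimagf(cpuC[i]);
    }
    /* A value far above single-precision epsilon (~1e-7) times a modest
       factor points to a real discrepancy rather than rounding noise. */
    return (float)(sqrt(num) / sqrt(den));
}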
When I print out the resulting matrices, the output from the first group is markedly different from that of the second group. I have twiddled the 'N' and 'T' options on the CUBLAS Cgemm call, but it doesn't make any difference.
The NVIDIA card is a Tesla C1060, and it runs the MATMUL example correctly. Unfortunately, that example is only in single precision.
I would be interested in any comments. If you have used CUBLAS Cgemm successfully, I would appreciate an opportunity to communicate with you.
The first and obvious question: were the CPU versions computing results in double precision while your GPU versions were using single precision?
Thanks for responding. All code is running in single precision. The differences between the two sets of results are typically in the double digits, percentage-wise.
Keep in mind that unless you are explicitly compiling to use SSE for floating point, you are still using 80-bit floating point internally on the CPU side until the compiler actually moves results out of the x87 registers.
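One quick way to see what a given build is doing is to look at FLT_EVAL_METHOD from <float.h>; this is just a small diagnostic sketch, not part of the original code:

#include <stdio.h>
#include <float.h>

/* FLT_EVAL_METHOD (C99): 0 = float expressions evaluated in float precision
   (typical with SSE), 2 = evaluated in long double, i.e. x87 80-bit
   intermediates; -1 = indeterminable. */
int main(void)
{
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}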
Just for the fun of it: getting 80-bit behaviour out of -O0 code is a challenge. I tried a reasonable cross product of Intel, Pathscale, PGI, SunStudio and gcc (at least two major versions per compiler, e.g. PGI 8 and 9) on 32- and 64-bit Linux boxes, and found no rule of thumb. Adding gcc's -ffloat-store and its equivalents of course solved the problem and made life consistent, which just confirms my claim that not all compilers behave identically with "default unoptimized options". Ah, I tried this for my own implementation of saxpy, not Cgemm.
Nonetheless, if a true single-precision CPU version (which MKL kind of implies) and a true single-precision GPU version show differences only down at the level of rounding error, then so what? That's noise due to parallelism, or more precisely, due to the different order of computation.
How exactly do you define "identical results" between your two sets of results? As Tim ventured, are you 100% sure that your CPU computations are true 32-bit single precision?
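To make the order-of-computation point concrete, here is a tiny, artificial C example: the same single-precision numbers summed with two different groupings can already give different answers (and, tying in Tim's point, x87 80-bit intermediates can hide a difference that SSE or -ffloat-store would show):

#include <stdio.h>

/* Single-precision addition is not associative: regrouping the same operands
   changes the rounded result. With strict float evaluation (SSE or
   -ffloat-store) the two sums below differ; with x87 80-bit intermediates
   they may come out identical. */
int main(void)
{
    volatile float big   = 1.0e8f;  /* volatile discourages constant folding */
    volatile float small = 4.0f;

    float forward = (big + small) + small;  /* small terms rounded away one by one */
    float grouped = big + (small + small);  /* small terms combined first */

    printf("forward = %.1f\n", forward);
    printf("grouped = %.1f\n", grouped);
    return 0;
}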
Thanks for responding. The differences I am observing are in the double digits, percentage-wise. So I would be very surprised if the additional precision offered by 80-bit arithmetic could account for such a large difference.
I am baffled by this, and I don’t have any ideas for further numerical experimentation.
One minor uncertainty: I am assuming that the size of cuComplex is 8 bytes. Is this correct?
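On the C side I would check it like this (cuComplex comes from cuComplex.h and should be two 4-byte floats, i.e. the same layout as a default Fortran COMPLEX, but please correct me if that assumption is wrong):

#include <stdio.h>
#include <cuComplex.h>

/* Quick sanity check: cuComplex should be two 4-byte floats (8 bytes total). */
int main(void)
{
    printf("sizeof(cuComplex) = %zu\n", sizeof(cuComplex));
    printf("sizeof(float)     = %zu\n", sizeof(float));
    return 0;
}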
To further clarify the nature of my findings: all calculations are done in single-precision complex arithmetic in both the Fortran and C/CUDA sections of the code. I am using Intel Fortran on an x86 platform running Red Hat Linux.
You are NOT alone. I have seen worse differences with my factorization code (google: LDM^T solver site:http://forums.nvidia.com).
Even between 2 different CPU implementations… (forget GPU)…
And when I changed to double precision, I started getting better results…
Quite a baffling experience… It was just unbelievable for me… But I think it is just the sheer number of floating-point operations that is amplifying the errors.
Hi Sarnath. Thanks for your observations. I wish I had included example numbers in my original post; I did post some yesterday, though. Please take a look at them. I don't think moving to double precision will resolve the discrepancies.
I am not sure about this… but single-precision arithmetic has some deviations on the GPU, though I am not sure whether that affects Cgemm… You should check it with double precision, mainly because the CPU internally uses more precision (that's what I hear).