sgemm precision: different results from cublasSgemm vs MKL sgemm

I am writing a .NET managed wrapper for CUBLAS in C++/CLI. In my test suite I usually compare the results of BLAS operations on the device (using CUBLAS) against results from the Intel MKL.

For sgemm vs cublasSgemm, I fill a matrix with random numbers in (0,1) and multiply it with its transpose.

I get slightly different results from the two calls, and the differences get larger as I scale the values of the random numbers. Isn't this weird?

Even weirder: if I scale the numbers by a large factor (say 10^6), the difference in the results is always a power of two (512, 1024 and 2048, for example). The matrices are of moderate size (100 by 10).
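For reference, here is roughly what the comparison boils down to, written as plain C++ against the cuBLAS v2 and MKL CBLAS C interfaces rather than through my managed wrapper; the dimensions, seed and scale factor are just placeholders:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>
#include <mkl.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int m = 100, k = 10;           // A is m x k, column-major
    const float scale = 1.0e6f;          // scaling the random entries
    std::vector<float> A(m * k), C_mkl(m * m, 0.0f), C_gpu(m * m, 0.0f);
    for (int i = 0; i < m * k; ++i)
        A[i] = scale * (float)rand() / (float)RAND_MAX;

    // CPU reference: C = 1.0 * A * A^T + 0.0 * C
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, m, k, 1.0f, A.data(), m, A.data(), m, 0.0f, C_mkl.data(), m);

    // Same product on the GPU through cuBLAS
    float *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dA, m * k * sizeof(float));
    cudaMalloc(&dC, m * m * sizeof(float));
    cudaMemcpy(dA, A.data(), m * k * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, m, k, &alpha, dA, m, dA, m, &beta, dC, m);
    cudaMemcpy(C_gpu.data(), dC, m * m * sizeof(float), cudaMemcpyDeviceToHost);

    // Largest elementwise difference between the two results
    float maxdiff = 0.0f;
    for (int i = 0; i < m * m; ++i)
        maxdiff = std::fmax(maxdiff, std::fabs(C_mkl[i] - C_gpu[i]));
    std::printf("max |C_mkl - C_cublas| = %g\n", maxdiff);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dC);
    return 0;
}
```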

Has anyone run into this problem before?

No, it is not weird. Both the MKL and CUBLAS results are affected by round-off error, but in different ways. The difference can come from the non-associativity of floating-point addition: in finite-precision arithmetic it is generally not true that a + (b + c) = (a + b) + c.
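A tiny single-precision illustration of that non-associativity (values chosen only to make the rounding visible):

```cpp
#include <cstdio>

int main()
{
    float a = 16777216.0f;   // 2^24: the next representable float is 2^24 + 2
    float b = 1.0f, c = 1.0f;
    // (a + b) + c: each added 1.0f is lost to rounding -> 16777216
    // a + (b + c): 2.0f fits exactly at this magnitude -> 16777218
    std::printf("%.1f vs %.1f\n", (a + b) + c, a + (b + c));
    return 0;
}
```

A GEMM kernel that accumulates dot products in a different order (blocked, vectorized, parallelized) will therefore produce slightly different values than a straightforward loop.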

However, the error is bounded: |C_exact - C_computed| <= n * eps * |A| * |B| + O(eps^2) elementwise, where eps is the single-precision machine precision (2^-24), n is the common dimension of the matrices A and B you multiply, and |A|, |B| denote the matrices of absolute values. You can use this bound to check whether the multiplication was performed correctly.
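A minimal sketch of such a check, assuming column-major storage and that both computed results are already on the host (here the common dimension is called k, i.e. the "n" of the bound above):

```cpp
#include <cmath>
#include <vector>

// Returns true if C1 and C2 (both m x n) differ by no more than twice the
// classical GEMM error bound; since each result is within k*eps*(|A||B|) of
// the exact product, their mutual difference can be at most twice that.
bool within_gemm_bound(int m, int n, int k,
                       const std::vector<float>& A,   // m x k
                       const std::vector<float>& B,   // k x n
                       const std::vector<float>& C1,
                       const std::vector<float>& C2)
{
    const double eps = std::ldexp(1.0, -24);          // single-precision unit roundoff
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double absAB = 0.0;                       // (|A| * |B|)(i, j)
            for (int p = 0; p < k; ++p)
                absAB += std::fabs((double)A[i + p * m]) * std::fabs((double)B[p + j * k]);
            double bound = 2.0 * k * eps * absAB;
            if (std::fabs((double)C1[i + j * m] - (double)C2[i + j * m]) > bound)
                return false;
        }
    return true;
}
```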

Another way to test correctness is to use matrices with sufficiently small integer entries. In that case all arithmetic operations are exact, so both implementations must return identical results.
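A quick illustration of why the order of accumulation stops mattering in that case (a toy dot product rather than a full GEMM):

```cpp
#include <cstdio>

int main()
{
    // With small integer entries, every product and every partial sum of a
    // 10-term dot product stays an integer far below 2^24, so it is exact in
    // single precision regardless of the order of accumulation.
    float entries[10] = {3, 7, 1, 9, 4, 2, 8, 6, 5, 0};
    float forward = 0.0f, backward = 0.0f;
    for (int i = 0; i < 10; ++i)  forward  += entries[i] * entries[i];
    for (int i = 9; i >= 0; --i)  backward += entries[i] * entries[i];
    std::printf("%s\n", forward == backward ? "identical" : "different");
    return 0;
}
```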

That's due to the IEEE floating-point format. Check out "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg. There are plenty of copies online; one is at http://docs.sun.com/source/806-3568/ncg_goldberg.html (the first one that came up on Google today).

Another reason the results differ is that the GPU uses a fused multiply-add, which involves only one rounding instead of the two the CPU performs.
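A small example of that one-rounding-vs-two difference (values are hypothetical, picked so the two paths visibly disagree; assumes SSE float arithmetic and that the compiler does not contract the first expression into an FMA itself, e.g. -ffp-contract=off):

```cpp
#include <cstdio>
#include <cmath>

int main()
{
    float a = 1.0f + 1.0f / 4096.0f;           // 1 + 2^-12
    // a*a equals 1 + 2^-11 + 2^-24 exactly; the last term does not fit in a float.
    float separate = a * a - 1.0f;              // product rounded first: 2^-11
    float fused    = std::fma(a, a, -1.0f);     // single rounding: 2^-11 + 2^-24
    std::printf("separate = %.10g, fused = %.10g\n", separate, fused);
    return 0;
}
```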

Actually, the programming guide (B.1) says that the FMAD instruction truncates the intermediate result of the multiplication.

So there is an intermediate rounding step after all.