I am writing a .NET managed wrapper for CUBLAS in C++/CLI. In my test suite, I usually compare the results of BLAS operations on the device (using CUBLAS) with results from the Intel MKL.

To compare sgemm with cublasSgemm, I fill a matrix with random numbers in (0, 1) and multiply it by its transpose.

I get slightly different results from the two calls, and the differences get larger as I scale up the values of the random numbers. Isn't this weird?

Even weirder: if I scale the numbers by a large factor (say 10^6), the difference in results is always a power of two (512, 1024, and 2048, for example). The matrices are of moderate size (100 by 10).

Has anyone run into this problem before?