In topic 152261 a similar issue was reported.
I see it with the SDK example simpleCUBLAS (matrix multiplication) on a GTX 275, driver 258.96, Windows 7 64-bit, with CUDA 3.0 and 3.1. The differences are small, but they creep up to the point where the result is “FAILED”, with an occasional “PASSED”. The identical executables work fine on a Quadro FX 770M (fewer SMs). The loaded cuBLAS DLL is cublas32_31_9.dll on the 275 and cublas32_31_4.dll on the Quadro (checked in the debug output).
The pass/fail test in the program is done on error_norm/ref_norm, so I checked this value. On the 275, the best results are just under 1e-6, but it can reach about 1e-5. On the Quadro it is consistently 7e-8.
I now use Volkov’s code for matrix multiplication (thanks, Vasily), which works perfectly but is, of course, not a complete library.
One last thing: I also notice some problems with textures. I can’t get the SDK example convolutionTexture to pass; it fails by similar margins.
I tried to find the cuBLAS sources to learn more, but they have been removed.
My guess is that it could be a missing thread synchronization or a conflict in accessing global memory. I can’t tell, but it looks like something that shows up with a large number of threads and/or multiprocessors.

Thank you for your answer. It set me on a very useful trail: I found Wilkinson’s classic book on rounding errors in algebraic processes and your course (including chapter 20 on rounding errors, which uses the matmul example and the L2 norm).

First a question of understanding: max{ ||X(:,j)|| for all j } means the largest norm of any column-vector of X, is that correct?

I will check against the accumulated roundoffs.

There is another matter. Not only are the errors too high, but the results of repeated cuBLAS sgemm calculations with the same inputs are also not consistent on my machine, as described earlier. They are fully reproducible on another machine (low errors and identical repetitions), but vary on mine. The identical executable and DLLs have been used on both machines. This cannot be correct. The matrices are square, so an out-of-bounds array access is not likely. The matmul example and Volkov’s code (as in your lsc_sgemm) work fine. My vote is still on an implementation problem in the kernel code of the library. I cannot check this, and moreover the cuBLAS manual mentions that parts of the library were written by Vasily…
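
One simple way to quantify this kind of run-to-run variation is to copy the output of each repetition back to the host and compare it bit for bit with the first run. A minimal sketch, with a hypothetical helper name and assuming both outputs are already in host memory:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the elements whose bit patterns differ between two runs.
   memcmp is used instead of == so that NaNs and signed zeros are
   compared by representation rather than by value.
   Hypothetical helper for illustration. */
size_t count_bitwise_mismatches(const float *run_a, const float *run_b, size_t n)
{
    assert(run_a != NULL && run_b != NULL);

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (memcmp(&run_a[i], &run_b[i], sizeof(float)) != 0) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A count of zero for every repetition is what a deterministic sgemm with identical inputs would be expected to give.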

I have done a bit more research. I find that the errors depend on the (square) matrix size. Up to about 160x160 the GTX 275 does OK; beyond that the errors grow in proportion to the size, except when the matrix size is a multiple of 16. In that case, the result is correct.

Attached you will find a zip file with the source, an MSVC 2008 project file, Windows executables and a set of outputs. The data for the two GPUs used are also included. The Quadro produces no errors.

I reworked the “gold” calculation a bit by interchanging the loops for more sequential memory access, and added OpenMP support.
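
For readers without the zip file, a loop-interchanged reference could look roughly like the sketch below. It is not the attached code: the function name, the column-major layout (as in cuBLAS) and the choice to parallelize over output columns are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = alpha * A * B + beta * C for n x n column-major matrices.
   Loop order j-k-i makes the innermost loop walk down one column of A
   and one column of C, so memory is read and written sequentially.
   Each thread owns whole columns of C, so no synchronization is needed.
   Hypothetical sketch; without OpenMP enabled the pragma is ignored. */
void sgemm_gold(int n, float alpha, const float *A, const float *B,
                float beta, float *C)
{
    assert(n >= 0);

    #pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        float *c_col = C + (size_t)j * n;

        /* Scale the existing column; with beta == 0 the old contents
           are overwritten rather than multiplied. */
        for (int i = 0; i < n; ++i) {
            c_col[i] = (beta == 0.0f) ? 0.0f : beta * c_col[i];
        }

        for (int k = 0; k < n; ++k) {
            const float *a_col = A + (size_t)k * n;
            float scaled_b = alpha * B[(size_t)j * n + k]; /* alpha * B(k,j) */
            for (int i = 0; i < n; ++i) {
                c_col[i] += a_col[i] * scaled_b;
            }
        }
    }
}
```

Note that interchanging the loops changes the order in which the single-precision sums are accumulated, so the reference values themselves can differ in the last bits from a plain row-times-column version.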

Otherwise, the program simply runs 100 identical iterations for each matrix size in the range specified by Nstart and Nend. You can add a -v parameter to output the calculated results, including the L2 norm, for each run.

I am very interested to hear whether the results can be reproduced (on GTX 200 or other hardware), and to hear any ideas about the possible nature of the problem.