varying outcomes cublas on gtx 275

Hi all,

In topic 152261 a similar issue was reported.
SDK example (matrixmultiplication in simpleCublas), GTX 275, driver 258.96, Win 7 64. The differences are small, but creep up to the point where the result is “FAILED”, with an occasional “PASSED”. Cuda version 3.0 and 3.1. The identical executable(s) work(s) fine on a quadro FX 770M (smaller # of SM’s). Loaded cublas dll is cublas32_31_9.dll on the 275 and cublas32_31_4.dll on the quadro (checked in debug output).
The pass/fail test in the program is done on error_norm/ref_norm. I checked this value. On the 275, the best results are just under 1e-6 but will reach about 1e-5 . On the quadro it is consistently 7e-8.
I now use Volkov’s code for matrix multiplication (thanks, Vasili), and that works perfectly, but of course, is not a complete library.
Last thing is, I notice some problems with textures. Can’t get the SDK example convolutionTexture to pass, similar margins.
I tried to find cublas-sources to find out more, but these have been removed.
My guess is an omitted thread synchronisation or a conflict in accessing global memory. I can’t tell, but it would be something that only shows up with a large number of threads and/or multiprocessors.


the problem is accumulation of rounding error, it should be

|x’y - fl(x’y)| <= n * eps * |x|’|y|

where x and y are vectors, n is the size of x, and eps is the machine epsilon (about 1.2e-7 in single precision and 2.2e-16 in double).

if you extend this estimation in matrix-multiplication, you may have

||C - fl(C)|| <= n * eps * ||A|| * ||B||

where ||X|| = max{ ||X(:,j)|| for all j }, X is a matrix.

The proper way to measure rounding error is to check if
|| C_gpu - C_cpu || / (||A|| * ||B||) <= n * eps
or not.
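A minimal sketch of that check in plain C. It assumes square n x n matrices in column-major order (as cublas uses) and takes the Euclidean norm for each column, which the post above does not specify. The function names are made up for illustration. Squared norms are compared so that no square root is needed.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Largest squared Euclidean norm over the columns of a column-major n x n matrix.
   If y is not NULL, the norm is taken of the difference x - y. */
static double max_col_norm_sq(const float *x, const float *y, size_t n)
{
    double best = 0.0;
    for (size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double v = (double)x[j * n + i];
            if (y != NULL)
                v -= (double)y[j * n + i];
            sum += v * v;
        }
        if (sum > best)
            best = sum;
    }
    return best;
}

/* Returns 1 if ||C_gpu - C_cpu|| / (||A|| * ||B||) <= n * eps, else 0.
   eps is FLT_EPSILON because the product was computed in single precision. */
int within_rounding_bound(const float *a, const float *b,
                          const float *c_gpu, const float *c_cpu, size_t n)
{
    assert(a != NULL && b != NULL && c_gpu != NULL && c_cpu != NULL && n > 0);
    double tol = (double)n * FLT_EPSILON;
    double diff_sq = max_col_norm_sq(c_gpu, c_cpu, n);
    double scale_sq = max_col_norm_sq(a, NULL, n) * max_col_norm_sq(b, NULL, n);
    /* Both sides are squared, so the comparison matches the unsquared one. */
    return diff_sq <= tol * tol * scale_sq;
}
```

Calling within_rounding_bound(A, B, C_gpu, C_cpu, n) after each sgemm run gives a pass/fail that scales with the matrix size, unlike a fixed threshold on error_norm/ref_norm.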

Thank you for your answer. It set me out on a very useful trail, finding the classic book by Wilkinson on rounding errors in algebraic processes and your course (containing chapter 20, rounding errors and use of the matmul example and the use of the L2-norm).

First a question of understanding: max{ ||X(:,j)|| for all j } means the largest norm of any column-vector of X, is that correct?

I will check against the accumulated roundoffs.

There is another matter. Not only are the errors too high, but the results of repeated cublas sgemm calculations with the same inputs are also not consistent on my machine, as described earlier. They are fully reproducible on another machine (low errors and identical repetitions), but vary on mine. The identical executable and dll’s were used on both machines. This cannot be correct. The matrices are square, so an out-of-bounds array access is not likely. The matmul example and Volkov’s code (as in your lsc_sgemm) work fine. My vote is still on an implementation problem in the kernel code of the library. I cannot check this and, moreover, the cublas manual mentions that parts of the library have been written by Vasili…

Rather at a loss,


  1. we focus on y = A * x, then rounding error is

    |y(i) - fl(y(i))| <= n * eps * |A(i,:)| |x|

    this is component-wise estimate, for norm-wise estimate, we have

    ||y - fl(y)|| <= n * eps * ||A|| * ||x||

    where ||x|| = max{|x(i)| for all i} is sup-norm of vector x

      ||A|| is classical sup-norm of matrix A.

For C = A * B, just apply above formula for each column of C, or derive another norm to cover all columns,

that is why I use ||X|| = max{ ||X(:,j)|| for all j }

Remark: I prefer to use the component-wise formula when testing an algorithm, because it is tighter than the norm-wise one.

  2. you found an example where cublas cannot produce the same result twice. I think this is a bug, and you should report it to nvidia.

    matrix multiplication is very simple: there is no data contention, so once the tiling is determined the result is reproducible.

    which version of cuda do you use? can you provide the matrix, so we can double-check this?
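The component-wise estimate from point 1, applied to every entry of C = A * B, could be checked roughly as below. This is only a sketch: it assumes square column-major matrices, uses a double-precision product as the "exact" reference, and the function name is hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Counts entries of c that violate
     |c(i,j) - (A*B)(i,j)| <= n * eps * (|A| * |B|)(i,j)
   for column-major n x n single-precision matrices.
   The reference product is accumulated in double precision. */
size_t count_bound_violations(const float *a, const float *b,
                              const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL && n > 0);
    size_t violations = 0;
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            double exact = 0.0;   /* (A*B)(i,j) */
            double bound = 0.0;   /* (|A|*|B|)(i,j) */
            for (size_t k = 0; k < n; ++k) {
                double p = (double)a[k * n + i] * (double)b[j * n + k];
                exact += p;
                bound += (p < 0.0) ? -p : p;
            }
            double err = (double)c[j * n + i] - exact;
            if (err < 0.0)
                err = -err;
            if (err > (double)n * FLT_EPSILON * bound)
                ++violations;
        }
    }
    return violations;
}
```

A return value of 0 means every entry is within the rounding bound. A nonzero count on the GTX 275 for sizes that are not multiples of 16 would point at something other than ordinary rounding.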

I have done a bit more research. I find that the errors depend on the (square) matrix size. Up to about 160x160 the GTX 275 does OK; beyond that the errors grow in proportion, except when the matrix size is a multiple of 16. In that case, the result is correct.

Attached you will find a zip file with the source, the MSVC 2008 project file, Windows executables and a set of outputs. The data of the two GPUs used are also included. The Quadro produces no errors.

I reworked the “gold” calculation a bit by interchanging the loops for more sequential memory access and added OpenMP support.
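For readers without the zip file, the loop interchange could look roughly like the sketch below. This is not the attached source. It assumes column-major storage and the usual C = alpha * A * B + beta * C form, and the name gold_sgemm is made up. The innermost loop runs over i, so A and C are both read with stride 1.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sgemm for column-major n x n matrices: C = alpha*A*B + beta*C.
   Loop order j, k, i keeps the inner loop on contiguous memory.
   alpha is applied per term, so rounding can differ slightly from a
   version that scales the finished dot product. */
void gold_sgemm(int n, float alpha, const float *a, const float *b,
                float beta, float *c)
{
    assert(n > 0 && a != NULL && b != NULL && c != NULL);
    int i, j, k;
#ifdef _OPENMP
#pragma omp parallel for private(i, k)
#endif
    for (j = 0; j < n; ++j) {
        float *cj = c + (size_t)j * (size_t)n;   /* column j of C */
        for (i = 0; i < n; ++i)
            cj[i] = (beta == 0.0f) ? 0.0f : beta * cj[i];
        for (k = 0; k < n; ++k) {
            float s = alpha * b[(size_t)j * (size_t)n + (size_t)k];
            const float *ak = a + (size_t)k * (size_t)n;   /* column k of A */
            for (i = 0; i < n; ++i)
                cj[i] += s * ak[i];
        }
    }
}
```

Each thread owns whole columns of C, so there is no write conflict between threads and the result does not depend on the number of threads.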

Otherwise, the program simply runs 100 identical iterations for a range of matrix sizes specified by Nstart and Nend. You can add a -v parameter to output the calculated results, including the L2 norm, for each run.
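The comparison between repeated runs can be as simple as the sketch below, which counts elements whose bit pattern differs from the first run. The helper name is hypothetical and not taken from the attached program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts elements of run that are not bit-identical to first.
   memcmp is used instead of == so that NaN and signed zeros are
   compared by bit pattern. */
size_t count_differing_elements(const float *first, const float *run,
                                size_t count)
{
    assert(first != NULL && run != NULL);
    size_t differing = 0;
    for (size_t i = 0; i < count; ++i) {
        if (memcmp(&first[i], &run[i], sizeof(float)) != 0)
            ++differing;
    }
    return differing;
}
```

With identical inputs, every one of the 100 iterations should return 0 when compared against the first. Any other value means the sgemm result is not reproducible on that GPU.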

I am very interested to hear whether the results can be reproduced (on gtx 200 or other hardware) and about ideas about the possible nature of the problem.

Jan