cublasSgemm produces non-trivially different results in CUDA 9.1 vs CUDA 8.0


I’m getting non-trivially different numerical results from cublasSgemm on identical test code with identical inputs; the only difference between the two test executables is the CUDA SDK they were built against (CUDA 9.1 vs CUDA 8.0).

The differences are in the 0.00001-0.00000001 range, and they are not just a couple of numbers here and there; it’s pretty much the whole result set.

The code to reproduce is less than 200 lines long and is available here (together with the input data set). I’m compiling it with VS2012.

Is there a good explanation for this discrepancy or does this smell like a bug?

Thanks in advance,

Variation in the 6th most-significant digit or beyond is quite normal for float calculations, in my experience.

Some illustration of why this can occur is given here:

The floating-point order-of-operations can change for what is the “same” calculation or library call based on a variety of factors, including GPU type, driver version, and CUDA version. Usually, by switching to double calculations (e.g. cublasDgemm) the relative accuracy can be improved, although I haven’t tried it in your case.
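A minimal illustration of the non-associativity point (plain Python, double precision here, but the same holds for float): regrouping the same three terms changes the rounded result.

```python
# Floating-point addition is not associative: the same three terms,
# grouped differently, round to different results.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds first
right = a + (b + c)  # 0.2 + 0.3 rounds first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A GEMM kernel performs millions of such additions, and the grouping is an internal implementation choice, so any change to the kernel can reshuffle the rounding.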

Similar questions come up from time to time. Here is another recent one:

I grabbed a random example from your list of mismatches and looked at the bit pattern for the ‘float’ variables:

-0.1540562809 = 0xbe1dc0ee
-0.1540563852 = 0xbe1dc0f5

So this particular case is a 7-ulp difference. As Robert Crovella explained, such discrepancies are plausibly attributable to a different order of arithmetic operations, since floating-point arithmetic is not associative. Elements in the result matrix of GEMM are typically long dot products, which may be assembled in all kinds of orders and chunk sizes depending on the parallelization scheme used.
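The 7-ulp figure is easy to verify from the quoted values; a small sketch (plain Python, using struct to recover the float32 bit patterns):

```python
import struct

def f32_bits(x):
    """Bit pattern of x rounded to the nearest IEEE-754 binary32 value."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

a = f32_bits(-0.1540562809)  # 0xbe1dc0ee
b = f32_bits(-0.1540563852)  # 0xbe1dc0f5

print(hex(a), hex(b))
print('ulp distance:', abs(a - b))  # 7
```

For two finite floats of the same sign, the integer difference of their bit patterns is exactly their distance in ulps, which is why the subtraction above works.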

The more interesting question is, what is the deviation of either result from the mathematically correct result?
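One way to explore that question on a small scale is to simulate float32 accumulation of the same dot product in two different orders and compare both against a float64 reference. A sketch with synthetic random data (the seed, vector length, and chunk count are arbitrary choices of mine, not anything cuBLAS actually does):

```python
import random
import struct

def f32(x):
    """Round x to the nearest IEEE-754 binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

random.seed(42)
xs = [f32(random.uniform(-1.0, 1.0)) for _ in range(10000)]
ys = [f32(random.uniform(-1.0, 1.0)) for _ in range(10000)]

# One possible ordering: strictly sequential float32 accumulation.
seq = 0.0
for x, y in zip(xs, ys):
    seq = f32(seq + f32(x * y))

# Another possible ordering: 8 interleaved partial sums combined at the
# end, roughly what a vectorized or parallel reduction does.
partials = [0.0] * 8
for i, (x, y) in enumerate(zip(xs, ys)):
    partials[i % 8] = f32(partials[i % 8] + f32(x * y))
chunked = 0.0
for p in partials:
    chunked = f32(chunked + p)

# float64 reference, close enough to "mathematically correct" here.
ref = sum(x * y for x, y in zip(xs, ys))

print(seq, chunked, abs(seq - ref), abs(chunked - ref))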

Thank you both for the prompt replies. I generally get that floating point arithmetic is not associative and thus a change in an order of operations can easily cause the discrepancies along the lines of what I’m seeing.

The weird thing, or at least what seems weird to me, is that I’m not seeing the same discrepancy for a similar function used in the same algorithm, cublasSgeam; all our cublasSgeam results between CUDA 8.0 and CUDA 9.1 are exactly the same. Sure, it’s a different algorithm, and the input matrices are different, but the fact that these results remained stable between SDK versions and the cublasSgemm results changed noticeably is partly what prompted this question.

I guess I’m looking for a confirmation that there was a change in the order of operations between CUDA 8.0 and CUDA 9.1 specifically as far as cublasSgemm is concerned?

The question of what is the deviation of either result from the mathematically correct result is super interesting. It’d be nice if something like that was a part of a public test suite…

As you can see when you use the profiler, GEMM maps to myriad kernels under the hood based on GPU architecture, matrix dimensions, matrix shape (aspect ratio), matrix transpose mode, etc.

It is possible that the implementation of one of these kernels changed between CUDA versions, maybe to fix a bug or enhance performance. It is also possible that the picking logic was modified so different kernels are used for the same matrices when changing CUDA versions.

As opposed to interface or specification changes, internal library changes are implementation artifacts and as such communicated to the public only in rare circumstances. This applies to any vendor on any platform. Granted, with open source you could investigate what has changed in the library source code; you cannot do that for CUBLAS other than possibly checking if different kernels are used (name changes in the profiler output would be a good indication).

Stating any kind of GEMM error bounds for the general universe of all possible sets of three matrices is an open research problem, best I know. You can easily convince yourself of that by writing a very simple GEMM implementation and then testing it against a few million different random plus special-case matrices while trying to check against a specific previously set error bound. If you have any hair left on your head after that exercise, let us know :-) Been there, done that, got the t-shirt.

Haha, thanks for the fair warning.

I agree that asking NVIDIA to document internal changes would be unreasonable. At the same time, simply saying “we changed a few things internally, expect a slight difference in results in X, Y and Z” in the release notes would go a long way.

It’s already done in some situations — presumably to preempt a flood of threads like this one :) — e.g. says (my emphasis) "The math mode must be set to CUBLAS_TENSOR_OP_MATH. Floating point math is not associative, so the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-Tensor Core math routines. ".

It’d be super nice to have it done more consistently; after all, it’s one thing to know that things can change, and another one to be explicitly told that they did.

Either way, I appreciate your sharing of insight and expertise, thanks a lot!

I boldly claim that slight differences in floating-point results should always be expected when changing library versions. The only time such changes are typically pointed out is when they impact properties that are part of specifications, such as the accuracy of standard math functions. I don’t think NVIDIA handles the messaging in this respect in a materially different fashion compared to other vendors including open source projects like glibc.

Point taken. For a library that exclusively deals with floating-point math, though, a permanent note along those lines (“slight differences in floating-point results should always be expected when changing library versions”) at the top of the release notes would still be better than the assumed implicit wisdom that one eventually comes to possess :).

If you’d like to see a change in CUDA documentation, the recommended approach is to file a bug at

Thank you, Robert. I was about to, but then I thought that I should triple-check the existing docs, and sure enough, they already state exactly what I was asking for, just not “at the top of the release notes”:

By design, all CUBLAS API routines from a given toolkit version, generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit version because the implementation might differ due to some implementation changes.


I apologize for not looking carefully enough before complaining!