Is it possible that different graphics cards and driver versions can produce different results when they’re running the same kernel with the same input data? I’m talking about the numbers produced by the computation, not the performance.
We develop and test CUDA code using various mobile CUDA-capable GPUs (mine is a K2100M), while production code runs on a GTX 980. We noticed that results from the production machine are different from those we see when we’re developing.
Everyone is using CUDA toolkit 8.0 and a recent GPU driver, although the exact driver versions differ. I’ve spent the last week double-checking that everything else (configuration, everything CPU-side, input data, etc.) is identical, but the error is still there.
The first thing you would want to exclude is the possibility of latent bugs in the code, such as race conditions. Run the code under the control of cuda-memcheck and fix any issues that it reports. Similarly, run with valgrind to exclude the possibility of host-side data corruption.
The driver version could make a difference to the results of floating-point computations if you are using JIT compilation, as the compiler components of the driver are updated more frequently than the offline compiler, which is updated with every CUDA release. For example, different compiler versions could apply different heuristics when contracting FMUL + FADD into FMA (fused multiply-add). You can turn off the contraction by compiling with -fmad=false; however, this can have negative consequences for accuracy and performance.
Differences due to driver bugs are possible, but unlikely. Defects of the GPU hardware (e.g. causing bit flips in GPU memory) are also possible but rare. Differences in floating-point computations due to architecture-specific optimizations in the offline compiler’s backend are theoretically possible, but I am not aware of such a case ever occurring in over ten years of CUDA use and diagnosing hundreds of program failures.
Yes, you should expect slightly different results for algorithms employing floating-point calculations on different GPUs and with different driver versions. Even on the same GPU we get slightly different results on each run for certain algorithms - especially algorithms which are iterative and/or employ reductions (like a vector norm, etc.). One such class of algorithms is optical flow. Even on the CPU, if you parallelize your code e.g. with OpenMP and/or vectorize it with SIMD intrinsics, you will get slightly different results. See https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf
Reductions using floating-point atomics, and therefore an unspecified order of operations (-> non-associativity of floating-point addition), can indeed deliver different results on every run on the same GPU. Side remark: there is recently published research by Demmel et al. on parallel reproducible summation: https://people.eecs.berkeley.edu/~hdnguyen/public/papers/repsum.pdf
I fail to see how normal iterative processes, with deterministic order of operations, would generate different results on different GPUs, using identical code. It’s past midnight here so I am probably not thinking very clearly right now, but could you give a clarifying example?
I am not 100% sure why it occurs (we have the TV-L1 optical flow library only as a binary). I think it’s because multiple reductions are employed in each iteration; in addition, some nonlinear thresholding operations are applied in each iteration which might amplify the differences further. It’s just an observation of ours when using the flow. Note that the generated motion field is qualitatively the same, so it is not a problem for us in practice.
From your description, it would appear that the root cause of the discrepancies is reductions containing non-deterministic sequences of floating-point operations. Different results for summation may then lead to a different number of iterations when iterations are controlled by a residual, a comparison with a threshold, etc.
Thanks! That’s pretty much exactly the kind of information I was after.
The code is not iterative, but there are array reductions and other things where the order of floating point operations would change with thread order. It was also written in 2010 and I guess this issue was just ignored since then, so I can’t claim with 100% confidence that this is not the result of a bug/race condition but cuda-memcheck/racecheck doesn’t report any errors and we’ve been pretty thorough in going through the implementation in detail. There’s definitely no host-side corruption, we can check that at run-time.
I didn’t mention explicitly that the results are repeatable on the same machine. I don’t quite understand that because it’s pretty easy to demonstrate that thread order does vary between runs. Shouldn’t any variations in thread order cause non-associativity issues?
Depending on the details of your processing and the details of your data, differing thread order may or may not cause different final results due to lack of floating-point associativity.
Since you mention the app dates to 2010: If your reference data is very old, it may have been generated on sm_1x devices, which did not provide a single-precision FMA operation, but rather a non-IEEE-754-compliant FMAD instruction. Depending on code and data details, the switch from FMAD to FMA can cause significant numerical differences.
However, CUDA support for sm_1x devices was discontinued some years back, and all currently supported GPUs provide the same set of floating-point hardware operations. Therefore, for deterministic expression evaluations, any numerical differences should be down to different code generation and/or different libraries (e.g. the standard math library is being continuously improved for performance and accuracy).
Yeah I guess now it’s all about the details.
The date was just context, I’ve recently inherited responsibility for this code which hasn’t really changed since then. Up-to-date reference data is periodically generated on the production machine, the current set is only a few weeks old. Absolute accuracy is not the goal at this stage, we need to be able to reproduce production results on our development machines which so far we can’t.
We’re only working with sm_30 and above at this stage.
So basically the nondeterministic part is a combination of the non-guarantee of CUDA thread execution order and the previously mentioned implementation details.
Just to add to this, we also observe this behaviour in our 2D hydrodynamic modelling code.
Our development machines consist of a GTX 1080 (sm_61) and a GTX 690 (sm_30). The code performs a number of floating point reductions per timestep and we compile the code to PTX only, based on sm_30 (i.e. let the driver JIT compile it to SASS).
For a particular GPU architecture (e.g. sm_30) and given the same input data, our integration tests always produce the same results. However, when comparing the results generated by different GPU architectures (e.g. sm_30 vs sm_61), there are always very minor differences in output (on the order of 10^-6). This isn’t too much of a concern and we’ve always just assumed these are due to differences in the order of floating-point operations between GPU architectures.
I was sort of expecting errors around 10^-6 or smaller, which this thread has mostly confirmed.
What we’re seeing can be as large as 10^-1, but that’s after some nonlinear processing, and I haven’t been able to check what the error would be before the nonlinear stages yet.
Depending on the computation, it is entirely possible that initially tiny differences from a non-deterministic reduction get magnified into very substantial ones, in particular if the problem is ill-conditioned. Long-running simulations can easily be affected by the “butterfly effect”.