Is it possible that different graphics cards and driver versions can produce different results when they’re running the same kernel with the same input data? I’m talking about the numbers produced by the computation, not the performance.
We develop and test CUDA code using various mobile CUDA-capable GPUs (mine is a K2100M), while production code runs on a GTX 980. We noticed that results from the production machine are different from those we see when we’re developing.
Everyone is using CUDA toolkit 8.0 and a recent GPU driver, although the exact driver versions differ. I’ve spent the last week double-checking that everything else (configuration, everything CPU-side, input data, etc.) is identical, but the error is still there.
The first thing you would want to exclude is the possibility of latent bugs in the code, such as race conditions. Run the code under the control of cuda-memcheck and fix any issues that it reports. Similarly, run with valgrind to exclude the possibility of host-side data corruption.
The driver version could make a difference to the results of floating-point computations if you are using JIT compilation, as the compiler components of the driver are updated more frequently than the offline compiler, which is updated with every CUDA release. For example, different compiler versions could apply different heuristics when contracting FMUL + FADD into FMA (fused multiply-add). You can turn off the contraction by compiling with -fmad=false; however, this can have negative consequences for accuracy and performance.
Differences due to driver bugs are possible, but unlikely. Defects of the GPU hardware (e.g. causing bit flips in GPU memory) are also possible but rare. Differences in floating-point computations due to architecture-specific optimizations in the offline compiler’s backend are theoretically possible, but I am not aware of such a case ever occurring in over ten years of CUDA use and diagnosing hundreds of program failures.
Yes, you should expect slightly different results for algorithms employing floating-point calculations on different GPUs and with different driver versions. Even on the same GPU we get slightly different results on each run for certain algorithms - especially algorithms which are iterative and/or employ reductions (like a vector norm, etc.). One such class of algorithms is optical flow. Even on the CPU, if you parallelize your code e.g. with OpenMP and/or vectorize it with SIMD intrinsics, you will get slightly different results. See https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf
Reductions using floating-point atomics, and therefore an unspecified order of operations (-> non-associativity of floating-point addition), can indeed deliver different results on every run on the same GPU. Side remark: there is recently published research by Demmel et al. on parallel reproducible summation: https://people.eecs.berkeley.edu/~hdnguyen/public/papers/repsum.pdf
I fail to see how normal iterative processes, with deterministic order of operations, would generate different results on different GPUs, using identical code. It’s past midnight here so I am probably not thinking very clearly right now, but could you give a clarifying example?
I am not 100% sure why it occurs (we have the TV-L1 optical flow library only as a binary). I think it’s because multiple reductions are employed in each iteration; in addition, some nonlinear thresholding operations are applied in each iteration which might amplify the differences further. It’s just an observation of ours when using the flow. Note that the generated motion field is qualitatively the same, so it is not a problem for us in practice.
From your description, it would appear that the root cause of the discrepancies is reductions containing non-deterministic sequences of floating-point operations. Different results for summation may then lead to a different number of iterations when iterations are controlled by a residual, a comparison with a threshold, etc.
Thanks! That’s pretty much exactly the kind of information I was after.
The code is not iterative, but there are array reductions and other things where the order of floating point operations would change with thread order. It was also written in 2010 and I guess this issue was just ignored since then, so I can’t claim with 100% confidence that this is not the result of a bug/race condition but cuda-memcheck/racecheck doesn’t report any errors and we’ve been pretty thorough in going through the implementation in detail. There’s definitely no host-side corruption, we can check that at run-time.
I didn’t mention explicitly that the results are repeatable on the same machine. I don’t quite understand that because it’s pretty easy to demonstrate that thread order does vary between runs. Shouldn’t any variations in thread order cause non-associativity issues?
Depending on the details of your processing and the details of your data, differing thread order may or may not cause different final results due to lack of floating-point associativity.
Since you mention the app dates to 2010: If your reference data is very old, it may have been generated on sm_1x devices, which did not provide a single-precision FMA operation, but rather a non-IEEE-754-compliant FMAD instruction. Depending on code and data details, the switch from FMAD to FMA can cause significant numerical differences.
However, CUDA support for sm_1x devices was discontinued some years back, and all currently supported GPUs provide the same set of floating-point hardware operations. Therefore, for deterministic expression evaluations, any numerical differences should be down to different code generation and/or different libraries (e.g. the standard math library is being continuously improved for performance and accuracy).
Yeah I guess now it’s all about the details.
The date was just context, I’ve recently inherited responsibility for this code which hasn’t really changed since then. Up-to-date reference data is periodically generated on the production machine, the current set is only a few weeks old. Absolute accuracy is not the goal at this stage, we need to be able to reproduce production results on our development machines which so far we can’t.
We’re only working with sm_30 and above at this stage.
So basically the nondeterministic part is a combination of the non-guarantee of CUDA thread execution order and the previously mentioned implementation details.
Just to add to this, we also observe this behaviour in our 2D hydrodynamic modelling code.
Our development machines consist of a GTX 1080 (sm_61) and a GTX 690 (sm_30). The code performs a number of floating point reductions per timestep and we compile the code to PTX only, based on sm_30 (i.e. let the driver JIT compile it to SASS).
For a particular GPU architecture (e.g. sm_30) and given the same input data, our integration tests always produce the same results. However, when comparing the results generated by different GPU architectures (e.g. sm_30 vs sm_61), there are always very minor differences in output (on the order of 10^-6). This isn’t too much of a concern and we’ve always just assumed these are due to differences in the order of floating-point operations between GPU architectures.
I was sort of expecting errors around 10^-6 or smaller, which this thread has mostly confirmed.
What we’re seeing can be as large as 10^-1, but that’s after some nonlinear processing, and I haven’t been able to check what the error would be before the nonlinear stages yet.
Depending on the computation, it is entirely possible that initially tiny differences from a non-deterministic reduction get magnified into very substantial ones, in particular if the problem is ill-conditioned. Long-running simulations can easily be affected by the “butterfly effect”.