Generally I find that there are two mechanisms on the GPU that tend to somewhat enhance the accuracy of single-precision computations compared to most CPUs:
(1) Use of FMA (fused multiply-add) is ubiquitous. Compared to the separate FMUL/FADD still predominant on x86 CPUs, this reduces rounding error and protects against certain cases of subtractive cancellation that involve products. The latter is actually responsible for most of the improvement in the codes that I have studied in detail.
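To illustrate the cancellation case in (1), here is a minimal sketch. It assumes an emulated FMA (the helper `fma` below is hypothetical, built on exact rational arithmetic) and uses Python's double-precision floats rather than single precision, so it only shows the kind of effect, not GPU behavior as such.

```python
from fractions import Fraction

def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: form a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30        # exactly representable in double precision
c = -(1.0 + 2.0**-29)     # negative of the rounded product a*a

# Separate multiply and add: the product is rounded first, so the low-order
# term 2**-60 of a*a is already lost when the subtraction cancels the rest.
separate = a * a + c

# Fused: the exact product takes part in the addition, with a single rounding
# at the end, so the low-order term survives the cancellation.
fused = fma(a, a, c)

print("separate multiply/add:", separate)
print("fused multiply-add:   ", fused)
```

The operands are chosen so that the leading bits of the product cancel exactly, which is the situation where the rounding of the intermediate product matters most.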
(2) The fact that reductions on the GPU typically involve a tree pattern instead of a linear pattern tends to improve the accuracy of resulting sums since the likelihood of operands of similar magnitude being added is increased.
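The difference between the two summation orders in (2) can be sketched as follows. This is an illustration under assumptions: single precision is emulated by rounding through `struct` (the helper `f32` is hypothetical), and the input of identical values is a deliberately favorable case for the tree order, since every partial sum there pairs operands of equal magnitude.

```python
import struct
from fractions import Fraction

def f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def linear_sum(values):
    """Left-to-right accumulation, rounding to single precision each step."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total

def tree_sum(values):
    """Pairwise (tree) reduction, rounding to single precision each step."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return f32(tree_sum(values[:mid]) + tree_sum(values[mid:]))

values = [f32(0.1)] * (1 << 14)
exact = sum(Fraction(v) for v in values)   # exact rational reference

linear_err = abs(Fraction(linear_sum(values)) - exact)
tree_err = abs(Fraction(tree_sum(values)) - exact)

print("linear error:", float(linear_err))
print("tree error:  ", float(tree_err))
```

With less regular data the gap is usually smaller, so this is best read as a demonstration of the mechanism rather than a typical result.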
However, for iterated or complex computations there are often numerical effects that are difficult to anticipate if one is not a trained numerical analyst. These are questions like: What is more accurate, 1.0/sqrt(x) or sqrt(1.0/x)? While I know the answer to this and similar scenarios based on many years of practical experience, I am still surprised many times when I experiment to find the most favorable arrangement (in terms of accuracy) of a particular computation.
I would therefore encourage meaningful experiments against high-accuracy references. I am partial to double-double and arbitrary-precision computations as references, even though they are time-consuming. I am extraordinarily cautious with same-precision comparisons against other software packages or processors. Einstein supposedly stated: The man with one watch always knows what time it is, the man with two watches can never be sure.
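An experiment of the kind described above could look like the following sketch, which measures both arrangements against a 50-digit `decimal` reference instead of asserting an answer up front. The sample range, sample count, seed, and the helper names `ulp_error` and `max_errors` are arbitrary choices for illustration, and the arithmetic is Python's double precision, not single precision on a GPU.

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 50   # reference precision, far beyond double precision

def ulp_error(computed: float, reference: Decimal) -> float:
    """Error of `computed`, in units of the last place of the reference."""
    ulp = Decimal(math.ulp(float(reference)))
    return float(abs(Decimal(computed) - reference) / ulp)

def max_errors(samples):
    """Worst observed error of 1.0/sqrt(x) and of sqrt(1.0/x) over samples."""
    worst_a = worst_b = 0.0
    for x in samples:
        reference = 1 / Decimal(x).sqrt()
        worst_a = max(worst_a, ulp_error(1.0 / math.sqrt(x), reference))
        worst_b = max(worst_b, ulp_error(math.sqrt(1.0 / x), reference))
    return worst_a, worst_b

rng = random.Random(12345)
samples = [rng.uniform(1.0, 4.0) for _ in range(10000)]
err_rsqrt, err_sqrt_recip = max_errors(samples)

print(f"max error of 1.0/sqrt(x): {err_rsqrt:.3f} ulp")
print(f"max error of sqrt(1.0/x): {err_sqrt_recip:.3f} ulp")
```

The point of the design is that both candidates are judged against one higher-precision reference, rather than against each other.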
In the past dozen years or so, there have been multiple useful publications (mostly by French authors) that examine the utility of compensated sums, compensated dot products, compensated polynomial evaluation, and the like. The techniques covered can be very helpful in eliminating crucial “accuracy bottlenecks” in computation, similar to the way in which AMBER combines single-precision computation with 64-bit fixed-point accumulators. An added bonus is that many of the techniques benefit from the presence of FMA.
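As a small illustration of what a compensated sum does, here is a sketch in the style of the error-free-transformation approach covered in that literature. The function names are my own, and the example runs in Python's double precision; the same idea carries over to single precision.

```python
def two_sum(a: float, b: float):
    """Return (s, e) where s is the rounded sum and s + e equals a + b exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e

def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def compensated_sum(values):
    """Accumulate the rounding error of every addition and add it back at the end."""
    total = 0.0
    correction = 0.0
    for v in values:
        total, e = two_sum(total, v)
        correction += e
    return total + correction

values = [1e16, 1.0, -1e16]   # the 1.0 is lost in a plain running sum
print("naive:      ", naive_sum(values))
print("compensated:", compensated_sum(values))
```

The extra work per element is a handful of additions and subtractions with no additional memory traffic, which is why the cost tends to be low on bandwidth-bound code.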
I have had the opportunity to use some of these accuracy-enhancing compensation techniques in real-life applications, and the performance impact was often minimal, as the widening imbalance between FLOPS and memory bandwidth in recent GPU generations has led to an increase in “dark FLOPS”. I like to joke that FLOPS have become “too cheap to meter”.
Here is (in no particular order) some relevant content available online for free:
Philippe Langlois and Nicolas Louvet. More Instruction Level Parallelism Explains the Actual Efficiency of Compensated Algorithms
Philippe Langlois and Nicolas Louvet. Solving triangular systems more accurately and efficiently
Stef Graillat, Philippe Langlois, and Nicolas Louvet. Algorithms for Accurate, Validated and Fast Polynomial Evaluation
Philippe Langlois and Nicolas Louvet. Operator Dependant Compensated Algorithms
Stef Graillat. Choosing a Twice More Accurate Dot Product Implementation
Stef Graillat, Philippe Langlois, and Nicolas Louvet. Accurate dot products with FMA
Takeshi Ogita, Siegfried M. Rump, and Shin’ichi Oishi. Accurate sum and dot product
Siegfried M. Rump. Ultimately Fast Accurate Summation
Philippe Langlois. 4ccurate 4lgorithms in Floating Point 4rithmetic
Philippe Langlois and Nicolas Louvet. Faithful Polynomial Evaluation with Compensated Horner Algorithm
Stef Graillat and Valérie Ménissier-Morain. Compensated Horner scheme in complex floating point arithmetic
Stef Graillat and Valérie Ménissier-Morain. Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic
Stef Graillat, Philippe Langlois, and Nicolas Louvet. Improving the Compensated Horner Scheme with a Fused Multiply and Add
Stef Graillat. Accurate Floating-Point Product and Exponentiation
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller. Further Analysis of Kahan’s Algorithm for the Accurate Computation of 2x2 Determinants
Jean-Michel Muller. On the error of computing ab + cd using Cornea, Harrison and Tang’s method