Observations related to 32 bit floating point ops and 64 bit floating point ops

This topic comes up often, and since I have been comparing the same calculations performed in 32 bit float vs 64 bit, I thought to share my results.

My comparison scripts were written in MATLAB.

It is important to note that MATLAB’s default is 64 bit(double) and even if you explicitly cast to ‘single’, it will still perform the individual calculations in 64 bit before casting back to 32 bit.
Whether or not this applies to large accumulated sums I am not yet sure, maybe someone else can comment on this issue.

What I have learned is that anything 32 bit done in cuBLAS,cuFFT, or cuSparse will have a high degree of accuracy when directly compared to the same calculations in MATLAB using a ‘single’ cast for all values.

The biggest divergence between the two occurs with the use of accumulated sums. Even then the difference was not too large. A good example was a custom complex DFT kernel I wrote where I did the individual calculations in single, but the sum accumulation was in 64 bit double. Upon finish I cast back to single, and that result had little difference when compared to MATLAB. If only 32 bit floats were used the summation result had a higher absolute difference.

So the area of focus should be on preserving the accuracy of repeated calculations (sum, product etc) which are accumulated over operation of the algorithm. This applies to both atomic and non-atomic operations.

The Amber site has a section which mentions this issue, which mirrors my own findings;


Also important to note that the 32 bit calculations done in CUDA and those via a CPU had similar degrees of error, which indicates that the 32 bit operations on the GPU are not inherently less precise than those done in CPU land.

Given the huge performance difference between 32 bit and 64 bit operations, I believe it is valuable to determine when you need that extra precision and when it is not worth the sacrifice in performance.

The strategy used by the Amber folks (targeted mixed precision) appears to be the best route.

Any other observations on this topic?

Generally I find that there are two mechanisms on the GPU that tend to somewhat enhance the accuracy of single-precision computations compared to most CPUs:

(1) Use of FMA (fused multiply-add) is ubiquitous. Compared to separate FMUL/FADD still predominant on x86 CPU, this reduces rounding error, and protects against certain cases of subtractive cancellation that involved products. The latter is actually responsible for most of the improvement in codes that I have studied in detail.

(2) The fact that reductions on the GPU typically involve a tree pattern instead of a linear pattern tends to improve the accuracy of resulting sums since the likelihood of operands of similar magnitude being added is increased.

However, for iterated or complex computations there are often numerical effects that are difficult to anticipate if one is not a trained numerical analyst. These are questions like: What is more accurate, 1.0/sqrt(x) or sqrt(1.0/x)? While I know the answer to this and similar scenarios based on many years of practical experience, I am still surprised many times when I experiment to find the most favorable arrangement (in terms of accuracy) of a particular computation.

I would therefore encourage meaningful experiments against high-accuracy references. I am partial to double-double and arbitrary precision computations as reference, even though they are time consuming. I am extraordinarily cautious with same-precision comparison against other software packages or processors. Einstein supposedly stated: The man with one watch always knows what time it is, the man with two watches can never be sure.

In the past dozen years or so, there have been multiple useful publications (mostly by French authors) that examine the utility of compensated sums, compensated dot products, compensated polynomial evaluation, and the like. The techniques covered can be very helpful in eliminating crucial “accuracy bottlenecks” in computation, similar to the way in which AMBER combines single-precision computation with 64-bit fixed-point accumulators. An added bonus is that many of the techniques benefit from the presence of FMA.

I have had opportunity to use some of these accuracy-enhancing compensation techniques in real-life applications and the performance impact was often minimal, as the widening imbalance between FLOPS and memory bandwidth in recent GPU generations has lead to an increase in “dark FLOPS”. I like to joke that FLOPS have become “too cheap to meter”.

Here is (in no particular order) some relevant content available online for free:

Philippe Langlois and Nicolas Louvet. More Instruction Level Parallelism Explains the Actual Efficiency of Compensated Algorithms

Philippe Langlois and Nicolas Louvet. Solving triangular systems more accurately and efficiently

Stef Graillat, Philippe Langlois, and Nicolas Louvet. Algorithms for Accurate, Validated and Fast Polynomial Evaluation

Philippe Langlois and Nicolas Louvet. Operator Dependant Compensated Algorithms

Stef Graillat. Choosing a Twice More Accurate Dot Product Implementation

S. Graillat, Ph. Langlois, and N. Louvet. Accurate dot products with FMA

Takeshi Ogita, Siegfried M. Rump, and Shin’ichi Oishi. Accurate sum and dot product

Siegfried M. Rump. Ultimately Fast Accurate Summation

Philippe Langlois. 4ccurate 4lgorithms in Floating Point 4rithmetic

Philippe Langlois, Nicolas Louvet. Faithful Polynomial Evaluation with Compensated Horner Algorithm

Stef Graillat, Valerie Menissier-Morain. Compensated Horner scheme in complex floating point arithmetic

Stef Graillat and Valérie Ménissier-Morain. Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic

Stef Graillat, Philippe Langlois, and Nicolas Louvet. Improving the Compensated Horner Scheme with a Fused Multiply and Add

Stef Graillat. Accurate Floating-Point Product and Exponentiation

Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller. Further Analysis of Kahan’s Algorithm for the Accurate Computation of 2x2 Determinants.

Jean-Michel Muller. On the error of computing ab + cd using Cornea, Harrison and Tang’s method


Thank you Norbert for compiling this list of references!

Along the same line, we are trying to push correctly-rounded sum or dot-product as a computing primitive. Correct rounding would address both accuracy and reproducibility issues.

The overhead is obviously high in the worst case, but reasonably well-behaved sums can still be computed exactly at memory speed. In particular we find that for large sums that do not fit in caches, 4-time compensated summation is still memory-bound on current CPUs and GPUs.

Here is a tech report about it:
Sylvain Collange, Devid Defour, Stef Graillat, and Roman Iakymchuk. Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures.

Feedback is welcome!

Interesting. As an alumnus of the University of Karlsruhe (now: KIT) I am reasonably familiar with the work of U. Kulisch et. al. and his push for a correctly rounded dot product, going back to the very early 1990s, if not earlier. Is this work a continuation and/or extension of Kulisch’s work, or are you taking an entirely different approach?

This is definitely inspired by the work of Prof. Kulisch. We use a similar accumulation algorithm (block-wise updates to a large-radix carry-save representation).
However, Kulisch’s work is focused on hardware implementations for sequential processors, as far as I know. Our implementation is software, and targets parallel platforms (multi-thread and SIMD). We also mix the super-accumulation approach with compensated summation, using compensated summation to filter the “easy” cases and only resorting to super-accumulation for outliers.

So the approaches are functionally equivalent, but the implementations differ. In fact, in the parallel and “dark flops” context, I am not convinced we need any specialized hardware at all: what is necessary is support for histogram computations, such as fast atomic addition on shared memory. Fast exponent extraction, conversion to fixed-point and carry bit retrieval also help, but irregular memory accesses are the real bottleneck.