how to improve float array summation precision and stability?

john.ye · November 28, 2018, 3:00pm

I want to write a kernel which need to take the summation of a float data array; but I find the result is not precise, is there some guideline or method to improve float array summation’s precision? Thanks very much!

Robert_Crovella · November 28, 2018, 3:09pm

one possibility:

use double instead of float
another possibility:
sort the data from smallest (magnitude or absolute value) to largest. Make the input of your parallel reduction a grid-stride loop, which starts from the smallest end of the data and moves to the largest.

[url]https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf[/url]

john.ye · November 28, 2018, 3:29pm

Thanks, I tried double and the result is acceptable. But I just think that double data type need more workspace(x2) than float; And I tried Kahan algorithm but not improved; I will try your second suggestion.

njuffa · November 28, 2018, 6:27pm

I am very surprised that the use of Kahan summation did not result in any improvement; you might want to double check your implementation. For further ideas, you could peruse this survey paper:

[url]http://www.iro.umontreal.ca/~mignotte/IFT2425/Documents/AccrateSummationMethods.pdf[/url]
John Michael McNamee, “A Comparison Of Methods For Accurate Summation.” ACM SIGSAM Bulletin, Vol. 38, No. 1, March 2004, pp. 1-7

Also, have a look at these papers (the last specifically mentions GPUs):

[url]http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf[/url]
Siegfried M. Rump, “Ultimately Fast Accurate Summation.” SIAM J. Sci. Comput., Vol. 31, No. 5, Sep. 2009, pp. 3466–3502

[url]https://www.jstage.jst.go.jp/article/nolta/1/1/1_1_2/_pdf[/url]
Siegfried Rump, Takeshi Ogita, and Shinichi Oichi, “Fast high precision summation.” Nonlinear Theory and Its Applications, IEICE 1.1 (2010): 2-24

[url]https://people.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21_Fast_Sum.pdf[/url]
James Demmel and Hong Diep Nguyen, “Fast Reproducible Floating-Point Summation.” In: 21st IEEE Symposium on Computer Arithmetic (ARITH21), April 2013, pp. 163-172

Robert_Crovella · November 28, 2018, 7:59pm

I haven’t read any of the papers or studied the Kahan summation method, but it seems like there should be a fairly straightforward somewhat parallelizable algorithm to develop an answer for the most accurate possible sum.

I will assume a binary floating point representation such as IEEE-754, which consists of a fixed-width bit field containing the mantissa and a fixed-width bit field containing the exponent. For the purpose of illustration I will assume that all numbers are positive, but I don’t think extending this to handle both positive and negative numbers presents any particular challenge.

Let’s consider each number to be summed to include a high-set-bit and a low-set-bit. Each bit will have an appropriate place-value identification. Both of these bits are set, and the difference in place-value cannot exceed the width of the mantissa in the representation. In between the high-set-bit and low-set-bit for a number, there are a collection of other bits of values of zero or one, as indicated by the mantissa.

Across the array of all numbers to be summed, determine the highest place-value of the high-set-bit, and the lowest place value of the low-set-bit across the array of numbers. This could be performed in parallel using max and min reductions.

Starting from the lowest place value of the low-set-bit, up to the highest place value of the high-set-bit, do:

identify which values have a set bit at that place value. The result of this step is a zero or one, for each thread/number (this is trivially parallelizable)
perform a sum reduction on the output from the previous step (parallelizable)
add the result, if any, from the previous iteration, shifted right by one bit
using the result from the previous step, assign a zero to the final answer bit position that corresponds to the place-value for this loop iteration, if the result of the previous step is even. Otherwise assign a one to the final answer bit position for this loop iteration.
repeat for the next iteration/place-value/bit position

Although I have indicated that the above loop must proceed from the lowest set bit position to the highest set bit position, in fact it must proceed until the highest bit position and then until the result of adding the previous iteration result, shifted right by 1, is zero.

When the final result is thus assembled, it will consist of potentially a very long binary word. That very long binary word will need to be truncated/rounded to fit in the desired result representation.

The above arithmetic is essentially all integer arithmetic, and with appropriate choice of intermediate integer quantities (e.g. unsigned long long) it should be able to handle “arbitrary” array sizes.

The above method could certainly be optimized for performance. For example, arrangement of the numbers from largest to smallest would allow entire threadblocks to retire early, or even traverse over a fixed subset of the place value range, as determined by their subset of numbers.

It’s not obvious to me that any method could give a more accurate representation of the true result than this. Furthermore there should be no variability in result, from run-to-run, or from machine-to-machine.

njuffa · November 28, 2018, 8:11pm

A quick read of Robert Crovella’s description suggests that what he is proposing is similar (if not identical) to the “long accumulator” method, originally proposed for the accurate computation of dot products some forty years ago. Recent work on this (which I have not read yet):

[url]https://www.math.uu.se/digitalAssets/562/c_562622-l_1-k_nehmeier_slides.pdf[/url]
Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg, "Correctly Rounded Dot Product in CUDA. (2014)

[url]https://hal.archives-ouvertes.fr/hal-00949355v1/document[/url]
Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk, “Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures” (Sep. 2015)

Robert_Crovella · November 28, 2018, 8:30pm

This paper also provides a possibly useful survey of older techniques:

[url]https://hal.inria.fr/inria-00074262/document[/url]

john.ye · November 29, 2018, 1:28am

Thanks for these sharing, I will read these papers first.

marc.b.reynolds · January 15, 2019, 8:40am

Late to the party but Kahan summation (and other 2sum like constructions) can/will break if the compiler is allowed to re-associate an expression.

njuffa · January 15, 2019, 8:58am

The CUDA compiler is conservative (at least up to version 8.0, which is the most recent I have tried) and does not re-associate floating-point expressions the way certain compilers for CPUs do by default. The only exception to this is that it will apply contraction of floating-point add with dependent floating-point multiply into FMA at full optimization (which is default); this can be turned off globally by specifying -fmad=false on the nvcc command line.

This was a conscious design decision, as the aggressive re-association of floating-point expressions at high optimization levels seen with other compilers makes a hash out of various numerically sensitive algorithms. Since Kahan summation does not involve multiplies, FMA contraction is not in the picture when adding up vector elements as described by OP.

CUDA also offers intrinsics __fadd_rn(), __fmul_rn() (and double-precision __dadd_rn(), __dmul_rn()) to prevent FMA contraction on a case by case basis. Generally FMA contraction benefits both performance and accuracy.

Topic		Replies	Views
Maximum native precision or mixed-precision to have 21 digit floating point type? CUDA Programming and Performance mixed-precision	1	671	January 14, 2022
floating point precision on CUDA CUDA Programming and Performance	11	14768	June 8, 2010
INACCURACY OF FLOAT DATA TYPE FLOAT DATA TYPE BECOME INACCURATE NEAR ABOUT 2^15 CUDA Programming and Performance	12	2452	July 10, 2010
Reduction Reduction Reduction................. Precision Confusion Race Condition...... HELP! CUDA Programming and Performance	16	10463	December 8, 2009
Prevent ncc from applying MADD optimization precision, IEEE 754 CUDA Programming and Performance	11	10996	April 6, 2009
Reduce choice CUDA Programming and Performance	25	31	March 23, 2025
Working with large numbers Help to calculate an harmonic sum CUDA Programming and Performance	4	1732	June 23, 2009
Is there any method to enhance precision? CUDA Programming and Performance	2	4293	January 3, 2009
CUDA kernel converging to CPU double result thought G80 is single precision? CUDA Programming and Performance	6	8610	March 7, 2008
Observations related to 32 bit floating point ops and 64 bit floating point ops CUDA Programming and Performance	5	3309	March 5, 2015

how to improve float array summation precision and stability?

Related topics