how to improve float array summation precision and stability?

I want to write a kernel that needs to compute the sum of a float data array, but I find the result is not precise. Is there a guideline or method to improve the precision of float array summation? Thanks very much!

One possibility:

  • use double instead of float

Another possibility:

  • sort the data from smallest magnitude (absolute value) to largest. Make the input of your parallel reduction a grid-stride loop that starts from the smallest end of the data and moves to the largest.
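The effect of sorting by magnitude can be demonstrated on the host (a serial C++ sketch of my own, not a kernel): small addends still register against a small running sum, but are rounded away entirely once the running sum is large.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sum the data after sorting by magnitude, either ascending or descending.
// Ascending order lets small values accumulate before they meet large ones.
float sum_in_order(std::vector<float> data, bool ascending) {
    std::sort(data.begin(), data.end(),
              [ascending](float a, float b) {
                  return ascending ? std::fabs(a) < std::fabs(b)
                                   : std::fabs(a) > std::fabs(b);
              });
    float sum = 0.0f;
    for (float x : data) sum += x;
    return sum;
}
```

With one 1.0f and a million copies of 1e-8f, descending order returns exactly 1.0f (each tiny addend is rounded away against 1.0), while ascending order first accumulates the small values to roughly 0.01 and recovers their contribution.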

Thanks, I tried double and the result is acceptable. But the double data type needs twice the workspace of float. I also tried the Kahan algorithm, but it did not improve the result; I will try your second suggestion.

I am very surprised that the use of Kahan summation did not result in any improvement; you might want to double-check your implementation. For further ideas, you could peruse this survey paper:
John Michael McNamee, “A Comparison Of Methods For Accurate Summation.” ACM SIGSAM Bulletin, Vol. 38, No. 1, March 2004, pp. 1-7

Also, have a look at these papers (the last specifically mentions GPUs):
Siegfried M. Rump, “Ultimately Fast Accurate Summation.” SIAM J. Sci. Comput., Vol. 31, No. 5, Sep. 2009, pp. 3466–3502
Siegfried Rump, Takeshi Ogita, and Shin’ichi Oishi, “Fast high precision summation.” Nonlinear Theory and Its Applications, IEICE, Vol. 1, No. 1, 2010, pp. 2-24
James Demmel and Hong Diep Nguyen, “Fast Reproducible Floating-Point Summation.” In: 21st IEEE Symposium on Computer Arithmetic (ARITH21), April 2013, pp. 163-172
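Regarding the double-check: for reference, a textbook Kahan (compensated) summation loop looks like the C++ sketch below. One common failure mode is a compiler flag such as -ffast-math, which allows re-association and can optimize the compensation away entirely, making Kahan behave exactly like a naive sum.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Textbook Kahan summation: the compensation term c captures the low-order
// bits lost in each addition and feeds them back into the next one.
float kahan_sum(const std::vector<float>& data) {
    float sum = 0.0f;
    float c = 0.0f;             // running compensation for lost low-order bits
    for (float x : data) {
        float y = x - c;        // apply the correction from the previous step
        float t = sum + y;      // low-order bits of y are lost here...
        c = (t - sum) - y;      // ...and recovered here (algebraically zero)
        sum = t;
    }
    return sum;
}
```

Summing ten million copies of 0.1f, the naive float loop is off by a large margin once the running sum dwarfs each addend, while the Kahan version stays within a final rounding of the exact result.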

I haven’t read any of the papers or studied the Kahan summation method, but it seems like there should be a fairly straightforward, somewhat parallelizable algorithm for computing the most accurate possible sum.

I will assume a binary floating point representation such as IEEE-754, which consists of a fixed-width bit field containing the mantissa and a fixed-width bit field containing the exponent. For the purpose of illustration I will assume that all numbers are positive, but I don’t think extending this to handle both positive and negative numbers presents any particular challenge.

Let’s consider each number to be summed to have a high-set-bit and a low-set-bit, each identified by its place value. Both of these bits are set, and the difference in place value cannot exceed the width of the mantissa in the representation. In between the high-set-bit and the low-set-bit of a number is a collection of other bits with values of zero or one, as given by the mantissa.
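As a concrete illustration (my own sketch, using the standard frexp decomposition), the place values of the high-set-bit and low-set-bit of a positive, normal float can be extracted like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// For a positive, normal float x, find the place values (powers of two) of
// its high-set-bit and low-set-bit. Their difference never exceeds 23, the
// number of stored mantissa bits.
void bit_range(float x, int& hi, int& lo) {
    int e;
    float m = std::frexp(x, &e);                  // x = m * 2^e, m in [0.5, 1)
    hi = e - 1;                                   // high set bit: place value 2^(e-1)
    uint32_t mant = (uint32_t)std::ldexp(m, 24);  // 24-bit integer mantissa
    int tz = 0;                                   // trailing zeros of the mantissa
    while ((mant & 1u) == 0u) { mant >>= 1; ++tz; }
    lo = e - 24 + tz;                             // low set bit place value
}
```

For example, 6.0f (binary 110.0) gives hi = 2 and lo = 1, and 1.5f (binary 1.1) gives hi = 0 and lo = -1.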

Across the array of all numbers to be summed, determine the highest place value of any high-set-bit and the lowest place value of any low-set-bit. This could be performed in parallel using max and min reductions.

Starting from the lowest place value of the low-set-bit, up to the highest place value of the high-set-bit, do:

  • identify which values have a set bit at that place value. The result of this step is a zero or one, for each thread/number (this is trivially parallelizable)
  • perform a sum reduction on the output from the previous step (parallelizable)
  • add the result, if any, from the previous iteration, shifted right by one bit (this is the carry into the current place value)
  • using the result from the previous step, set the bit of the final answer at the place value for this loop iteration: zero if the result is even, one if it is odd
  • repeat for the next iteration/place-value/bit position

Although I have indicated that the above loop proceeds from the lowest set-bit position to the highest, in fact it must continue past the highest bit position until the carry (the previous iteration’s result shifted right by one) becomes zero.

When the final result is thus assembled, it will consist of potentially a very long binary word. That very long binary word will need to be truncated/rounded to fit in the desired result representation.

The above arithmetic is essentially all integer arithmetic, and with appropriate choice of intermediate integer quantities (e.g. unsigned long long) it should be able to handle “arbitrary” array sizes.
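A serial C++ sketch of the whole scheme (my own illustration, positive inputs only, with an unsigned long long count as suggested above; the min/max scans and the per-bit count are the pieces that would become parallel reductions):

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cmath>
#include <cstdint>
#include <vector>

// Bit-by-bit summation of positive floats: decompose each value into a
// 24-bit integer mantissa and an exponent, then walk place values from
// lowest to highest, counting set bits and propagating carries.
double bitwise_sum(const std::vector<float>& data) {
    std::vector<uint32_t> mant(data.size());
    std::vector<int> shift(data.size());           // place value of mantissa bit 0
    int lo = INT_MAX, hi = INT_MIN;
    for (size_t i = 0; i < data.size(); ++i) {
        int e;
        mant[i] = (uint32_t)std::ldexp(std::frexp(data[i], &e), 24);
        shift[i] = e - 24;                         // data[i] == mant[i] * 2^shift[i]
        lo = std::min(lo, shift[i]);               // min reduction in parallel
        hi = std::max(hi, shift[i] + 23);          // max reduction in parallel
    }
    double result = 0.0;
    unsigned long long carry = 0;                  // carry from lower place values
    for (int p = lo; p <= hi || carry != 0; ++p) {
        unsigned long long count = carry;
        for (size_t i = 0; i < data.size(); ++i) { // sum reduction in parallel
            int b = p - shift[i];
            if (b >= 0 && b < 24) count += (mant[i] >> b) & 1u;
        }
        if (count & 1u) result += std::ldexp(1.0, p); // answer bit for place p
        carry = count >> 1;                        // shifted right by one bit
    }
    return result;  // assembled answer; rounds only if it exceeds double precision
}
```

As a sanity check, summing {16777216.0f, 1.0f} this way yields 16777217.0 exactly, whereas a plain float sum loses the 1.0 entirely.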

The above method could certainly be optimized for performance. For example, arrangement of the numbers from largest to smallest would allow entire threadblocks to retire early, or even traverse over a fixed subset of the place value range, as determined by their subset of numbers.

It’s not obvious to me that any method could give a more accurate representation of the true result than this. Furthermore there should be no variability in result, from run-to-run, or from machine-to-machine.

A quick read of Robert Crovella’s description suggests that what he is proposing is similar (if not identical) to the “long accumulator” method, originally proposed for the accurate computation of dot products some forty years ago. Recent work on this (which I have not read yet):
Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg, “Correctly Rounded Dot Product in CUDA.” (2014)
Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk, “Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures” (Sep. 2015)

This paper also provides a possibly useful survey of older techniques.

Thanks for sharing these; I will read the papers first.

Late to the party, but Kahan summation (and other 2Sum-like constructions) can and will break if the compiler is allowed to re-associate floating-point expressions.

The CUDA compiler is conservative (at least up to version 8.0, which is the most recent I have tried) and does not re-associate floating-point expressions the way certain compilers for CPUs do by default. The only exception to this is that it will apply contraction of floating-point add with dependent floating-point multiply into FMA at full optimization (which is default); this can be turned off globally by specifying -fmad=false on the nvcc command line.

This was a conscious design decision, as the aggressive re-association of floating-point expressions at high optimization levels seen with other compilers makes a hash out of various numerically sensitive algorithms. Since Kahan summation does not involve multiplies, FMA contraction is not in the picture when adding up vector elements as described by OP.

CUDA also offers intrinsics __fadd_rn(), __fmul_rn() (and double-precision __dadd_rn(), __dmul_rn()) to prevent FMA contraction on a case by case basis. Generally FMA contraction benefits both performance and accuracy.
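As an aside on why FMA can help accuracy: because a fused multiply-add rounds only once, it can recover the exact rounding error of a product. A small host-side C++ illustration of this classic "two-product" trick, using std::fma (my own example, not CUDA device code):

```cpp
#include <cassert>
#include <cmath>

// fma(a, b, -p) computes a*b - p with no intermediate rounding, so when
// p is the rounded product a*b, the result is the exact rounding residual.
// This trick underlies many compensated summation and dot-product schemes.
double product_error(double a, double b) {
    double p = a * b;            // rounded product
    return std::fma(a, b, -p);   // exact residual: a*b - p
}
```

For a = b = 1 + 2^-27, the true product 1 + 2^-26 + 2^-54 rounds to p = 1 + 2^-26, and product_error returns exactly 2^-54; for an exactly representable product like 3.0 * 0.5 the residual is zero.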