Alternate version of double precision AtomicAdd()

I have been using the ‘hack’ version listed in the CUDA C programming guide, but am wondering if there are any other implementations which may be more efficient.
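For reference, the CAS-loop version from the CUDA C Programming Guide (the 'hack' mentioned above; reproduced from memory and renamed to avoid clashing with the built-in overloads, so check it against your copy of the guide):

```cuda
// Double-precision atomicAdd emulated with atomicCAS on the 64-bit
// bit pattern, for devices without a hardware double atomicAdd.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed *address
    return __longlong_as_double(old);
}
```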

The only other method I can think of is to use the longlong2 type to represent the number as a fraction (use two atomicAdds on the long long pair, then later cast the division into a double result), but that may be too limited and will take up more memory.

I have to believe there are some other implementations out there.

Your idea of using a large fixed-point accumulator (if I understand it correctly) makes sense. If you know the order of magnitude of your numbers, even a 64-bit integer may be enough. If scaled properly, a 64-bit fixed-point number actually offers higher precision than a 64-bit floating-point number (which has a 53-bit mantissa).

For instance, assuming an accumulator of type long long:
atomicAdd(&accumulator, llrint(scalbn(x, -(MAX_EXP-63))));
(you may have to cast &accumulator to unsigned long long* as there is no 64-bit atomicAdd on signed integers)

where MAX_EXP is adjusted to be large enough to guarantee that no input, output or intermediate value ever exceeds 2^MAX_EXP in magnitude. It is best to make a tight estimate, as overestimating the bound increases the rounding error.

Then you can convert the final result back to floating point with scalbn((double)accumulator, MAX_EXP-63).

An interesting side effect is that the sum becomes deterministic. Unlike a floating-point sum, the result of the fixed-point sum does not depend on the summation order.