atomicAdd Kahan Summation

I’m writing a kernel that calculates the values of some histogram buckets.

At the end of the kernel, after the histogram buckets have been calculated, I figured I could easily calculate the total sum of all buckets by using an atomicAdd per block (and bucket), avoiding another call to a reduction kernel. However, I’m running into some precision issues.
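Roughly what I mean, heavily simplified (the buffer names here are made up, the real kernel is more involved):

__global__ void histogramKernel(float* histogram, float* totalSum, int numBuckets)
{
    extern __shared__ float blockBuckets[];   // this block's partial histogram
    // ... per-block histogram calculation fills blockBuckets ...
    __syncthreads();

    for (int b = threadIdx.x; b < numBuckets; b += blockDim.x) {
        atomicAdd(&histogram[b], blockBuckets[b]);
        atomicAdd(totalSum, blockBuckets[b]); // float accumulation of the total; this is where precision suffers
    }
}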

I was wondering if it’s possible to perform Kahan summation with atomicAdd?

I found an implementation of atomicAdd for doubles, which I could also attempt to use, but I think I’d prefer Kahan summation on floats instead of switching to doubles.

__device__ double atomicAdd(double* __restrict address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    // Note: uses integer comparison to avoid hanging in case of NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}

Kahan summation as plain C, for reference:

float kahanSum(const float* input, int n)
{
    float sum = 0.0f;
    float c = 0.0f;              // a running compensation for lost low-order bits
    for (int i = 0; i < n; i++) {
        float y = input[i] - c;  // on the first pass c is zero
        float t = sum + y;       // sum is big, y small, so low-order digits of y are lost
        c = (t - sum) - y;       // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t;                 // algebraically c should always be zero; beware overly-aggressive optimizing compilers
        // next time around, the lost low part is added back to y in a fresh attempt
    }
    return sum;
}

I’m looking for something with the following signature (using a float2 that holds both the sum and the compensation c):

__device__ void atomicAddKahan(float2* __restrict address, const float val)

Can I use unions in CUDA? Something like this:

// Lets a float2 (8-byte aligned in CUDA) be reinterpreted as a single 64-bit word for atomicCAS.
union float2UllUnion {
    float2 f;
    unsigned long long int ull;
};

__device__ void atomicAddKahan(float2* __restrict address, const float val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    float2UllUnion old, assumed, tmp;
    old.ull = *address_as_ull;
    do {
        assumed = old;
        tmp = assumed;
        // Kahan summation: f.x holds the running sum, f.y holds the compensation c
        const float y = val - tmp.f.y;
        const float t = tmp.f.x + y;
        tmp.f.y = (t - tmp.f.x) - y;
        tmp.f.x = t;

        old.ull = atomicCAS(address_as_ull, assumed.ull, tmp.ull);

    } while (assumed.ull != old.ull);

}

I’m not sure if I’m going about this the right way, so thanks for any help or suggestions.

It seems to work. I compared it against plain loop accumulation and Kahan summation on the CPU, and against a regular atomicAdd on the GPU.

Typical results look like this (results cast to float, with a long double Kahan sum as the reference):

average absolute sum-reference differences (65536 floats, 128 runs): 
 0.0004670023918 (float, loop acum)
 0.0000000000000 (double, loop acum)
 0.0000000000000 (long double, loop acum)
 0.0000014472753 (float, kahan)
 0.0000000000000 (double, kahan)
 0.0000000000000 (long double, kahan)
 0.0000015497208 (atomicAddKahan) (GPU)
 0.0004895096645 (atomicAdd) (GPU)
 0.0000098748133 (blockReduce(float) + AtomicAddKahan) (GPU)
 0.0000314032659 (blockReduce(float) + AtomicAdd(float)) (GPU)
 0.0000088717788 (blockReduce(float) + AtomicAdd(double)) (GPU)

edit:

If I change the distribution of random floats from the range -1.0 to 1.0 to only positive floats with a skewed distribution (many small values, a few bigger ones, which is what my own code produces), then I get these results:

average absolute sum-reference differences (65536 floats, 128 runs): 
 0.0197448730469 (float, loop acum)
 0.0000000000000 (double, loop acum)
 0.0000000000000 (long double, loop acum)
 0.0000000000000 (float, kahan)
 0.0000000000000 (double, kahan)
 0.0000000000000 (long double, kahan)
 0.0000038146973 (atomicAddKahan) (GPU)
 0.0193004608154 (atomicAdd) (GPU)
 0.0000076293945 (blockReduce(float) + AtomicAddKahan) (GPU)
 0.0006484985352 (blockReduce(float) + AtomicAdd(float)) (GPU)
 0.0000057220459 (blockReduce(float) + AtomicAdd(double)) (GPU)

It’s a bit strange that the CPU Kahan sum does better than the GPU one. A bug?

By the way, the plain atomicAddKahan and atomicAdd results aren’t realistic use cases, since every thread attempts to perform atomic operations; they’re just for testing.
BlockReduce performs shuffle reductions within each warp, stores the per-warp results to shared memory, then the first warp shuffles again, and finally thread 0 performs the atomicAdd.
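A simplified sketch of that blockReduce (assumes blockDim.x is a multiple of 32, up to 1024 threads):

__device__ float blockReduceSum(float val)
{
    __shared__ float partials[32];            // one partial sum per warp
    const int lane   = threadIdx.x & 31;
    const int warpId = threadIdx.x >> 5;

    // shuffle reduction within each warp
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0)
        partials[warpId] = val;               // warp leaders store their partial sums
    __syncthreads();

    // the first warp reduces the per-warp partials with another shuffle pass
    if (warpId == 0) {
        val = (lane < (blockDim.x >> 5)) ? partials[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;                               // the block sum ends up in thread 0
}

// ...and then thread 0 of each block does the single atomic, e.g.:
// if (threadIdx.x == 0) atomicAddKahan(globalAcc, blockSum);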

edit2:

Added a KBN variant, which seems to be even more accurate:

average absolute sum-reference differences (65536 floats, 256 runs):
 0.0000474131521 (float, loop acum)
 0.0000000000000 (double, loop acum)
 0.0000000000000 (long double, loop acum)
 0.0000004516041 (float, kahan)
 0.0000000000000 (double, kahan)
 0.0000000000000 (long double, kahan)
 0.0000000000000 (float, KBN)
 0.0000003301539 (atomicAddKahan) (GPU)
 0.0000467669888 (atomicAdd) (GPU)
 0.0000011909142 (blockReduce(float) + AtomicAddKahan) (GPU)
 0.0000033259712 (blockReduce(float) + AtomicAdd(float)) (GPU)
 0.0000011321245 (blockReduce(float) + AtomicAdd(double)) (GPU)
 0.0000000000000 (atomicAddKbn) (GPU)
 0.0000011321245 (blockReduce(float) + AtomicAddKbn(float)) (GPU)

code for the KBN variant:

__device__ void atomicAddKbn(float2* __restrict address, const float val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    float2UllUnion old, assumed, tmp;
    old.ull = *address_as_ull;
    do {
        assumed = old;
        tmp = assumed;

        // Kahan and Babuska summation, Neumaier variant
        float t = tmp.f.x + val;
        tmp.f.y += (fabsf(tmp.f.x) >= fabsf(val)) ? ( (tmp.f.x-t) + val ) : ( (val-t) + tmp.f.x );
        tmp.f.x = t;

        old.ull = atomicCAS(address_as_ull, assumed.ull, tmp.ull);

    } while (assumed.ull != old.ull);

}
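Note that with the KBN variant the compensation still has to be folded in at the end, so the total is read back as .x + .y once all the atomics are done (with the plain Kahan version above, the sum is already in .x).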

Interesting, thanks for sharing.

I have also used a similar technique with that 64-bit atomicAdd prototype for an atomicAdd() on the complex float2 type (real part in the first 32 bits and imaginary part in the second 32 bits, in the same 64-bit word).

It turns out that technique is about 30% faster than performing two separate atomicAdd()s on two floats in different buffers.

There are actually a number of ways one can split up the 64-bit word for atomic update purposes. One could use the first 32 bits as an int holding the index of a max value in a large array, and the second 32 bits as a float holding the actual max value at that location.
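A rough sketch of the complex-valued atomicAdd idea (untested as written here, reusing the float2UllUnion from above; not necessarily the exact code I use):

__device__ float2 atomicAddComplex(float2* __restrict address, const float2 val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    float2UllUnion old, assumed, tmp;
    old.ull = *address_as_ull;
    do {
        assumed = old;
        tmp = assumed;
        tmp.f.x += val.x;   // real part
        tmp.f.y += val.y;   // imaginary part
        old.ull = atomicCAS(address_as_ull, assumed.ull, tmp.ull);
    } while (assumed.ull != old.ull);
    return old.f;           // previous value, mirroring atomicAdd’s return convention
}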

Cool!

Interesting! I would have expected independent atomicAdds to be faster in most cases, but on second thought it makes sense.

Concerning Kahan summation, it could also be implemented with independent atomicAdds for the high part and low part. For instance (untested):

__device__ void atomicAddKahan2(float2* __restrict address, float val)
{
    float oldacc = atomicAdd(&address->x, val); // Accumulate high-order bits
    float newacc = oldacc + val;                // Recover addition result
    float r = val - (newacc - oldacc);          // Recover rounding error using Fast2Sum, assumes |oldacc| >= |val|
    atomicAdd(&address->y, r);                  // Accumulate low-order bits
}