I’m writing a kernel that calculates the values of some histogram buckets.

At the end of the kernel, after the histogram buckets have been calculated, I figured I could easily calculate the total sum of all buckets by using an atomicAdd for each block (and bucket), avoiding another call to a reduction kernel, but I’m running into some precision issues.

I was wondering if it’s possible to perform Kahan summation with atomicAdd?

I found an implementation of atomicAdd for doubles, which I could also attempt to use, but I think I’d prefer Kahan summation on floats instead of switching to doubles.

```
__device__ double atomicAdd(double* __restrict address, double val)
{
unsigned long long int* address_as_ull = (unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
```

Pseudo code demonstrating Kahan summation:

```
function KahanSum(input)
var sum = 0.0
var c = 0.0 // A running compensation for lost low-order bits.
for i = 1 to input.length do
var y = input[i] - c // So far, so good: c is zero.
var t = sum + y // Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
// Next time around, the lost low part will be added to y in a fresh attempt.
return sum
```

I’m looking for something with the following signature (Using a float2 containing both sum and c.):

```
__device__ void atomicAddKahan(float2* __restrict address, const float val)
```

Can I use unions in CUDA? Something like this:

```
union float2UllUnion {
float2 f;
unsigned long long int ull;
};
__device__ void atomicAddKahan(float2* __restrict address, const float val)
{
unsigned long long int* address_as_ull = (unsigned long long int*)address;
float2UllUnion old, assumed, tmp;
old.ull = *address_as_ull;
do {
assumed = old;
tmp = assumed;
// kahan summation
const float y = val - tmp.f.y;
const float t = tmp.f.x + y;
tmp.f.y = (t - tmp.f.x) - y;
tmp.f.x = t;
old.ull = atomicCAS(address_as_ull, assumed.ull, tmp.ull);
} while (assumed.ull != old.ull);
}
```

I’m not sure if I’m going about this the right way, so thanks for any help or suggestions.