I need to accumulate results across several threads, i.e. something like
__shared__ float result;
float local = …;
result += local;
For this to work properly, the final += of course needs to be atomic and to sync across threads. Is there any way to do this? There doesn’t seem to be an atomic add for floating-point values, only for integers.
What you’re asking is extremely common and useful. It’s called a reduction.
Look at the SDK projects for a very well-documented reduction example.
Reductions are sometimes done per-warp, per-block, or per-kernel, though kernel-wide ones are awkward (not hard, just annoying).
Atomic operations can also do reductions, but that’s usually inefficient because the threads contend for the same address. It can still be the best way in some applications, and it is most useful for kernel-wide reductions: often you’d do a parallel reduction over a whole block, then use just one atomic add to accumulate that block’s result into the kernel-wide sum.
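A minimal sketch of that block-then-atomic pattern, not the SDK's reduction code. It assumes a GPU of compute capability 2.0 or later, where atomicAdd() has a float overload, and a power-of-two block size. The names sumKernel, sumOnDevice and kBlockSize are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical block size; must be a power of two

// Each block reduces its slice in shared memory, then thread 0 issues
// a single atomic add into the kernel-wide result.
__global__ void sumKernel(const float* in, float* result, int n) {
    __shared__ float partial[kBlockSize];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the input contribute zero.
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // One atomic per block (assumes float atomicAdd, i.e. sm_20 or newer).
    if (tid == 0) {
        atomicAdd(result, partial[0]);
    }
}

// Host-side helper (illustrative only, assumes n > 0, no error checking).
float sumOnDevice(const float* hostIn, int n) {
    float* dIn = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(float));  // all-zero bits are 0.0f

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    sumKernel<<<blocks, kBlockSize>>>(dIn, dResult, n);

    float result = 0.0f;
    // This copy waits for the kernel on the default stream to finish.
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dResult);
    return result;
}
```

With 256 threads per block this issues one atomic per 256 inputs instead of one per input. Note that the order in which blocks add their partial sums is not fixed, so the rounding of the final float can differ slightly from run to run.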
Floating-point accumulation is just as easy. You can roll your own floating-point atomic ops by using atomicExch(); there’s a thread on this forum about it from around a month ago.
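For reference, here is one way a hand-rolled float atomic add can look. This sketch uses a compare-and-swap loop with atomicCAS() instead of the atomicExch() approach from that thread, so it is a different technique and not the code posted there. It assumes hardware with integer atomicCAS() on the memory space being used; atomicAddFloat, accumulate and accumulateOnDevice are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Hand-rolled atomic add for float, built on the integer atomicCAS().
// Returns the value stored at address before the add.
__device__ float atomicAddFloat(float* address, float val) {
    int* addressAsInt = reinterpret_cast<int*>(address);
    int old = *addressAsInt;
    int assumed;
    do {
        assumed = old;
        // Compute the new sum from the value we believe is stored, and
        // swap it in only if nobody changed the location in the meantime.
        int desired = __float_as_int(val + __int_as_float(assumed));
        old = atomicCAS(addressAsInt, assumed, desired);
    } while (assumed != old);  // another thread got in first: retry
    return __int_as_float(old);
}

// Naive accumulation: every thread adds its element directly.
__global__ void accumulate(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAddFloat(result, in[i]);
    }
}

// Host-side helper (illustrative only, assumes n > 0, no error checking).
float accumulateOnDevice(const float* hostIn, int n) {
    float* dIn = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(float));  // all-zero bits are 0.0f

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    accumulate<<<blocks, threads>>>(dIn, dResult, n);

    float result = 0.0f;
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dResult);
    return result;
}
```

Having every thread hit the same address like this is the slow path; in practice you would reduce within the block first and call the atomic once per block.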