atomicAdd, atomicExch and atomicCAS give random results

Hi there,
I need to use the atomicAdd function on double- and single-precision floating-point numbers. Since I wasn’t sure how the performance of CUDA’s built-in atomicAdd compares with the custom versions (implemented through atomicCAS or atomicExch), I’ve tested all 4 versions that I know of:

  1. double precision through atomicCAS, as mentioned in the official documentation

  2. float precision through atomicCAS, similar to the above

  3. float precision through atomicExch

  4. CUDA’s built-in atomicAdd for float, available only on compute capability >= 2.0
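For reference, versions 1 and 2 might look roughly like the sketch below. This is not the attached main.cu: the function names (atomicAddDoubleCAS, atomicAddFloatCAS), the addOnes kernel and the runAddOnes helper are hypothetical, and the compare-and-swap loop follows the pattern given in the CUDA C Programming Guide. Every thread adds exactly 1.0, which is exactly representable, so the order of the additions cannot change the total here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Version 1 (sketch): double-precision add built on a 64-bit atomicCAS.
__device__ double atomicAddDoubleCAS(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); retry if another thread got there first.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // integer comparison, so a NaN cannot hang the loop
    return __longlong_as_double(old);
}

// Version 2 (sketch): the same idea for float, using a 32-bit atomicCAS.
__device__ float atomicAddFloatCAS(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

// Every thread adds 1.0 to the same two addresses.
__global__ void addOnes(double* dsum, float* fsum)
{
    atomicAddDoubleCAS(dsum, 1.0);
    atomicAddFloatCAS(fsum, 1.0f);
}

// Hypothetical host helper: 4 blocks x 256 threads = 1024 additions.
void runAddOnes(double* hostDouble, float* hostFloat)
{
    double* dsum;
    float* fsum;
    cudaMalloc(&dsum, sizeof(double));
    cudaMalloc(&fsum, sizeof(float));
    cudaMemset(dsum, 0, sizeof(double));  // all-zero bytes represent 0.0
    cudaMemset(fsum, 0, sizeof(float));
    addOnes<<<4, 256>>>(dsum, fsum);
    cudaMemcpy(hostDouble, dsum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostFloat, fsum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dsum);
    cudaFree(fsum);
}

int main()
{
    double d = 0.0;
    float f = 0.0f;
    runAddOnes(&d, &f);
    printf("double: %.1f, float: %.1f\n", d, f);  // both should be 1024.0
    return 0;
}
```

With values that are not exactly representable, the same functions remain atomic but the total can vary with the order in which the threads win the compare-and-swap.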

The test function is simply \sum_{n=1}^{N} \log(n), with all threads writing to the same memory address. Surprisingly, for large N (>= 1000), only version 1 gives correct and stable results; all the other versions behave randomly, as if additive noise were superimposed on the correct result. The random error is of order 1e-2, large enough to make the code useless for scientific computing. I can’t figure out the reason.
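A stripped-down version of such a test could look like the sketch below. It is an illustration only, not the attached main.cu: it uses the built-in float atomicAdd (version 4), and the names sumLogAtomicKernel, sumLogAtomic and sumLogReference are hypothetical. The host computes a double-precision reference so that the relative error of each run can be printed.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Each thread handles one n and adds log(n) to the same address.
__global__ void sumLogAtomicKernel(float* sum, int N)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (n <= N) {
        atomicAdd(sum, logf((float)n));  // built-in float atomicAdd, needs sm_20 or later
    }
}

// Hypothetical helper: runs the kernel once and returns the float total.
float sumLogAtomic(int N)
{
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    float* dSum;
    cudaMalloc(&dSum, sizeof(float));
    cudaMemset(dSum, 0, sizeof(float));
    sumLogAtomicKernel<<<blocks, threads>>>(dSum, N);
    float hSum = 0.0f;
    cudaMemcpy(&hSum, dSum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSum);
    return hSum;
}

// Sequential double-precision reference computed on the host.
double sumLogReference(int N)
{
    double s = 0.0;
    for (int n = 1; n <= N; ++n) {
        s += log((double)n);
    }
    return s;
}

int main()
{
    const int N = 1000;
    const double ref = sumLogReference(N);
    for (int run = 0; run < 3; ++run) {
        float s = sumLogAtomic(N);
        printf("run %d: sum = %.6f, relative error = %.3e\n",
               run, s, fabs(s - ref) / ref);
    }
    return 0;
}
```

Printing the relative error next to the absolute sum makes it easier to judge whether the run-to-run differences are larger than single precision can be expected to deliver.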

Here’s the code compiled on Tesla C2070 (and on T10 also) with
nvcc main.cu -o main -I ~/include -gencode arch=compute_20,code=sm_20 -Xcompiler="-fpermissive"

Thanks for any suggestions.
main.cu (4.07 KB)

Note that the result of a floating-point addition depends on the order in which the operations are performed, and whenever atomic operations are needed, that order is undefined. That is why you do not get reproducible results.
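The order dependence can be seen without a GPU at all. The following host-only sketch (the function names are made up for illustration) adds the same three float values in two different orders, and then sums log(n) forward and backward in single precision. The volatile temporaries only serve to force each intermediate result to be rounded to float.

```cuda
#include <cstdio>
#include <cmath>

// (big + small) - big: small is lost if it is below half an ulp of big.
float addThenCancel(float big, float small)
{
    volatile float t = big + small;  // rounded to float here
    return t - big;
}

// (big - big) + small: the cancellation happens first, so small survives.
float cancelThenAdd(float big, float small)
{
    volatile float t = big - big;
    return t + small;
}

// Single-precision sum of log(n), n = 1..N, in increasing order.
float sumLogForward(int N)
{
    float s = 0.0f;
    for (int n = 1; n <= N; ++n) s += logf((float)n);
    return s;
}

// The same terms, added in decreasing order.
float sumLogBackward(int N)
{
    float s = 0.0f;
    for (int n = N; n >= 1; --n) s += logf((float)n);
    return s;
}

int main()
{
    // The spacing between adjacent floats near 1e8 is 8, so adding 1 changes nothing.
    printf("(1e8 + 1) - 1e8 = %.1f\n", addThenCancel(1.0e8f, 1.0f));  // prints 0.0
    printf("(1e8 - 1e8) + 1 = %.1f\n", cancelThenAdd(1.0e8f, 1.0f));  // prints 1.0

    const int N = 100000;
    printf("forward  sum of log(n): %.4f\n", sumLogForward(N));
    printf("backward sum of log(n): %.4f\n", sumLogBackward(N));
    return 0;
}
```

With atomics, the hardware effectively picks a different summation order on every run, so the low-order digits of a float total can change from run to run in the same way.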

If you want bit-for-bit reproducible results (I usually do), use a reduction instead of atomic addition. If you think that a relative error of 1.65e-7 is not acceptable when using float variables (I usually don’t), recalibrate your expectations. Google for David Goldberg, “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, to find a good read on the subject.
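A minimal sketch of what such a reduction could look like is given below, assuming a block size that is a power of two. The names blockSumLog and sumLogReduction are hypothetical. Each block sums its terms with a shared-memory tree, and the host then adds the per-block partial sums in a fixed order, so the summation order is the same on every run.

```cuda
#include <cstdio>
#include <cmath>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

const int THREADS = 256;  // assumed block size, must be a power of two

// Each block reduces its THREADS terms of log(n) in shared memory.
__global__ void blockSumLog(float* partial, int N)
{
    __shared__ float s[THREADS];
    int tid = threadIdx.x;
    int n = blockIdx.x * blockDim.x + tid + 1;
    s[tid] = (n <= N) ? logf((float)n) : 0.0f;
    __syncthreads();
    // Tree reduction: the pairing of terms is fixed by the thread indices.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical helper: returns the float sum of log(n), n = 1..N.
float sumLogReduction(int N)
{
    const int blocks = (N + THREADS - 1) / THREADS;
    float* dPartial;
    cudaMalloc(&dPartial, blocks * sizeof(float));
    blockSumLog<<<blocks, THREADS>>>(dPartial, N);
    std::vector<float> hPartial(blocks);
    cudaMemcpy(hPartial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dPartial);
    // Second stage on the host, always in block order.
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += hPartial[b];
    return total;
}

int main()
{
    const int N = 100000;
    double ref = 0.0;
    for (int n = 1; n <= N; ++n) ref += log((double)n);

    float a = sumLogReduction(N);
    float b = sumLogReduction(N);
    printf("run 1: %.4f, run 2: %.4f, identical: %s\n",
           a, b, (a == b) ? "yes" : "no");
    printf("relative error vs double reference: %.3e\n", fabs(a - ref) / ref);
    printf("float machine epsilon: %.3e\n", FLT_EPSILON);
    return 0;
}
```

Comparing the printed relative error with the float machine epsilon (about 1.19e-7) gives a feel for how close a single-precision sum can realistically get to the double-precision reference.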