I need to use atomicAdd on double- and single-precision floating-point numbers. Since I was not sure how CUDA's built-in atomicAdd compares with custom versions (implemented through atomicCAS or atomicExch), I tested all 4 versions that I know of:

1. Double precision through atomicCAS, as given in the official documentation

2. Single precision (float) through atomicCAS, analogous to the above

3. Single precision (float) through atomicExch

4. CUDA's built-in atomicAdd for float, available only on compute capability >= 2.0
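For readers who want something concrete to look at, here is a minimal sketch of what the three custom variants usually look like. This is not the code from my test; the function names (atomicAddDoubleCAS, atomicAddFloatCAS, atomicAddFloatExch) are made up for illustration. The double version follows the pattern shown in the CUDA C Programming Guide; the float versions are the commonly seen analogues and are assumptions about the general technique.

```cuda
// Sketch only: hypothetical names, not the code used in the original test.

// Variant 1: double-precision add built on the 64-bit atomicCAS.
__device__ double atomicAddDoubleCAS(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); succeeds only if nobody changed the value.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // compare bit patterns, so NaN cannot loop forever
    return __longlong_as_double(old);
}

// Variant 2: the same idea for float, built on the 32-bit atomicCAS.
__device__ float atomicAddFloatCAS(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

// Variant 3: float add built on atomicExch.
// The inner exchange takes the current total out and leaves 0 behind;
// the outer exchange writes total + pending back. If the outer exchange
// returns something non-zero, another thread deposited a value in between,
// so that value is carried into the next round.
__device__ void atomicAddFloatExch(float* address, float val)
{
    float pending = val;
    while ((pending = atomicExch(address, atomicExch(address, 0.0f) + pending)) != 0.0f) {
        // retry with the value picked up from the other thread
    }
}
```

Note that the atomicExch variant only guarantees that no contribution is lost once all threads have finished; a thread that reads the address while others are still adding may see a partial value.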
The test function is simply \sum_{n=1}^{N} \log(n), with all threads writing to the same memory address. Surprisingly, for large N (>= 1000) only version 1 gives correct and stable results; all the other versions behave randomly, as if additive noise were superimposed on the correct result. The random error is of order 1e2, large enough to make the code useless for scientific computing. I can't figure out the reason.
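To make the test setup concrete, here is a minimal, self-contained sketch of such a test using the built-in float atomicAdd (version 4). It is an illustration under assumptions, not my original code: the kernel name sumLogKernel, the launch configuration and N = 1000 are all chosen for the example. Any of the custom variants from the list above could be substituted for the atomicAdd call.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Thread n (1-based) adds log(n) to a single shared accumulator.
__global__ void sumLogKernel(float* result, int N)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (n <= N) {
        atomicAdd(result, logf((float)n));  // built-in float atomicAdd, needs sm_20 or newer
    }
}

int main()
{
    const int N = 1000;        // assumed problem size for the sketch
    const int threads = 256;   // assumed block size
    const int blocks = (N + threads - 1) / threads;

    float* d_result = NULL;
    cudaMalloc(&d_result, sizeof(float));
    cudaMemset(d_result, 0, sizeof(float));  // the accumulator must start at zero

    sumLogKernel<<<blocks, threads>>>(d_result, N);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_result);

    // Host reference, accumulated sequentially in double precision.
    double ref = 0.0;
    for (int n = 1; n <= N; ++n) {
        ref += log((double)n);
    }

    printf("GPU (float atomicAdd):  %.6f\n", h_result);
    printf("CPU (double reference): %.6f\n", ref);
    printf("abs diff:               %.6e\n", fabs((double)h_result - ref));
    return 0;
}
```

Because the order in which the atomic additions are applied is not fixed and floating-point addition is not associative, the float result can differ slightly from run to run; comparing against the double-precision host sum shows how large the deviation actually is.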
Here's the code, compiled on a Tesla C2070 (and also on a T10) with
nvcc main.cu -o main -I ~/include -gencode arch=compute_20,code=sm_20 -Xcompiler="-fpermissive"
Thanks for any suggestions.
