double precision atomicAdd() problem

CUDA version : 6.5
GPU : Tesla K40c
compute capability : 3.5

For double-precision atomicAdd() I use this code:

__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
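
As a sketch of how this might be used (kernel name, launch configuration, and variable names here are illustrative, not from the original post), each thread can atomically accumulate its element into a single double:

```cuda
// Hypothetical reduction kernel: every thread adds one array element into
// *out using the CAS-loop atomicAdd defined above (needed on sm_35, which
// has no hardware double-precision atomicAdd).
__global__ void sumKernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, in[i]);
    }
}
```

Note that the order in which the threads' additions land is not deterministic, which matters for the question below.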


However, the result produced with atomicAdd() differs from the result of the equivalent CPU code from about the 10th decimal place onward.
Are these differences inevitable?

They might be “inevitable”.
People who expect exact duplication of floating point results between host and device computations are frequently disappointed.

Floating point calculations may produce different results depending on the actual order of operations. Since parallel code running on the device will execute a given algorithm with possibly a different order of operations than the “same” algorithm running on the host, these differences pop up.

If you google “What Every Computer Scientist Should Know About Floating-Point Arithmetic” you will find a detailed treatment of these issues.