Speed of double precision CUDA atomic operations on Kepler K20

I’m comparing the performance of atomic operations in double precision arithmetics between a Fermi GT540M card and a Kepler K20.

I have a kernel performing, among other operations, atomic additions. I’m using the device function

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                          (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
            old = atomicCAS(address_as_ull, assumed,__double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

The presentation Inside Kepler at http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0642-GTC2012-Inside-Kepler.pdf
promises 2-10x speedup, but I observe about 1.3.

I recognize that the speedup depends also on the other operations performed by the kernel, but my question is: I’m using the right way to deal with atomic operations in double precision arithmetics on a K20, or there exists, for that architecture, a faster way?

Thank you very much for any advice.

I think you can attribute the 1.3x speedup to the fact that atomicAdd(double*,double) is the only other data type besides S64 that isn’t supported natively.

That is, it’s the worst case scenario atomic primitive while atom.global.add.u32/s32/f32/u64 are not.

Maybe there is a better way to do it but your code looks good to me!