atomicAdd(float) does not add very small values

When I use float atomicAdd(float *address, float val) to add a float value smaller than approx. 1e-39 to 0, the addition does not work, and the value at address remains 0.

Here is the simplest code:

__device__ float test[6] = {0};
__global__ void testKernel() {
    float addit = sinf(1e-20);
    atomicAdd(&test[0], addit);
    test[1] += addit;
    addit = sinf(1e-37);
    atomicAdd(&test[2], addit);
    test[3] += addit;
    addit = sinf(1e-40);
    atomicAdd(&test[4], addit);
    test[5] += addit;

When I run the code above as testKernel<<<1, 1>>>(); and stop with the debugger I see:

test    0x42697800
    [0] 9.9999997e-21
    [1] 9.9999997e-21
    [2] 9.9999999e-38
    [3] 9.9999999e-38
    [4] 0            
    [5] 9.9999461e-41

Notice the difference between test[4] and test[5]. Both did the same thing, yet the simple addition worked, and the atomic one did nothing at all. What am I missing here?

P.S. I’ve done the assignment to addit as sinf(value) to prevent possible compiler optimization on these small numbers.

System info:
Intel i7-3820
32GB of RAM
NVidia Titan

Windows 7x64
CUDA 5.5.20
Driver 331.82

Compiled for CC3.5

It looks like you are finding that atomic float operations do not support denormalized floating point numbers. The smallest number that can be represented by a “normal” single precision float is ~1e-38, but denormalized (or “subnormal”) floats can go smaller. Devices with compute capability 2.0 and later can handle denormalized numbers, but it appears that the atomic operations (which happen outside the CUDA cores) still do not.

You are operating on the ragged edge of what single precision numbers can handle. I would consider switching to double precision if possible. :)

Yes, it seems this is the expected behavior, according to the PTX manual:
“atom.add.f32 rounds to nearest even and flushes subnormal inputs and results to sign-preserving zero.”

I have not seen it mentioned in any other document, though. I think it should be mentioned at least in the “Floating-point standard” appendix of the Programming Guide. You may want to file a request for enhancement to update the documentation.

Ah, good catch! I scanned through the CUDA C Programming Guide and found no explanation, but didn’t think to check the PTX manual. This would be a good thing to explain in the Programming Guide.

Would the same restriction apply to shared memory atomics?

Thanks for alerting us to this issue. I am following up with relevant teams to get the documentation (and in particular the CUDA C Programming Guide) clarified with regard to atomic floating-point adds.

Thank you very much!

I have filed a couple bugs to get the documentation clarified. As determined by experiments, the current behavior of atomic single-precision floating-point adds is as follows:

Regardless of the setting of the compiler flag -ftz,
[1] Atomic single-precision floating-point adds on global memory always operate in flush-to-zero mode
[2] Atomic single-precision floating-point adds on shared memory always operate with denormal support