atomicExch(float) does not work - is it a bug?

I’m trying to add my own version of atomicAdd with floats on global memory.

/* Off the topic - it’s a bit weird that this intrinsic is not available out of the box, but anyway */

In the following, all threads on the card attempt to access the same global memory location to add 1.0f (essentially, it should count the number of threads invoked).

__device__ void myAtomicAdd(float* addr, float data){

        float a=*addr;   /* read the current value */

        float b=0;

        /* body reconstructed from context: exchange in the new sum and retry
           until nobody changed the value under us -- this read-then-exchange
           sequence is racy, as discussed below */
        while (b != a) {
                b = a;
                a = atomicExch(addr, b + data);
        }
}

__global__ void testKernel(float* g_odata)
{
        /* every thread adds 1.0f to the same location */
        myAtomicAdd(g_odata, 1.0f);
}

void run(){

    /* allocate h_odata/d_odata and initialize them with zero */
    float* h_odata = (float*)malloc(memSize);
    float* d_odata = 0;
    memset(h_odata, 0, memSize);
    cudaMalloc((void**)&d_odata, memSize);
    cudaMemcpy(d_odata, h_odata, memSize, cudaMemcpyHostToDevice);

    testKernel<<<numBlocks, numThreads>>>(d_odata);

    cudaMemcpy(h_odata, d_odata, memSize, cudaMemcpyDeviceToHost);

    printf("result: %f expected: %d\n", *h_odata, numBlocks*numThreads);
}



It works correctly for numBlocks=1, numThreads<512.

For 512 threads it gets stuck completely.

For numBlocks>1 it gives incorrect results (less than it should).

I’m using a G280 card, compute capability 1.3.

It fails because you have a race condition. In particular, you read the memory but then try to swap it out… what if two threads both read that same initial value and then both start swapping? Both threads would try to add to that original value, so you get doubled or lost contributions.

You might use the approach of swapping in a 0, doing any adds, and swapping back, repeating until your swap returns a 0. This means there’s no way for an addition to be “lost” to any races, since at all times there’s a clear owner of any particular value (either the memory slot or the thread(s) doing the swaps).

So I haven’t compiled it but try this:

__device__ void myAtomicAdd(float* addr, float data){

        while (data) data = atomicExch(addr, data + atomicExch(addr, 0.0f));

}
The idea of this is we swap a 0.0 into the slot and grab whatever’s there, do our add, and swap the result back. If we get a 0.0, we’re done. If we DIDN’T get a 0.0 the second time, it means we’ve grabbed some other thread’s partial contribution, and now we “own” it and are responsible for adding it back… so we just use a while() loop to keep trying.

The compiler might not like the “while (data)” comparison since it’s always dangerous to compare a float with a specific value like 0.0f, but here it’s appropriate, so ignore any compiler warnings.

The reason the GPU (and most CPUs too!) doesn’t provide atomic float addition is that float addition can’t be done in a single clock with no latency, so it’d be very hard for the memory controller to somehow “hold” a memory location open for multiple ticks, waiting for the result. For an exchange or integer add, the memory can be read, the op performed, and the result written all in one go. It MAY even be that the integer addition is done by the memory controller and not the ALU, useful for array index indirection computes.

Ouch, how embarrassing… :ph34r: I had CAS in mind, not exchange… Of course you are right.

Your solution would not work either…

Think of 3 threads with the following interleaving:

  1. T1: x1=swap(0); x1++
  2. T2: x2=swap(0); x2++
  3. T1: swap(x1)==0
  4. T3: x3=swap(0)
  5. T2: swap(x2)==0

In this sequence after step 1 T1 has the right value incremented. After step 2 T2 has 0 incremented to 1. At step 3 T1 successfully returns the new value to memory, getting 0 after the swap. After step 4 T3 puts 0 again. And at this moment T2 puts 1 into memory, getting 0 back - everyone is happy, but the counter is 1 now.

No, thread 3 has not finished. There are no guarantees about the intermediate state; only the end result is guaranteed to be correct.

Of course, but thread #2 has already finished, and it set an incorrect result… What I’m saying is that the result of the sum will not be OLDVALUE+3 as required, but merely 2, since thread 3 will indeed fail to update, will reread 1 from memory, and happily increase it to 2. 2 != 3.

I think you completely missed the trick of all this.

Thread 3 will swap the x3 it increased (which then is OLDVALUE+2, one increase from thread 1 and one from thread 3) with the memory value 1.

It then notices it has a value != 0 (namely the one from thread 2) and thus something is wrong, thus it “reads” again the value in memory (the OLDVALUE+2) and adds to that the value it holds (the 1 from thread 2), giving OLDVALUE+3 which it stores.

The “loop invariant” of this code is that the sum of all variables (the individual thread variables plus the one stored in memory) is always correct.

This is relatively simple to see by writing it in a way where this is obviously true in each line:

while (data) {

        data += atomicExch(addr, 0.0f);   /* move the global value into our local sum */

        data = atomicExch(addr, data);    /* swap our local sum back into global */

}
The first line in the loop basically “moves” the data from global to local, and the second swaps local and global - neither modifies the sum of the two.

From a proving standpoint, the tricky thing is showing that this is certain to terminate eventually.

True, I didn’t get the point on the first attempt, indeed :"> .

I should confess I’m surprised that it works.

Intuitively, atomicExch is a relatively weak operation… I would use CAS instead, but it’s not available.

Do you have any formal proof of this thing?

Reimar implied the key assertion when he mentioned “invariants”. That’s how you can analyze both correctness and guaranteed completion.

During an exchange between any thread and the memory location, the sum of all the “pending” thread values and the memory is invariant. A swap will never lose a value or double one; you’re just exchanging. And the sum step loses no information either. So we know we always get the correct answer.

So all we have to do is see that the repeated sum-and-exchange loop will not turn into an endless loop. This can be seen by thinking of the number of threads “holding” a value they’re trying to add. Does the number of pending values always eventually decrease? Yes.

The addition computation is the step that combines two of the values. If two non-zero values are added, then the net number of “pending” values summed over all threads is decreased by one. If we get a zero and add to it, our pending value count is unchanged.

Therefore any potential infinite loop must always have every active thread get a 0.0 during that inner swap step, every time. Can this happen? No, because at least one thread read the original nonzero memory value, and THAT thread will indeed successfully add two nonzero values and therefore monotonically decrease the net number of pending additions. So we start with N threads, N+1 pending values, and that combination will reduce us down to N threads and N pending values.

If the ORIGINAL variable was 0.0 it seems like you might have a problem, but in that case you’re also fine since you still end up with N threads and N pending values.

Hi, this looks like a really smart trick. Is there a reference to it? Thanks!