Atomic float operations, especially add

If there is one thing I’m missing in CUDA it’s atomic float operations!
Or, more specifically, atomic float additions.

A lot of the time my algorithms calculate separate pieces of a result
in a non-regular fashion, and those pieces then need to be added up. Consider photon mapping,
just to throw in an example: all the ray tracing can be done easily in parallel,
but the hit locations are not predictable, so adding them up needs fine-grained
synchronisation. Doing this with integers is possible but a major
bummer, and limiting.

There is probably a way to do hierarchical locks, but I don’t find that very
satisfying considering it could be so much simpler (and more efficient!?).

If I understand the PTX definition correctly, it is prepared for atomic float operations;
only the hardware does not support them yet. So my question is whether atomic
float operations are going to be implemented, and if so, when.

Thank you,

Could you initialize everything to 0, add only where you have a result, and then perform a reduction afterwards?
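A minimal sketch of that suggestion, assuming each input writes its partial result into its own slot of a scratch buffer first (the names `partials` and `blockSums` are invented here for illustration): zero the buffer, let each thread write privately, then sum with a standard tree reduction.

```cuda
// Sketch only: each thread has already written its partial result into
// its own slot of 'partials'; this kernel sums the slots per block with
// a shared-memory tree reduction. Launch again on 'blockSums' (or sum
// it on the host) until a single value remains.
__global__ void reduceSum(const float *partials, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? partials[i] : 0.0f;  // zero-padded load
    __syncthreads();

    // Tree reduction within the block, halving the stride each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

This avoids atomics entirely, at the cost of the extra buffer and the extra reduction passes.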

Atomic operations are usually killing for performance.

How would you deal with FP operations’ lack of associativity?

A+(B+C) != (A+B)+C

As much as I love CUDA, for me, the lack of this feature is the biggest drawback of CUDA.

In some cases, I know in advance that multiple threads will write to the same memory location, and I know precisely which threads will be involved. In that case, I’m okay with manually serializing the summation. However, there are some algorithms (kernel density estimation, for example) where the memory locations we will be writing to are totally data-dependent, and the data is totally unstructured. I am not aware of a general solution to this problem in CUDA. The histogram SDK examples solve a special case of this problem where the output is tiny and can be kept in shared memory. But what if I want to write to a grid of size 512^3, or bigger? The answer is that it’s not possible in that case. Furthermore, the trick used to resolve collisions, encoding the thread index into the data, is a hack. Is it even possible when accumulating floating-point numbers? And are we expected to give up all those bits?

Does it really matter? I would be happy with either A+(B+C) or (A+B)+C. What isn’t acceptable is that without atomic operations I may get A+B or B+C, or just A or just C… OpenGL already deals with this somehow when it does additive blending. Why can’t CUDA do the same?

I wrote a function to do atomicFloatAdd; I hope it works.

__device__ inline void atomicFloatAdd(float *address, float val)
{
    int i_val = __float_as_int(val);
    int tmp0 = 0;
    int tmp1;

    // Keep retrying the compare-and-swap until nobody raced us:
    // if *address still holds tmp0, store our candidate sum; otherwise
    // recompute the sum against the value we actually found.
    while ((tmp1 = atomicCAS((int *)address, tmp0, i_val)) != tmp0)
    {
        tmp0 = tmp1;
        i_val = __float_as_int(val + __int_as_float(tmp1));
    }
}
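For what it’s worth, a caller might look like the sketch below (the histogram-style scatter and the names `values`, `bins`, `grid` are just an illustration, not from the thread):

```cuda
// Illustrative use of the atomicFloatAdd above: many threads
// scatter-add float values into a shared float grid, where the
// target bin of each input is data-dependent.
__global__ void accumulate(const float *values, const int *bins,
                           float *grid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicFloatAdd(&grid[bins[i]], values[i]);
}
```

Note that the loop starts by guessing the old value is 0, so it pays one extra compare-and-swap round trip whenever the target cell is already nonzero.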

I’m glad I don’t have to :D And if you need that precision, I don’t think having atomic operations or not will make a big difference in a solution for making this problem work in a parallel way.

Not sure if I understand you correctly. Do you mean creating a list of intermediate results and taking breaks to join them? No, for the use cases I have in mind that would be a performance and resource killer. Though I can imagine cases where this is good.

Hey, this looks good! I just hope the atomics are not that much of a performance showstopper. If I read the programming guide correctly, an atomic operation only stalls accesses to the very same address. So there is no extra penalty involved?!

Probably want to throw an isnan() check in there as well.

Thanx for sharing!

That atomicCAS-based atomicFloatAdd completely killed my performance for some reason (actually it’s not so hard to see what must have happened): it took 20 times longer than using the integer atomicAdd. I will go back to mapping my floats onto integers (scaling appropriately), adding, and then mapping back later. It’s a nightmare of precision and overflow pitfalls, but still. By the way, the integer atomicAdd is only marginally slower than not updating at all, which is great.

Can anyone point me toward a resource on scaling floats onto ints? I want to try to use the atomic int operations with floats (precision is not a huge deal)…
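A sketch of the fixed-point trick described above (the value of `SCALE` and the kernel names are made up; you have to pick the scale from your own value range and precision needs): multiply by a scale factor, round to int, accumulate with the integer atomicAdd, and divide the final sums back afterwards.

```cuda
// Fixed-point accumulation sketch. SCALE trades range for precision:
// with SCALE = 65536.0f a 32-bit int holds sums up to roughly +/-32768
// before overflowing, with about 1.5e-5 resolution. Choose SCALE to
// fit your data; overflow is silent, so leave headroom.
#define SCALE 65536.0f

__global__ void accumulateFixed(const float *values, const int *bins,
                                int *grid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int fixedVal = (int)rintf(values[i] * SCALE);  // float -> fixed point
        atomicAdd(&grid[bins[i]], fixedVal);           // integer atomic add
    }
}

// Afterwards (on the host or in a trivial kernel):
//     result[j] = grid[j] / SCALE;
```

This keeps the fast integer atomicAdd on the hot path; the precision and overflow pitfalls mentioned above are exactly the choice of SCALE.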

It’s going to depend on your problem, but can you change the algorithm? When faced with a similar need for atomic addition, I was able to get around it by changing the algorithm from a scatter to a gather (albeit not without some pain).
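To make the scatter-to-gather idea concrete, here is a sketch under one assumption: that you can enumerate, for each output cell, which inputs touch it (e.g. via a contributor list pre-sorted by cell; the names `contrib`, `cellStart`, `cellCount` are invented here). Instead of each input atomically adding into its cell, each thread owns one output cell and sums its contributors privately, so no atomics are needed.

```cuda
// Gather formulation: one thread per output cell, no atomics.
// 'contrib' holds the input values pre-sorted by output cell
// (built beforehand, e.g. with a sort pass); cellStart/cellCount
// say where each cell's contributors live in that list.
__global__ void gatherSum(const float *contrib, const int *cellStart,
                          const int *cellCount, float *out, int numCells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < numCells) {
        float sum = 0.0f;
        int start = cellStart[c];
        for (int k = 0; k < cellCount[c]; ++k)
            sum += contrib[start + k];   // private, uncontended reads
        out[c] = sum;                    // single uncontended write
    }
}
```

The pain moves into building the sorted contributor list, but the accumulation itself becomes contention-free and deterministic.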