Atomic float operations, especially add

If there is one thing I'm missing in CUDA, it's atomic float operations!
Or, more specifically, atomic float additions.

A lot of the time I find that my algorithms calculate separate pieces of a result
in an irregular fashion, and those pieces then need to be added up. Consider photon mapping,
just to throw in an example. All the ray tracing can be done easily in parallel;
it's just that the hit locations are not predictable, so adding up their contributions needs fine-grained
synchronization. Doing this with integers is possible, but it's a major
bummer and limiting.

There is probably a way to do hierarchical locks, but I don't find that very
satisfying considering it could be so much simpler (and more efficient!).

If I understand the PTX definition correctly, it is prepared for atomic float operations;
the hardware just does not support them yet. So my question is whether atomic
float operations are going to be implemented, and if so, when.

Thank you,
Philipp

Would it not be possible to initialize everything to 0, add only where you have a result, and then perform a reduction afterwards? A minimal sketch of what I mean is below.
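Something like this per-block shared-memory reduction (a rough sketch; the kernel name is made up, and it assumes blockDim.x is a power of two with one partial result per thread):

// Hypothetical per-block reduction of partial results.
// Launch with sharedMemBytes = blockDim.x * sizeof(float).
__global__ void reducePartials(float *blockSums, const float *partials, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? partials[i] : 0.0f;  // zero-pad the tail
    __syncthreads();
    // Fold the block's values in half each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = s[0];  // one partial sum per block
}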

Atomic operations are usually killers for performance.

How would you deal with FP operations’ lack of associativity?

A+(B+C) != (A+B)+C
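For example, in single precision (compilable as plain C or with nvcc):

#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* 0 + 1 -> 1.000000 */
    printf("%f\n", a + (b + c)); /* b+c rounds back to -1e8f -> 0.000000 */
    return 0;
}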

As much as I love CUDA, for me, the lack of this feature is the biggest drawback of CUDA.

In some cases, I know in advance when there will be multiple threads writing to the same memory location, and I know precisely which threads will be involved. In that case, I'm okay with manually serializing the summation. However, there are some algorithms (kernel density estimation, for example) where the memory locations we will be writing to are totally data-dependent, and the data is totally unstructured. I am not aware of a general solution to this problem in CUDA. The histogram SDK examples are able to solve a special case of this problem where the output is tiny and can be kept in shared memory. But what if I want to write to a grid of size 512^3, or bigger? The answer is that it's not possible in that case. Furthermore, the trick used to resolve collisions, by encoding the thread index into the data, is a hack. Is it even possible when accumulating floating-point numbers? And are we expected to give up all those bits?
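For reference, that collision trick looks roughly like this (a sketch from memory of the SDK approach, not verbatim): each thread in a warp tags the top 5 bits of a shared-memory counter with its lane index and rewrites until its own tag survives, which proves its increment won.

// Rough sketch of the SDK histogram collision resolution, where
// threadTag would be something like (threadIdx.x & 31U) << 27.
// Five of the 32 counter bits go to the tag, which is exactly
// the "give up all those bits" complaint above.
__device__ void addByte(volatile unsigned int *s_WarpHist,
                        unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[bin] & 0x07FFFFFFU;  // strip the old tag
        count = threadTag | (count + 1);        // increment and re-tag
        s_WarpHist[bin] = count;                // intra-warp race
    } while (s_WarpHist[bin] != count);         // lost the race? retry
}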

Does it really matter? I would be happy with either A+(B+C) or (A+B)+C. What isn't acceptable is that without atomic operations I may get A+B or B+C, or just A or just C … OpenGL deals with this already somehow when it does additive blending. Why can't CUDA do the same?

I wrote a function to do atomicFloatAdd; I hope it works.

__device__ inline void atomicFloatAdd(float *address, float val)
{
    // Reinterpret the float bits as an int so atomicCAS can be used.
    int i_val = __float_as_int(val);
    int tmp0 = 0;  // first guess: *address holds 0.0f
    int tmp1;
    // If the CAS fails, tmp1 holds the actual old value: recompute
    // old + val, make that the new expected/swap pair, and retry.
    while ((tmp1 = atomicCAS((int *)address, tmp0, i_val)) != tmp0)
    {
        tmp0 = tmp1;
        i_val = __float_as_int(val + __int_as_float(tmp1));
    }
}
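For what it's worth, a minimal usage sketch (the kernel name and setup are made up; atomicCAS on global memory needs a compute capability 1.1 device):

// Hypothetical kernel: every thread scatters its partial result
// into a single float accumulator in global memory.
__global__ void accumulate(float *sum, const float *partials, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicFloatAdd(sum, partials[i]);
}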

Great!

I'm glad I don't have to :D And if you need that precision, I don't think having atomic operations or not will make a big difference in making this problem work in a parallel way.

Not sure if I understand you correctly. Do you mean creating a list of intermediate results and taking breaks to join them? No, for the use cases I have in mind that would be a performance and resource killer. Though I can imagine cases where this is good.

Hey, this looks good! I just hope the atomics are not that much of a performance showstopper. If I read the programming guide correctly, an atomic operation only halts accesses to that very same address. So there is no extra penalty involved?!

Probably want to throw in an isnan() there as well.

Thanx for sharing!


That completely killed my performance for some reason (actually, it's not so hard to see what must have happened): it took 20 times longer than using the integer atomicAdd. I will go back to mapping my floats onto integers (scaling appropriately), adding, and then mapping back later. It's a nightmare of precision and overflow pitfalls, but still. By the way, the integer atomicAdd is only marginally slower than not updating at all, which is great.
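The mapping itself is simple; a minimal sketch (SCALE is an arbitrary choice that trades precision against overflow headroom):

// Hypothetical fixed-point accumulator with 16 fractional bits.
#define SCALE 65536.0f

__device__ void atomicFixedAdd(int *acc, float val)
{
    // Round to the nearest fixed-point step, then add atomically.
    atomicAdd(acc, __float2int_rn(val * SCALE));
}

// Read back with: float result = (float)accValue / SCALE;
// Overflow pitfall: the 32-bit sum wraps once its magnitude
// exceeds 2^31 / SCALE = 32768.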

Can anyone point me toward a resource on scaling floats onto ints? I want to try to use the atomic int operations with floats (precision is not a huge deal)…

It’s going to depend on your problem, but can you change the algorithm? When faced with a similar need for atomic addition, I was able to get around it by changing the algorithm from a scatter to a gather (albeit not without some pain).
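To illustrate the idea (all names hypothetical): instead of each input atomically scattering into its output cell, each output cell gathers the inputs that map to it, so every thread owns its own sum and no atomics are needed:

// Hypothetical gather formulation: one thread per output cell.
__global__ void gatherSum(float *out, const float *val,
                          const int *cell, int nIn, int nOut)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nOut)
        return;
    float sum = 0.0f;
    for (int i = 0; i < nIn; ++i)  // brute force; in practice you
        if (cell[i] == c)          // would sort inputs by cell first
            sum += val[i];
    out[c] = sum;
}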