atomicAdd with float2: no API support, workarounds?

Wanted to bump this thread, as the need for an atomicAdd on an int2 or float2 has come up on a number of projects.

Is there a more efficient method for an atomicAdd on a 64-bit quantity than splitting it into two 32-bit atomicAdd operations on the .x and .y components?
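
(To be concrete, by “splitting” I mean something like this sketch; the helper name is just for illustration:)

__device__ void atomicAddFloat2Split(float2 *addr, float2 val)
{
    // Each component is updated atomically on its own, but the pair as a
    // whole is NOT updated as a single atomic operation.
    atomicAdd(&addr->x, val.x);
    atomicAdd(&addr->y, val.y);
}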

Doing two 32-bit atomic adds in sequence is not the same as atomically updating a float2 quantity. In some contexts the difference may not matter, but I can certainly imagine programming contexts where it does.

If you require a 64-bit atomic update, then a custom atomic built on atomicCAS is probably the only solution:

cuda - How can I implement a custom atomic function involving several variables? - Stack Overflow

It’s certainly not “efficient”.
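
For illustration, here is a minimal sketch of that kind of custom atomic for a float2, along the lines of the linked answer (the pack/unpack/atomicAddFloat2 helper names are mine, not an official API): the pair is packed into one 64-bit word and updated with an atomicCAS retry loop.

__device__ __forceinline__ unsigned long long pack(float2 v)
{
    // Bit-copy the two floats into the low and high halves of a 64-bit word.
    unsigned long long lo = __float_as_uint(v.x);
    unsigned long long hi = __float_as_uint(v.y);
    return (hi << 32) | lo;
}

__device__ __forceinline__ float2 unpack(unsigned long long u)
{
    return make_float2(__uint_as_float((unsigned int)(u & 0xffffffffULL)),
                       __uint_as_float((unsigned int)(u >> 32)));
}

// Atomically add 'val' to the float2 at 'addr' as one 64-bit
// read-modify-write. 'addr' must be 8-byte aligned (a plain float2 is).
__device__ float2 atomicAddFloat2(float2 *addr, float2 val)
{
    unsigned long long *p = reinterpret_cast<unsigned long long *>(addr);
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        float2 cur = unpack(assumed);
        // Retry until no other thread has modified the location in between.
        old = atomicCAS(p, assumed,
                        pack(make_float2(cur.x + val.x, cur.y + val.y)));
    } while (old != assumed);
    return unpack(old);   // value in memory before this thread's add
}

You would call it from a kernel as, for example, atomicAddFloat2(&out[i], make_float2(re, im)). Whether this is any faster than two independent 32-bit atomicAdds depends on contention, so benchmark it on your own hardware.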

If two independent 32-bit atomic adds will suffice for your use case, I doubt you’re going to find anything “more efficient”. At one point this thread got wrapped around the idea that coalescing is an issue. It might be, but I don’t find that subject documented anywhere. I think a conservative assumption is that there are atomic SFUs in the cache hardware, and that their behavior with respect to synchronicity is undescribed. This recent Stack Overflow posting may be of interest:

caching - Can consecutive CUDA atomic operations on global memory benefit from L2 cache? - Stack Overflow

I did develop a way to do atomicAdd() on a complex type which is indeed faster than doing two separate atomicAdds on two different 32-bit locations. It is only about 30% faster, but it does work correctly.

PM me if anybody is interested. Only tested on a GTX 780 Ti so far, but I assume it should work on Kepler and beyond.

Hello CudaaduC,
Can you tell me how you do atomicAdd() on a complex type?