Does CUDA atomicAdd support adding three consecutive double-precision values at a time?

Hi everyone!
When I use the atomicAdd function in a kernel to add three consecutive double-precision values (such as A.x, A.y, A.z) to consecutive global memory addresses, a technical question came up.

From a performance optimization perspective, it would be more efficient if a single thread could:

  1. Lock a memory block containing the target address (address + 0)
  2. Coherently update all three parameters (A.x, A.y, A.z)
    (thus requiring only one atomic operation instead of three).
    If this approach is valid, the atomic instruction overhead could theoretically be reduced by two-thirds.

Key Question:

Is there a hardware-supported mechanism or programming technique to achieve such batched atomic updates?

Current Reality:
To add A.x, A.y, and A.z to global memory, the standard (and perhaps only) method is:

atomicAdd(address + 0,  A.x);
atomicAdd(address + 1,  A.y);
atomicAdd(address + 2,  A.z);

Any insights from experts would be greatly appreciated!

The basic atomics do not support updates of more than 8 bytes per thread per instruction, or, in a few specific instances, 16 bytes per thread (with caveats). Directly updating three double locations is not supported by a single instruction/intrinsic, currently.
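
To make that concrete, here is a small illustrative sketch (the kernel and its arguments are placeholders, not code from the question):

__global__ void atomic_limits(double *p, double3 v)
{
    atomicAdd(p, v.x);   // 8-byte atomic add on a double: supported (compute capability 6.0 and later)

    // There is no atomicAdd overload taking a double2 or double3, so a 16-byte or 24-byte
    // update of consecutive doubles cannot be expressed as a single intrinsic here.
}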

The CUDA standard library (libcu++) has various functionality that could be relevant, including additional atomic functionality and other facilities such as semaphores that could be useful for building critical sections. However, at the current time there is no method that I know of to update three double quantities in a single instruction/intrinsic.
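
As a rough illustration only (a minimal sketch, not the demonstration linked below; it assumes a cuda::atomic lock object that lives in global memory and has been zero-initialized before the kernel runs, and a Volta-or-newer GPU with independent thread scheduling):

#include <cuda/atomic>

// Minimal sketch of a critical section built from the CUDA standard library's cuda::atomic.
__device__ void add3_locked(cuda::atomic<int, cuda::thread_scope_device> *lock,
                            double *dst, double3 v)
{
    while (lock->exchange(1, cuda::std::memory_order_acquire) != 0) { }  // spin until acquired
    dst[0] += v.x;   // three ordinary (non-atomic) updates, protected by the lock
    dst[1] += v.y;
    dst[2] += v.z;
    lock->store(0, cuda::std::memory_order_release);                     // release the lock
}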

CUDA has had and currently has a maximum memory access limitation of 16 bytes per thread (independent of atomic usage) in a single HW (SASS) instruction. Therefore expecting to update 24 bytes in a single instruction/intrinsic is probably unreasonable, in any CUDA setting. Multiple steps would be required, even in a critical section. By “steps” here I am referring to what actually happens at the hardware or SASS level.

You can write C++ code that might appear to update 24 bytes at a time i.e. in a single line of code, but this will likely require a decomposition of some sort, that ultimately devolves into multiple steps (i.e. lines of C++ code), even before you reach the SASS level.
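
For example (a sketch; Vec3 is a hypothetical 24-byte struct, not from the original question):

struct Vec3 { double x, y, z; };   // 24 bytes

__global__ void copy_one(Vec3 *dst, const Vec3 *src)
{
    *dst = *src;   // a single line of C++, but the compiler must split it into multiple
                   // hardware transactions (e.g. three 8-byte loads/stores, or a 16-byte
                   // access plus an 8-byte access), since no single SASS access covers 24 bytes
}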

A single thread can do that if you use a critical section methodology, and my previous link here gives a basic demonstration.

However, it is not:

  • “from a performance optimization perspective”
  • atomic in the sense of using HW atomics for the actual update. It is using HW atomics to create the critical section, i.e. section of code that only a single thread can execute “at a time”
  • requiring only one atomic operation (it does require perhaps/possibly only one atomic operation instead of 3, but it also requires “other memory accesses” in the critical section, to complete the update - and in a highly contended setting the critical section method might require many atomic operations, by my read of things)
  • reducing overhead by two-thirds.

On that last point, you would have to benchmark it of course, but I’m fairly confident that the “current reality” (the three separate atomicAdd calls shown above) is going to be faster than a critical-section-based method. Global HW atomics in CUDA on modern GPUs are quite fast, in my experience, such that you can find reports on these forums where people seem to demonstrate that they are as fast as or faster than “equivalent” ordinary memory accesses.
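
If you do want to measure it, a minimal timing sketch with CUDA events might look like this (atomic_version and critical_section_version are hypothetical kernels standing in for the two approaches, and grid, block, dst, src, n are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
atomic_version<<<grid, block>>>(dst, src, n);             // three atomicAdd calls per thread
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_atomic = 0.0f;
cudaEventElapsedTime(&ms_atomic, start, stop);

cudaEventRecord(start);
critical_section_version<<<grid, block>>>(dst, src, n);   // lock-based update of the triple
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_critical = 0.0f;
cudaEventElapsedTime(&ms_critical, start, stop);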

Hi Robert,
Thanks for your advice. I will try the critical section methodology that you mentioned above and compare it with the “current reality”. Once again, thank you for your reply!

In the case that all the threads in your warp update data at the same time, perhaps it would be enough if just one thread locks the critical section and you synchronize the warp at the beginning and at the end of the section.
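
One possible way to realize this idea (a hedged sketch only: here the warp first folds its 32 contributions together with shuffles, so that lane 0 alone enters the critical section on behalf of the warp; the lock variable and its zero-initialization in global memory, and a fully active warp, are assumptions of this sketch):

__device__ void warp_accumulate(int *lock, double *dst, double3 v)
{
    // Fold the warp's 32 (x, y, z) contributions into lane 0 first.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v.x += __shfl_down_sync(0xFFFFFFFF, v.x, offset);
        v.y += __shfl_down_sync(0xFFFFFFFF, v.y, offset);
        v.z += __shfl_down_sync(0xFFFFFFFF, v.z, offset);
    }

    if ((threadIdx.x & 31) == 0) {               // only lane 0 enters the critical section
        while (atomicCAS(lock, 0, 1) != 0) { }   // acquire the lock
        __threadfence();
        dst[0] += v.x;                           // coherent, non-atomic update of the triple
        dst[1] += v.y;
        dst[2] += v.z;
        __threadfence();
        atomicExch(lock, 0);                     // release the lock
    }
    __syncwarp();                                // keep the warp together afterwards
}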

If this were my code, I would look into changing the data representation (possibly only locally around the atomic updates) such that the data to be updated can be updated in a single atomic operation.

In particular, some kind of fixed-point representation might offer sufficient accuracy and dynamic range.
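
A heavily hedged sketch of what that could look like (the 21-bit fields and the scale factor are illustrative assumptions, not part of the suggestion above; it further assumes all contributions are non-negative and that each accumulated component, after scaling, stays below 2^21, so that no field ever carries into its neighbour):

// Pack x, y, z as 21-bit fixed-point fields into one 64-bit word, so that a single
// 64-bit atomicAdd accumulates all three components at once. Overflow of any field
// silently corrupts its neighbour, so value ranges must be bounded and validated.
__device__ unsigned long long pack3(double x, double y, double z)
{
    const double SCALE = 1024.0;   // fixed-point scale (illustrative)
    unsigned long long fx = (unsigned long long)__double2ll_rn(x * SCALE);
    unsigned long long fy = (unsigned long long)__double2ll_rn(y * SCALE);
    unsigned long long fz = (unsigned long long)__double2ll_rn(z * SCALE);
    return (fx << 42) | (fy << 21) | fz;
}

__global__ void accumulate(unsigned long long *acc, const double3 *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(acc, pack3(v[i].x, v[i].y, v[i].z));   // one atomic operation per thread
}

// Decoding (host side or a later kernel): x_sum = double(packed >> 42) / SCALE,
// y_sum = double((packed >> 21) & 0x1FFFFF) / SCALE, z_sum = double(packed & 0x1FFFFF) / SCALE.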

Hi njuffa,
Thanks for your advice.

There is no doubt that this approach is efficient when the data written to the global address space by all threads can be updated through a single atomic operation. However, such data might be accessed by threads from different blocks, making it challenging to exploit data locality. Are there any methodologies to enhance data locality exploitation?

Sorry, I have no solid notion what you are asking about in your follow-up question.

Consider describing in concrete terms what the code in question is doing, and what bottlenecks have been identified in profiler driven analysis.