About atomicAdd

Will atomicAdd cause serious overhead when there is no other access to the same memory address? For example, if I just replace a single write-back instruction with atomicAdd (while, in reality, the memory address is only used by one thread block), will that be much slower?
In my real case, for some reason, each address in gmem has to be accessed by 2 thread blocks, which store their sum there. I wonder: if the 2 blocks are launched in different waves, will the atomicAdd instruction be a nearly zero-overhead instruction?
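The scenario in the question can be sketched roughly like this (kernel name and the fixed block size of 256 are assumptions for illustration, not from the original post): each of the two blocks reduces its slice to one partial sum, then contributes it to the shared output address with a single atomicAdd per block.

```cuda
// Hypothetical sketch: two blocks accumulate partial sums into the same
// global address. Assumes blockDim.x == 256 (a power of two).
__global__ void accumulate(const float* in, float* out, int n)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // standard shared-memory tree reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // a plain store here would lose the other block's contribution,
    // so atomicAdd is used even though only 2 blocks ever contend
    if (tid == 0) atomicAdd(out, partial[0]);
}
```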

According to this old test (gpu - CUDA atomic operation performance in different scenarios - Stack Overflow), it made a difference in Kepler times even if only a few atomic accesses were made to the same memory address.

The overhead can be latency or bandwidth.

The latency could be small (compared to overall cached global memory access speed), but the bandwidth more limited than the full L2 bandwidth (e.g. due to a limited number of adders).

It could make a difference whether you issue 1, 8, or 32 atomic operations per warp per instruction, and whether those accesses are coalesced or not.
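The three contention patterns mentioned above can be sketched as follows (a hypothetical illustration, not benchmark code):

```cuda
// Illustration of atomic access patterns with different contention.
__global__ void atomics_patterns(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // 32 atomics per warp, all to the SAME address:
    // the adders must serialize every update.
    atomicAdd(&out[0], 1.0f);

    // 32 atomics per warp to CONSECUTIVE addresses:
    // coalesced and spread across L2 slices, no serialization.
    atomicAdd(&out[tid], 1.0f);

    // 1 atomic per warp: reduce within the warp first, then let
    // lane 0 issue a single atomicAdd for all 32 threads.
    float v = 1.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(&out[blockIdx.x], v);
}
```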

If you know you are in different waves, you could replace the atomic operation with a read, compute, and write in the second wave. That could be better or worse than an atomic operation.
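One safe way to get that ordering guarantee (an assumption on my part — the easiest alternative to relying on wave ordering within a single launch) is two separate kernel launches in the same stream, so the second can use plain loads and stores:

```cuda
// Hypothetical read-compute-write alternative using two ordered launches.
__global__ void first_pass(float* out, const float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i];              // plain store, no contention yet
}

__global__ void second_pass(float* out, const float* b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] += b[i];             // non-atomic read-modify-write is safe here,
}                               // because first_pass has already completed

// first_pass<<<grid, block>>>(out, a);
// second_pass<<<grid, block>>>(out, b);  // same stream => runs after first_pass
```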

I don’t understand this well. If I use atomicAdd just before the write-back, will L2 bandwidth still be a big problem?

I also don’t have an approximate idea of how expensive an atomicAdd instruction is compared with a simple gmem access. If 2 atomic instructions are in different waves, will they be significantly faster than when they are in the same wave?

The L2 cache has a certain bandwidth (typically around 2x to perhaps 4x global memory speed).

When using global atomic additions, the computation engines inside the L2 may not be enough to sustain the full L2 bandwidth, if a kernel uses atomic operations for every memory instruction.

So if atomics are used for only some memory operations, the speed can be similar to not using atomic operations at all; when used for many operations, the atomics can suddenly limit the memory bandwidth.

There is not much Nvidia documentation or 3rd-party testing of the speed of atomic operations.

So it is part experience (of the community), part rules-of-thumb, part guess-work.

So, if I just need to load some data from gmem at the start of the thread block, with very little to load during processing, can the use of atomicAdd for the write-back be seen as a nearly zero-overhead choice, since it just affects the L2 cache?

It probably will not be significantly faster to access the same address from different waves. Atomic instructions in current GPU generations are quite fast.

You will have to try out and profile to see the effect on your kernel.

The advantage of having several waves is probably not that the atomic operations get faster, but the other way around: If you do not use atomic writes, but normal writes, and write at the same time (not only in the same wave) from several threads, then the memory sub-system can discard all but one of the writes. With atomic writes it has to process all writes and cannot discard any.

So doing atomic writes (to the same address) in one and the same wave or in different waves should make no difference.

(Except if some memory system pipelines have to be cleared and restarted.)

Especially if the amount you write is small compared to the loads or to the duration of the computations.

Try it out: change your atomicAdds to plain overwriting store instructions.

If the kernel with atomics is significantly slower than this variant, that difference is the effect of the atomic operations.

To see the effect of waves, additionally change the write address so that each write has a different address.
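The suggested experiment could look like this (hypothetical kernels; variants A and B produce different results than the baseline and are for timing comparison only):

```cuda
// Baseline: the original atomic accumulation to one shared address.
__global__ void v_atomic(float* out, float v)
{
    atomicAdd(&out[0], v);
}

// Variant A: plain store to the same address. Concurrent writes may be
// discarded by the memory sub-system, so only the timing is meaningful.
__global__ void v_store(float* out, float v)
{
    out[0] = v;
}

// Variant B: each thread writes to its own address, removing all
// contention, to isolate the effect of the waves themselves.
__global__ void v_spread(float* out, float v)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}
```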