Performance of fire-and-forget atomics vs non-atomic writes

According to The CUDA Handbook: A Comprehensive Guide to GPU Programming by Nicholas Wilt:

At the hardware level, atomics come in two forms: atomic operations that return the value that was at the specified memory location before the operator was performed, and reduction operations that the developer can “fire and forget” at the memory location, ignoring the return value. Since the hardware can perform the operation more efficiently if there is no need to return the old value, [...]
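
As an illustration of the two forms described in the quote, here is a minimal CUDA sketch. The kernel and helper names (atomic_with_return, atomic_fire_and_forget, run_two_forms) are made up for this example and do not come from the book. The only difference between the two kernels is whether the value returned by atomicAdd is used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returning form: the old value is consumed, so it has to travel back to the thread.
__global__ void atomic_with_return(int *counter, int *slots)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int old = atomicAdd(counter, 1);   // return value is used below
    slots[tid] = old;
}

// Reduction form: the return value is discarded ("fire and forget").
__global__ void atomic_fire_and_forget(int *counter)
{
    atomicAdd(counter, 1);             // return value is ignored
}

// Hypothetical helper: runs both kernels once and returns the final counter.
int run_two_forms(int blocks, int threads)
{
    const int n = blocks * threads;
    int *counter = nullptr, *slots = nullptr;
    cudaMalloc(&counter, sizeof(int));
    cudaMalloc(&slots, n * sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    atomic_with_return<<<blocks, threads>>>(counter, slots);
    atomic_fire_and_forget<<<blocks, threads>>>(counter);
    cudaDeviceSynchronize();

    int result = 0;
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(counter);
    cudaFree(slots);
    return result;                     // each thread added 1 in each kernel: 2 * n
}

int main()
{
    const int blocks = 4, threads = 64;
    printf("counter = %d (expected %d)\n",
           run_two_forms(blocks, threads), 2 * blocks * threads);
    return 0;
}
```

Both kernels give the same arithmetic result, so any difference would only show up in the generated machine code. Dumping it with cuobjdump -sass on the compiled binary should show whether the compiler picked a different instruction for the second kernel, although the exact instruction names depend on the GPU architecture.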

Others claim that:

Atomics have “fire-and-forget” semantics. This means that the kernel calls the atomic operation and lets the actual atomic operation be executed by the cache (not on the SM), and the kernel will move on to the next instruction without waiting for the actual atomic operation to complete. This only works if there is no return value from the atomic operation, which is the case in this example. The fire-and-forget semantics let the SM get on with its computations, offloading the computation of the atomic to the cache.

All of this would seem to suggest that there are no apparent delays when using fire-and-forget atomic writes, and that they are possibly as fast as non-atomic writes. I suspect that this is not, and cannot be, true. Is there any information available on the write performance of both types of write operations?
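
One way to check this directly would be a micro-benchmark along the following lines. This is only a sketch under my own assumptions: the kernel names, the time_kernel helper, the array size, and the launch configuration are all made up for illustration, and a single timed launch is a crude measurement. It compares a plain store, a fire-and-forget atomicAdd where every thread hits its own address, and a fire-and-forget atomicAdd where every thread hits the same address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain (non-atomic) store: each thread writes its own element.
__global__ void plain_write(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;
}

// Fire-and-forget atomic without contention: one address per thread.
__global__ void atomic_write(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[i], 1);   // return value ignored
}

// Fire-and-forget atomic with full contention: all threads hit out[0].
__global__ void atomic_contended(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[0], 1);   // return value ignored
}

// Hypothetical helper: times one launch of kernel k with CUDA events (milliseconds).
template <typename Kernel>
float time_kernel(Kernel k, int *out, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<blocks, threads>>>(out, n);          // warm-up launch, not timed
    cudaMemset(out, 0, n * sizeof(int));     // reset the buffer before timing

    cudaEventRecord(start);
    k<<<blocks, threads>>>(out, n);          // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 24;                   // assumed problem size
    int *out = nullptr;
    cudaMalloc(&out, n * sizeof(int));

    printf("plain store:           %.3f ms\n", time_kernel(plain_write, out, n));
    printf("atomic, no contention: %.3f ms\n", time_kernel(atomic_write, out, n));
    printf("atomic, one address:   %.3f ms\n", time_kernel(atomic_contended, out, n));

    cudaFree(out);
    return 0;
}
```

The contended case is included because even if the SM does not wait for a fire-and-forget atomic, updates to a single address presumably still have to be applied one after another, so the throughput could differ a lot from the uncontended case. Averaging over many launches would give more trustworthy numbers than the single timed launch used here.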