Where do atomic operations go, and why are atomics to __shared__ faster than those to GMEM?

I thought I had it straight: atomicAdd() and other such operations on addresses in __shared__ memory were faster because they could access the block’s own, explicitly allocated L1 memory, perform the atomic operation there, and keep going. Atomic operations on global memory (GMEM), in contrast, would pull the address into a coherent L2 location and then perform the atomic operation on it, where other blocks could also get access to it. My experience has been that atomic operations to __shared__ are about twice as fast as those to global memory, but only about twice as fast. I also believe I recall reading some official numbers to this effect.
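For what it is worth, the comparison behind that 2x figure is a pair of micro-benchmark kernels along these lines (a sketch only; the kernel names and launch parameters are mine, and I would time each launch with cudaEvent_t markers):

// Sketch of the micro-benchmark I have in mind: many colliding atomicAdd() operations,
// once against a __shared__ counter and once against a counter in global memory.
__global__ void sharedAtomicKernel(int* result, const int iters) {
  __shared__ int acc;
  if (threadIdx.x == 0) {
    acc = 0;
  }
  __syncthreads();
  for (int i = 0; i < iters; i++) {
    atomicAdd(&acc, 1);
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    atomicAdd(result, acc);
  }
}

__global__ void globalAtomicKernel(int* result, const int iters) {
  for (int i = 0; i < iters; i++) {
    atomicAdd(result, 1);
  }
}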

But now, I’m having a conversation offline with a colleague, and he’s explaining to me that all atomic operations are performed in the L2 cache; they bypass the L1. Is that correct? If so, is it just the speed of going L1 => L2 and back to L1 that is saving the time, as opposed to GMEM => L2 => GMEM? We have been discussing the following __device__ function, which I wrote last month and have been having some success with:

/// Add a floating point number to two integer accumulators, one intended to absorb the vast
/// majority of the work and the other kept in reserve in case the primary accumulator cannot
/// handle the value by itself.
///
/// \param fval  Single-precision floating point number, pre-scaled to the fixed-precision range
/// \param pos   Position in each accumulator array to contribute the results
/// \param primary   The first accumulator, which receives most of the traffic
/// \param overflow  The emergency accumulator
__device__ __forceinline__ void splitForceContribution(const float fval, const int pos,
                                                       int* primary, int* overflow) {
  int ival;
  if (fabsf(fval) >= max_int_accumulation_f) {
    // The value is too large for the primary accumulator alone: send whole multiples of the
    // primary range to the overflow counter and keep only the remainder for the primary.
    const int spillover = fval / max_int_accumulation_f;
    ival = __float2int_rn(fval - ((float)(spillover) * max_int_accumulation_f));
    atomicAdd(&overflow[pos], spillover);
  }
  else {
    ival = __float2int_rn(fval);
  }
  const int prim_old = atomicAdd(&primary[pos], ival);
  const int prim_old_plus_ival = prim_old + ival;

  // Detect a signed 32-bit wrap in the primary accumulation: the sum overflowed if prim_old and
  // ival share a sign but the result carries the opposite sign.  Push the wrap into the overflow
  // counter, +2 for a positive wrap and -2 for a negative one (branchless sign selection).
  if ((prim_old ^ prim_old_plus_ival) < 0 && (prim_old ^ ival) >= 0) {
    atomicAdd(&overflow[pos], (1 - (2 * (ival < 0))) * 2);
  }
}

My colleague has pointed out some further improvements, but the central question I have for this forum is why I am finding this to be faster (more than twice as fast in terms of the operation itself, and also a significant performance boost in the context of an actual kernel calculating forces on particles) than doing int64 atomics (convert long long int to unsigned long long int, then add to a long long int array recast as (unsigned long long int*)). My colleague also mentioned “fire and forget” mechanics in atomics. I had wondered if such a thing existed, as it would be convenient to just heap the instructions into some extra queue and let the thread continue processing. Apparently, these are a thing, and the fact that I am forgoing that advantage (which I would have if I were just accumulating in int64) in order to check the result of the primary accumulation (to see whether I must add more to the secondary accumulator) makes it even more intriguing that I would see a performance improvement.
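For reference, the int64 path I am comparing against amounts to something like this (a sketch, with a placeholder name for the function and the accumulator array):

// Sketch of the int64 alternative described above: convert to long long int, then add to the
// 64-bit accumulator array recast as (unsigned long long int*).
__device__ __forceinline__ void int64ForceContribution(const float fval, const int pos,
                                                       long long int* acc) {
  const long long int ival = __float2ll_rn(fval);
  atomicAdd((unsigned long long int*)&acc[pos], (unsigned long long int)ival);
}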

I had implemented this in the interest of conserving __shared__ memory, where I intended to do the accumulations in the low 32 bits (which, 99.999% of the time, will be all there is to do), while storing the secondary accumulators out in GMEM, seldom to be brought into play and therefore freeing up L1 cache. But my colleague’s observations have made me wonder how I’ve gotten away with what I did. Given what he says, I can definitely see why he didn’t pursue such routes years ago. I seem to be coming up with tricks that others know better than to attempt…
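To show the layout I mean, here is a rough outline of the kernel (the block size, array length, and the merge step at the end are placeholders of mine):

// Sketch of the intended design: primary accumulators live in __shared__, overflow accumulators
// stay out in GMEM and are touched only on the rare spill.  Sizes and names are hypothetical.
__global__ void accumulateForces(const float* contrib, const int n_contrib, int* gmem_overflow) {
  __shared__ int sh_primary[1024];
  for (int i = threadIdx.x; i < 1024; i += blockDim.x) {
    sh_primary[i] = 0;
  }
  __syncthreads();
  for (int i = threadIdx.x; i < n_contrib; i += blockDim.x) {
    const int pos = i & 1023;
    splitForceContribution(contrib[i], pos, sh_primary, gmem_overflow);
  }
  __syncthreads();
  // ... merge sh_primary (and any overflow) back into the global accumulators here ...
}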

A shared atomic could not possibly work that way, and a few minutes of thought will probably convince you of this.

A shared atomic by definition affects shared memory and by definition does not affect global memory. Shared memory is a per-SM resource only; it is not a device-wide resource. The L2, based on any memory hierarchy diagram that I have ever seen, is a device-wide resource and caches global (and local) accesses to device memory (that is, off-chip DRAM).

It’s not logical to conclude that shared atomics are resolved in the L2.

Likewise, the L1 is not involved in the shared memory space. From a logical perspective, connecting either the L1 or L2 cache with shared-space accesses makes no sense to me, and there is no documentation to support it as far as I know.

I understand the notion that the physical resource backing shared memory and the physical resource backing the L1 cache may be the “same” resource, but from a logical perspective they are separate. A shared access does not impact the state of global or local memory, and the L1 and L2 act as caches for global and local memory.

Regarding the question about shared memory atomic performance: since at least Maxwell, it’s been no secret that shared atomics can be faster than global atomics. If you’re asking “why?” as in “what is the detailed chip design architecture that indicates this will be a faster path?”, I don’t have that to offer. But shared atomics evidently do not get resolved the same way as global atomics on Maxwell and newer.

In many cases, it will be faster to use shared atomics than an equivalent operation using global atomics, on Maxwell or newer. That is an observation.

(Prior to Maxwell, shared atomics were implemented via a set of locks, controlling what effectively was a “critical section” created by the compiler. These were noticeably slower as indicated in the previously linked blog.)


Aha. If you, Robert, are saying so, I will take that as authoritative (and take your reasoning, tailored for a layman such as myself) back to my colleague. Atomics to __shared__ and GMEM are distinct processes, then, which had been my understanding earlier. I also appreciate the argument about __shared__ and L1 being conceptually distinct, albeit physically located on the same silicon. Accesses to __shared__ and accesses to other values that one can reasonably assume to have become resident in L1 show noticeable differences in performance, and my feeling is that the general L1 space is not sensitive to the memory bank conflicts that can dictate the best use of __shared__. (Feel free to comment here as well, but I have the answers I was looking for in the OP.)
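(To illustrate the bank-conflict point, the contrast I have in mind is below; the array shape is arbitrary, and padding the inner dimension to 33 would remove the conflict in the second store.)

// Hypothetical illustration of __shared__ bank sensitivity: the first store is conflict-free,
// the second is a 32-way bank conflict because the warp strides by 32 floats (one full bank set).
__global__ void bankConflictDemo(float* out) {
  __shared__ float tile[32][32];
  const int t = threadIdx.x;      // assume a single warp of 32 threads for simplicity
  tile[0][t] = (float)t;          // consecutive threads hit consecutive banks
  tile[t][0] = (float)t;          // every thread in the warp hits the same bank
  __syncthreads();
  out[t] = tile[0][t] + tile[t][0];
}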

I am more assured, overall, of the general design I was pursuing. Offline, some other conversations with my colleague have shown me ways to improve it further, although along the lines of philosophy and form rather than performance or applicability (we are breaking down Kahan summation to get at the most elegant way to use two 32-bit integers to accumulate floats in 64-bit fixed precision).

This is the memory hierarchy diagram built into nsight compute:

Now that is reassuring, to see how both host and device memory get imported into the L2 cache. Can the kernel perform atomic operations on memory addresses residing on the host in page-locked, cudaMalloc()'ed memory, at least under the assumption that the memory is not being simultaneously altered by the host? I can’t imagine that the kernel has a way to maintain coherence with host memory in that way, but if it takes that memory into L2 then an atomic operation at that level seems feasible.

You don’t use cudaMalloc for that.

Depending on GPU architecture, atomics on host memory are possible; there are various stipulations and limitations. You can read about them in the atomics section of the programming guide.

Somewhat more flexible atomics are also available in managed memory.

System memory accesses are not necessarily cached in L2 (yes, I understand that is potentially a confusing statement). The cache behavior of system memory accesses is not documented anywhere (that I know of), so my statement is merely an observation based on personal benchmarking exercises.
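For the pinned-memory case, in rough outline it can look like the following (a sketch only; whether the atomic below is valid on a given system depends on the architecture and the stipulations in the programming guide, and the names are placeholders):

// Sketch: an atomicAdd() on pinned, mapped host memory.  Support and semantics depend on the GPU
// architecture and platform; error checking is omitted for brevity.
#include <cstdio>

__global__ void hostAtomicKernel(int* host_counter) {
  atomicAdd(host_counter, 1);
}

int main() {
  int* host_counter;
  cudaHostAlloc(&host_counter, sizeof(int), cudaHostAllocMapped);
  *host_counter = 0;
  hostAtomicKernel<<<1, 32>>>(host_counter);   // with UVA, the host pointer can be passed directly
  cudaDeviceSynchronize();
  printf("counter = %d (expected 32)\n", *host_counter);
  cudaFreeHost(host_counter);
  return 0;
}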

Yikes… I meant cudaHostAlloc()… I have it right in my code!

I once made a program that stored all of its memory in cudaHostAlloc()'d arrays, not realizing that it was on the host and going up and down over the PCIe bus the whole time. The thing was still blazing fast, because the lifetime of each kernel launch was as much as 0.75 seconds, and some NVIDIA dev techs analyzed it to be using as much as 80% of the card’s available FLOPs. Caching was definitely taking place, somewhere!