I had thought I had it straight: atomicAdd() and other such operations on addresses in __shared__ memory were faster because they could access the block's own, explicitly allocated L1 memory, perform an atomic operation on it, and keep going. Atomic operations on global memory (GMEM), in contrast, would pull the address into a coherent L2 location and then perform the atomic operation there, where other blocks could also get access to it. And my experience has been that atomic operations on __shared__ memory are about twice as fast, but only about twice as fast, as those on global memory. I also believe I recall reading some official numbers to this effect.
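For context, when I say "twice as fast" I mean a comparison of two kernels along the following lines (a minimal sketch for illustration, not the actual code I timed; the kernel names and iteration count are mine):

// Every thread hammers one counter with atomicAdd(), once through __shared__
// memory and once directly in global memory.  Timing the two kernels, e.g.
// with cudaEvent_t or a profiler, exposes the gap in question.
__global__ void sharedAtomics(int* result, const int iterations) {
  __shared__ int acc;
  if (threadIdx.x == 0) {
    acc = 0;
  }
  __syncthreads();
  for (int i = 0; i < iterations; i++) {
    atomicAdd(&acc, 1);
  }
  __syncthreads();

  // Commit the block's tally so the compiler cannot discard the loop
  if (threadIdx.x == 0) {
    result[blockIdx.x] = acc;
  }
}

__global__ void globalAtomics(int* result, const int iterations) {
  for (int i = 0; i < iterations; i++) {
    atomicAdd(&result[blockIdx.x], 1);
  }
}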
But now I am having a conversation offline with a colleague, and he is explaining to me that all atomic operations are performed in the L2 cache: they bypass L1. Is that correct? If so, is it just the speed of the round trip L1 => L2 and back to L1 that is saving the time, as opposed to GMEM => L2 => GMEM? We have been discussing the following __device__ function, which I wrote last month and have been having some success with:
/// Add a floating point number to two integer accumulators, one intended to absorb the vast
/// majority of the work and the other kept in reserve in case the primary accumulator cannot
/// handle the value by itself.
///
/// \param fval Single-precision floating point number, pre-scaled to the fixed-precision range
/// \param pos Position in each accumulator array at which to contribute the result
/// \param primary The first accumulator, which receives most of the traffic
/// \param overflow The emergency accumulator
__device__ __forceinline__ void splitForceContribution(const float fval, const int pos,
                                                       int* primary, int* overflow) {
  int ival;
  if (fabsf(fval) >= max_int_accumulation_f) {

    // The value is too large for the primary accumulator alone: peel off whole
    // units of max_int_accumulation_f into the overflow accumulator (the
    // float-to-int conversion truncates toward zero) and keep the remainder.
    const int spillover = fval / max_int_accumulation_f;
    ival = __float2int_rn(fval - ((float)(spillover) * max_int_accumulation_f));
    atomicAdd(&overflow[pos], spillover);
  }
  else {
    ival = __float2int_rn(fval);
  }

  // Detect wraparound of the 32-bit primary accumulator: the sign of the sum
  // differs from the sign of the old value even though both addends carried
  // the same sign.  One wrap is worth two units of the overflow accumulator,
  // signed to match the direction of the wrap.
  const int prim_old = atomicAdd(&primary[pos], ival);
  const int prim_old_plus_ival = prim_old + ival;
  if ((prim_old ^ prim_old_plus_ival) < 0 && (prim_old ^ ival) >= 0) {
    atomicAdd(&overflow[pos], (1 - (2 * (ival < 0))) * 2);
  }
}
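For anyone reading along, max_int_accumulation_f is a constant defined elsewhere in my code. The +/-2 correction above only balances if one wrap of the 32-bit primary accumulator (2^32) equals exactly two units of the overflow accumulator, which pins the constant at 2^31. A minimal reconstruction of the surrounding context, with an illustrative kernel and scale factor:

// Implied by the +/-2 correction: the overflow accumulator counts in units of
// 2^31, so one wrap of the 32-bit primary adds or removes exactly two units.
constexpr float max_int_accumulation_f = 2147483648.0f;  // 2^31

// Illustrative call site (names are mine): scale each force component into
// the fixed-precision range, then fold it into the split accumulators.
__global__ void accumulateForces(const float* frc, const float force_scale,
                                 const int n, int* primary, int* overflow) {
  const int pos = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (pos < n) {
    splitForceContribution(frc[pos] * force_scale, pos, primary, overflow);
  }
}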
My colleague has pointed out some further improvements, but the central question I have for this forum is why I am finding this to be faster (more than twice as fast in terms of the bare operation, and also a significant performance boost in the context of an actual kernel calculating forces on particles) than doing int64 atomics (convert long long int to unsigned long long int, then add to a long long int array recast as (unsigned long long int*)). My colleague also mentioned “fire and forget” mechanics in atomics. I had wondered if such a thing existed, as it would be convenient to just heap the instructions into some extra queue and let the thread continue processing. Apparently, these are a thing, and the fact that I am forgoing that advantage (which I would have if I were simply accumulating in int64) in order to check the result of the primary accumulation (to see whether I must add more to the secondary accumulator) makes it even more intriguing that I would see a performance improvement.
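In code, the int64 route I am comparing against would be something like this sketch (my reconstruction of the recipe above; the function name is just for illustration):

// Round the scaled float to 64 bits, then add through the unsigned long long
// overload of atomicAdd().  CUDA offers no signed 64-bit overload, but two's-
// complement wraparound makes the reinterpretation safe.
__device__ __forceinline__ void int64Contribution(const float fval, const int pos,
                                                  long long int* acc) {
  const long long int ival = __float2ll_rn(fval);
  atomicAdd((unsigned long long int*)&acc[pos], (unsigned long long int)ival);
}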
I had implemented this in the interest of conserving __shared__ memory: I intended to do the accumulations in the low 32 bits (which, 99.999% of the time, will be all there is to do) while storing the secondary accumulators out in GMEM, seldom to be brought into play and therefore freeing up L1 cache. But my colleague's observations have made me wonder how I've gotten away with what I did. Given what he says, I can definitely see why he didn't pursue such routes years ago. I seem to be coming up with tricks that others know better than to attempt…