Taking apart global atomics: performance, graphs, theories

I’ll take your word for that…

This thread is very interesting! Anyone seen any updated reports for Fermi performance with atomics?

These aren’t solid numbers, but rather general impressions from use, so take them with a grain of salt.

  • red instructions are very, very fast when not under load; they are effectively fire-and-forget, so the warp just issues them and moves on unless the memory units are saturated
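
    A hedged illustration of the difference: in CUDA C there is no separate red intrinsic, but in my experience the compiler picks red vs. atom based on whether the atomic's return value is consumed (this is a compiler behavior I have observed, not something the docs guarantee):

    ```cuda
    // Sketch: which instruction the compiler emits seems to depend on
    // whether the atomic's return value is used (observed behavior,
    // not a documented guarantee).
    __global__ void histo(unsigned int *bins, const unsigned int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Return value discarded: the compiler can emit RED
            // (fire-and-forget reduction), which doesn't stall the warp.
            atomicAdd(&bins[data[i] & 255], 1u);
        }
    }

    __global__ void ticket(unsigned int *counter, unsigned int *out)
    {
        // Return value used: this has to be a full ATOM, and the warp
        // waits for the round trip to the memory controller.
        unsigned int my_ticket = atomicAdd(counter, 1u);
        out[blockIdx.x * blockDim.x + threadIdx.x] = my_ticket;
    }
    ```

    You can check which one you got by disassembling the cubin.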

  • Locking seems to happen per-cache-line in L2 (though it could be that a controller only has one atomic execution unit that only ever processes things serially, eliminating the need for locks; in that case, pipeline depth on that unit could explain this behavior)

  • Requests for the same cache line are serialized

  • Queues seem to be at the controllers, and are reasonably deep; under load, this slows atom way down, since a request has to wait behind the entire queue ahead of it

  • Taken together, having a warp issue atom to coalesced addresses (i.e. the same cache line) seems like a very bad idea, since the requests all get serialized to one controller, and that controller’s queue fills up with non-parallel instructions. The average overall time required for red under load seems pretty much invariant to access pattern, although I expect individual times vary a lot.
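
    One way to act on this (a sketch under my assumptions, not a measured recipe): rather than having every thread fire an atomic at the same global word, pre-reduce per block in shared memory so only one global atomic per block ever reaches the controller queue:

    ```cuda
    __global__ void count_matches(const int *data, int n, int key,
                                  unsigned int *global_count)
    {
        __shared__ unsigned int block_count;
        if (threadIdx.x == 0) block_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] == key)
            atomicAdd(&block_count, 1u);  // shared-memory atomic: stays on-chip
        __syncthreads();

        // One global reduction per block instead of one per matching thread,
        // so the controller's queue sees far fewer serialized requests.
        if (threadIdx.x == 0)
            atomicAdd(global_count, block_count);
    }
    ```

    The shared-memory atomics still serialize within the block, but they never touch the L2 controller queues that the global version would be fighting over.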

  • Under load with cache thrash, all reduction types take about the same time (i.e. there is at least enough execution hardware at the controllers to keep up with memory bandwidth)

  • Shared-memory atomics work via LDSLK (load shared and lock) and STSUL (store shared and unlock), inside a loop that spins until all threads have successfully completed the transaction
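
    In SASS-flavored pseudocode, the loop generated for a shared-memory atomicAdd looks roughly like this (reconstructed from disassembly impressions, so treat register names and predicate details as approximate):

    ```
    retry:
        LDSLK  P0, R0, [Raddr]   // load shared word, try to take its lock
        @P0 IADD   R0, R0, Rval  // the actual reduction op, if we hold the lock
        @P0 STSUL  [Raddr], R0   // store and unlock
        @!P0 BRA retry           // lanes that lost the lock race go around again
    ```

    The whole warp re-executes the loop until every lane has won its lock once, which is why contended shared-memory atomics cost roughly one trip per conflicting lane.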

  • Shmem locks seem to key off only a subset of the address bits, so for example a lock for absolute shmem address 0 might also lock address 4096, 8192, etc.
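
    If the lock really is selected from the low address bits, a hypothetical lock-index function consistent with that aliasing would be something like the following (the 4096-byte period and lock count are my guesses from the behavior above, not documented constants):

    ```cuda
    // Hypothetical: the lock is chosen from address bits [2,12), so byte
    // addresses 4096 apart contend for the same lock. Purely illustrative.
    __device__ __forceinline__ unsigned int shmem_lock_index(unsigned int byte_addr)
    {
        return (byte_addr >> 2) & 1023;  // 1024 word locks, repeating every 4096 B
    }
    ```

    Under this model, shmem_lock_index(0) == shmem_lock_index(4096) == shmem_lock_index(8192), which would explain the false conflicts.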

  • If you’re really, really reckless, monkey-patching a cubin file to replace ordinary LDS and STS instructions with LDSLK and STSUL allows you to build your own very complicated shmem atomic primitives

I’ll do another real analysis when Kepler comes out, so that I can be less misinformed than I probably am now.

monkey-patching - I like that word.