I hate nagging questions, so I finally went back to this problem to see what was up with write speeds, especially with Christian’s shared memory variant.
In particular, my previous experiment in this thread measured various write strategies for global memory, but we’re all expecting shared memory to be faster, and it may have different write-granularity behavior (does writing a byte affect its neighbor bytes in a race?).
So I repeated my tests using shared memory writes. I compared all the methods for shared memory writing, then used the same test set to measure global memory methods (with one block running, so minimal bandwidth issues), and finally a “busy global” case, where many blocks are writing at once so bandwidth matters.
Here clock() counts are useful and keep measurements quite accurate and repeatable. [Thanks for including that, NV guys!]
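For reference, the timing loop is roughly of this shape. This is just a sketch of the technique, not my actual test code; the kernel name, buffer size, and iteration count are my own illustrative choices:

```cuda
// Sketch of a clock()-based timing loop (illustrative, not the real harness).
__global__ void timeSharedByteWrites(unsigned char *out, unsigned int *elapsed)
{
    __shared__ unsigned char buf[4096];

    unsigned int start = clock();            // per-SM cycle counter
    for (int i = 0; i < 1000; ++i) {
        // spread writes pseudo-randomly over the 4K buffer
        buf[(threadIdx.x * 37 + i * 61) & 4095] = (unsigned char)i;
    }
    __syncthreads();
    if (threadIdx.x == 0)
        *elapsed = clock() - start;          // cycles for the whole loop

    // read the buffer back so the compiler can't optimize the writes away
    out[threadIdx.x] = buf[threadIdx.x & 4095];
}
```

Because clock() counts cycles on a single SM, the numbers are repeatable run to run, which is what makes the relative comparisons below meaningful.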
The transaction results for shared memory show that, just as with global memory, writing one byte is always safe even in the presence of simultaneous writes from other threads to neighboring bytes. Writing individual bits is unsafe, which is expected and again matches global memory.
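To make the distinction concrete, here is a rough fragment showing the two write styles being compared (my own illustration of the idea, not the actual test kernels; `myIndex` is a placeholder):

```cuda
__shared__ unsigned char bytes[4096];
__shared__ unsigned int  bits[128];   // 4096 bits packed into 32-bit words

// Byte write: each thread stores its own byte, and the hardware write is
// independent of neighboring bytes, so concurrent neighbor writes are safe.
bytes[myIndex] = 1;

// Bit write: no bit-granularity store exists, so this compiles to a
// read-modify-write on the containing word. Two threads setting different
// bits of the same word can race and lose each other's update.
bits[myIndex >> 5] |= 1u << (myIndex & 31);
```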
Timing-wise, there are some pleasant surprises, mostly the quite respectable speed of global writes compared to shared writes.
Below is timing for a sieve using different write strategies. The absolute times are not important, just the relative speeds between different methods.
Method       KClocks   KClocks   KClocks
             Shared    Global    BusyGlobal
Atomic Set      682       303        4256
Atomic Or       637       863       13845
Write Word      204       302         409
Write Byte      187       271         366
Write Bit       320      1287        2407
R/W Bit         320      1291        2405
Some notes:
Shared memory atomics are cheap, only about 3X the cost of a normal shared write. This is in a test that has constant contention (every thread is writing as fast as it can), but the writes are spread out effectively randomly over 4K addresses. I’m sure speeds would be lower if they were all pounding just a few addresses.
Unlike global atomics, it’s slightly cheaper to use a shared atomic-or than a shared atomic exchange.
Writing one bit (as expected) causes races and is slower, since it must be implemented as a read-mask-write sequence of operations.
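For completeness, the race-free way to set a single bit is to let an atomic do the read-mask-write indivisibly. A one-line sketch (fragment only; `bits` and `myIndex` are placeholder names):

```cuda
// Safe bit set in shared memory via atomicOr (shared atomics need
// compute 1.2+). The atomic makes the read-modify-write on the containing
// word indivisible, so threads hitting different bits of the same word
// can no longer lose each other's updates.
atomicOr(&bits[myIndex >> 5], 1u << (myIndex & 31));
```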
It is surprising that writing to global memory is only about 1.5X slower than writing to shared memory. When we think “shared memory is fast!” we’re thinking of bandwidth, but the actual operation throughput is pretty comparable if you’re not at the bandwidth limit. I really am impressed by the NV engineers on this one; in my head I was expecting a 10X difference or something. Tim Murray once said that global writes are fire-and-forget and therefore have high throughput, and this test shows that’s completely true.
When the bus is not busy, a global atomic exchange has identical throughput to a normal global write! The speed of both normal writes and atomics collapses when the global memory bus is busy and you have a lot of SMs all writing at once, as you can see in the last column. And as you might expect, global atomics are more sensitive to the congestion. So the rough guideline of “go ahead and use atomics sparsely” is well supported.
How do these results affect something like a sieve with its sparse writes? Should you still do some pre-staging in shared memory? The answer is likely yes, it will be faster, but not by some huge order of magnitude. The other takeaway is that even high-contention shared atomics are pretty efficient. I always stayed away from them since doing reductions or compactions with the traditional parallel methods is so ingrained, but maybe I’ll use them more now. Shared atomics are compute 1.2+ only, however, so that lower portability is probably their big negative.
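For the sieve case, the pre-staging pattern I mean is roughly this. A sketch under my own naming, not tested sieve code; the marking step is elided:

```cuda
// Sketch: stage sparse flag writes in shared memory, then flush to global
// in one coalesced pass. Sizes and names are illustrative placeholders.
__global__ void sieveBlock(unsigned char *globalFlags)
{
    __shared__ unsigned char flags[4096];   // one block's slice of the sieve

    // clear the staging buffer cooperatively
    for (int i = threadIdx.x; i < 4096; i += blockDim.x)
        flags[i] = 0;
    __syncthreads();

    // ... sparse, random-order byte writes land in cheap shared memory ...
    // flags[compositeIndex] = 1;

    __syncthreads();
    // one sequential pass writes the slice out to global memory,
    // turning scattered writes into coalesced ones
    for (int i = threadIdx.x; i < 4096; i += blockDim.x)
        globalFlags[blockIdx.x * 4096 + i] = flags[i];
}
```

Given the numbers above, the win from this staging is real but modest when the bus is quiet; it matters most in the “busy global” regime where scattered global writes get expensive.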