That’s a pretty good description, John. I’ve got fix-up code in this example to handle the degenerate case of an empty loop and the last iteration, in order to prevent executing a fetch on a memory location that is outside the bounds of the problem. That may be unnecessary in cases where you know that reading a little beyond the bounds of an input array will still land in a valid memory location (for example, if you pad the input array).
For a good example of this design pattern at work, see vvolkov’s latest sgemm kernel.
Short answer on the registers: worst case is the code degenerates into roughly the same register pressure as a reasonable implementation of the traditional load/sync/compute/sync model, but if enough registers are free, you get a lot more latency hiding. Long version follows…
Mark’s right about the register – a global read into shared memory always requires an intermediate register. Let’s say we’re doing two global reads → shared memory in the loop. The worst code (which, btw, PTXAS seems to generate more often than I’d like) looks like this:
r0 ← g[idx0]
s[…] ← r0
r0 ← g[idx1]
s[…] ← r0
sync()
compute stuff…
sync()
The problem is that you have a long-latency global read, which needs to be masked by another thread/block, followed by another long-latency read, followed by a sync, then a computation. Because of the sync, you need multiple blocks to be scheduled in order to hide latency, since by definition no thread within your block will be on the other side of the synchronization barrier.
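To make the pattern concrete, here is a minimal CUDA sketch of this traditional load/sync/compute/sync loop. The kernel name, TILE, g_in/g_out, and the tile-sum computation are all hypothetical stand-ins (not from the codes discussed here), assuming one block of TILE threads and n a multiple of TILE:

```cuda
#define TILE 256

__global__ void naive_loop(const float *g_in, float *g_out, int n)
{
    __shared__ float s[TILE];
    int tid = threadIdx.x;

    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        // Global read -> register -> shared store; the store stalls on the
        // RAW dependency until the global read completes.
        s[tid] = g_in[base + tid];
        __syncthreads();   // no thread in this block can run past here

        for (int i = 0; i < TILE; ++i)   // compute stuff...
            acc += s[i];

        __syncthreads();   // protect s[] before the next iteration's store
    }
    g_out[tid] = acc;
}
```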
If two registers are free, then you can change the second load to use r1, and initiate two global loads before blocking on the RAW dependency:
r0 ← g[idx0]
r1 ← g[idx1]
s[…] ← r0
s[…] ← r1
sync()
compute stuff…
sync()
But this still requires good scheduling and available blocks to hide latency. Moving to the suggested form in this post:
s[…] ← r0
s[…] ← r1
sync()
r0 ← g[idx0]
r1 ← g[idx1]
compute stuff…
sync()
Now, the global loads can be fetched while the compute stuff happens, all within the same block of threads. This allows a block to hide its own latency, rather than relying on other blocks. And it helps even if multiple blocks are available, since the length of the RAW stall that is exposed by the block is shorter.
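The same hypothetical kernel restructured into this prefetch form might look like the sketch below (again, the names and the tile-sum computation are illustrative assumptions, with n a multiple of TILE):

```cuda
#define TILE 256

__global__ void prefetch_loop(const float *g_in, float *g_out, int n)
{
    __shared__ float s[TILE];
    int tid = threadIdx.x;

    // Prime the pipeline: issue the first tile's global load before the loop.
    float r0 = g_in[tid];

    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        s[tid] = r0;       // stage the previously fetched value into shared memory
        __syncthreads();

        // Issue the next tile's global load *before* computing, so its latency
        // overlaps with the compute below; guard the last iteration so we
        // never fetch past the end of the array.
        if (base + TILE + tid < n)
            r0 = g_in[base + TILE + tid];

        for (int i = 0; i < TILE; ++i)   // compute stuff... (overlaps the load)
            acc += s[i];

        __syncthreads();
    }
    g_out[tid] = acc;
}
```

Note that r0 now lives across the whole loop body rather than just a few instructions, which is exactly the register-pressure trade-off discussed next.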
The downside is that two registers must be free for the entire duration of the loop, while in the pre-sync version they are only needed for a few instructions. So the method does increase register pressure. However, the worst case is that the two registers are not available, in which case the compiler will reorder the loads and push them to the bottom of the loop:
s[…] ← r0
s[…] ← r1
sync()
compute stuff…
r0 ← g[idx0]
r1 ← g[idx1]
sync()
This code has exactly the same register pressure as the second version. But by putting the loads on the same side of the sync as the computation, even this may help latency. In theory, some warps could hit the global loads before the sync while other warps still have computation work to do. You’ll still be limited by the last load issued, but there is less likely to be contention in the GPU memory subsystem since the loads are spread out more over time… but that’s just a guess.
Regards,
Paul