Design Pattern for Hiding Global Load Latency

Hi,

Just in case folks aren’t reading the sgemm thread, here’s a handy design pattern that can help performance:

Rather than doing:

loop (0..n) {
  shared = global[curr];
  __syncthreads();
  compute(shared);
  __syncthreads();
}

instead, do the following to trade off register pressure for latency hiding:

if (n > 0) {
  temp = global[first];
}

loop (0..n-1) {
  shared = temp;
  __syncthreads();
  temp = global[next];
  compute(shared);
  __syncthreads();
}

shared = temp;
__syncthreads();
compute(shared);
__syncthreads();

Your success with this approach will depend on the number of registers your program requires: the lower the register pressure, the more likely the compiler is to hoist the prefetch (temp = global[next]) into the compute block, and the higher up it can place it. This means that while the fetch is executing, you stay in the same block of execution and don't need a context switch.
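To make the pattern concrete, here is a minimal CUDA sketch of the same structure. Everything specific in it (the kernel name, the TILE size, the per-block striding used for "first"/"next", and the toy accumulate() standing in for compute()) is my own illustration, assuming one thread per tile element (blockDim.x == TILE), not code from the sgemm thread:

#define TILE 256   // assumes the kernel is launched with blockDim.x == TILE

__device__ float accumulate(const float *tile, float acc)
{
    // Toy stand-in for compute(shared): every thread sums the whole tile.
    for (int i = 0; i < TILE; ++i)
        acc += tile[i];
    return acc;
}

__global__ void prefetch_sum(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    const float *base = in + blockIdx.x * n * TILE;   // this block's stripe of n tiles
    float acc  = 0.0f;
    float temp = 0.0f;

    // "temp = global[first]" -- issue the first load into a register.
    if (n > 0)
        temp = base[threadIdx.x];

    // "loop (0..n-1)" -- each iteration prefetches the next tile before computing.
    for (int t = 0; t < n - 1; ++t) {
        tile[threadIdx.x] = temp;                      // shared = temp
        __syncthreads();
        temp = base[(t + 1) * TILE + threadIdx.x];     // temp = global[next], in flight during compute
        acc = accumulate(tile, acc);                   // compute(shared)
        __syncthreads();
    }

    // Epilogue: consume the last prefetched tile, with no further fetch.
    if (n > 0) {
        tile[threadIdx.x] = temp;
        __syncthreads();
        acc = accumulate(tile, acc);
        __syncthreads();
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

Launched as prefetch_sum<<<num_blocks, TILE>>>(in, out, n), each block walks its own stripe of n tiles; the prefetch into temp is what gives the compiler room to overlap the next global load with accumulate().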

I’ve found this technique is particularly powerful when the number of blocks in your program isn’t terribly well tuned. Instead of needing two or three blocks to hide the global read, the block now hides the read behind its own computation. It is still valuable to have more than one block, but the effect is greatly reduced.

For example, the most highly tuned kernel that Vasily and I have been yapping about gained ~4% from this technique. But when I dusted off one of my poorly tuned kernels, I got a 30% improvement – it was limited to one block.

Regards,

Paul

Interesting. Can you provide some more info? For example, for which threads is the condition “if (n > 0)” not true? What is n? How are first and next defined?

From reading his code, I believe the “if (n>0)” test is just required for correct handling of the case where n might be zero, not necessarily because any particular thread would have a different value for n.

I don’t think first/next matter much; I think the main thing they are doing here is causing the global memory read to go into a register rather than directly to shared memory, and getting better overlap of the global memory read with the other loop operations. In particular, the next global memory read is started before compute() runs, so they are maximizing the global memory latency hiding within the thread’s own computations. The bigger compute() is, the more this trick is likely to benefit performance. The downside is that you need the “temp” register to hold the incoming data, so it costs more registers than the version that goes straight to shared memory. Please let me know if I’ve misunderstood anything here.

Thanks. Since global loads can’t really write directly into shared memory, I’m not sure this would actually cost an extra register.

Mark

That’s a pretty good description, John. I’ve got fix-up code in this example to handle the degenerate case of an empty loop and the last iteration, in order to avoid fetching a memory location that is outside the bounds of the problem. That may be unnecessary in cases where you know that reading a little beyond the bounds of an input array will still land in a valid memory location (for example, if you pad the input array).
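For instance (my sketch, assuming the input array is allocated with at least one extra tile of padding at the end, which the original kernels may or may not do), both pieces of fix-up code can go away, because every out-of-range fetch lands in the padding and its value is simply never consumed:

temp = global[first];     // safe even when n == 0, since the read lands in the padding

loop (0..n) {
  shared = temp;
  __syncthreads();
  temp = global[next];    // on the last iteration this reads into the padding; the value is never used
  compute(shared);
  __syncthreads();
}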

For a good example of this design pattern at work, see vvalkov’s latest sgemm kernel.

Short answer on the registers: in the worst case the code degenerates into roughly the same register pressure as a reasonable implementation of the traditional load/sync/compute/sync model, but if enough registers are free, you get a lot more latency hiding. Long version follows…

Mark’s right about the register: a global read into shared memory always requires an intermediate register. Let’s say we’re doing two global reads into shared memory in the loop. The worst code (which, btw, PTXAS seems to generate more often than I’d like):

r0 <- g[idx0]
s[…] <- r0
r0 <- g[idx1]
s[…] <- r0
sync()
compute stuff…
sync()

The problem is that you have a big latency on the global read, which needs to be masked by another thread/block, followed by another big-latency read, followed by a sync, then a computation. Because of the sync, you need multiple blocks scheduled in order to hide the latency, since by definition no thread within your block will be on the other side of the synchronization barrier.

If two registers are free, then you can change the second load to use r1, and initiate two global loads before blocking on the RAW dependency:

r0 <- g[idx0]
r1 <- g[idx1]
s[…] <- r0
s[…] <- r1
sync()
compute stuff…
sync()

But this still requires good scheduling and available blocks to hide latency. Moving to the suggested form in this post:

s[…] <- r0
s[…] <- r1
sync()
r0 <- g[idx0]
r1 <- g[idx1]
compute stuff…
sync()

Now, the global loads can be fetched while the compute stuff happens, all within the same block of threads. This allows a block to hide its own latency, rather than relying on other blocks. And it helps even if multiple blocks are available, since the length of the RAW stall that is exposed by the block is shorter.

The downside is that the two registers must be free for the entire duration of the loop, whereas in the pre-sync version they are only needed for a few instructions. So the method does increase register pressure. However, in the worst case the two registers are not available, and the compiler will reorder the loads and push them to the bottom of the loop:

s[…] <- r0
s[…] <- r1
sync()
compute stuff…
r0 <- g[idx0]
r1 <- g[idx1]
sync()

This code has exactly the same register pressure as the second version of the code. But by putting the loads on the same side of the sync as the computation, even this may help latency. In theory, some warps could hit the global loads before the sync while other warps still have computation work to do. You’ll still be limited by the last load issued, but there is likely to be less contention in the GPU memory system since the loads are spread out more over time… but that’s just a guess.

Regards,

Paul