I am working on a memory-bandwidth- and latency-limited kernel. Using the C++ asynchronous memcpy pipeline mechanism already yields quite a speedup.
Looking at the docs for the underlying PTX instruction here, the hardware seems to be even more capable though, specifically regarding the prefetch to L2. I am also wondering what an example use case of ignore-src is: is it meant to be used for prefetching without loading actual data?
For loading an array, the example pipelines from the Toolkit docs roughly do the following. This is a gross simplification, just to have a mental model:
// pseudocode, not runnable
__global__ void example(const int4* data /* 1024 elements */, int* result) {
    __shared__ int4 s[2][32]; // 2-stage pipeline, one 32-element batch per stage
    // kickstart the pipeline: stage batch 0 into s[0]
    async copy data[0..32] to s[0] (one int4 per thread)
    for (int batch = 0; batch < 32; batch++) {
        // preload the next batch into the other stage
        if (batch + 1 < 32)
            async copy data[(batch+1)*32 .. (batch+2)*32] to s[(batch+1) % 2]
        // wait for the previous copy to land, then compute on the current batch
        sync
        compute(s[batch % 2], result)
    }
}
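In real code, the mechanism I mean is the libcudacxx pipeline. A minimal sketch of the same two-stage scheme (with `compute()` standing in as a placeholder for the actual per-batch work) would look like:

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Sketch of the two-stage pipeline with cuda::memcpy_async; compute() is a
// placeholder for the real per-batch work, not part of any API.
__global__ void example(const int4* data, int* result) {
    constexpr int kBatches = 32;
    __shared__ int4 s[2][32];

    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto pipe = cuda::make_pipeline(block, &state);

    // Kickstart: stage in batch 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, s[0], data, sizeof(int4) * 32, pipe);
    pipe.producer_commit();

    for (int batch = 0; batch < kBatches; ++batch) {
        if (batch + 1 < kBatches) {
            // Preload the next batch into the other stage.
            pipe.producer_acquire();
            cuda::memcpy_async(block, s[(batch + 1) % 2],
                               data + (batch + 1) * 32, sizeof(int4) * 32, pipe);
            pipe.producer_commit();
        }
        // Wait for the oldest outstanding copy, compute, release the stage.
        pipe.consumer_wait();
        compute(s[batch % 2], result);
        pipe.consumer_release();
    }
}
```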
Now my naive interpretation of the PTX instruction cp.async.cg.shared.global would be that we can tell the instruction to also prefetch some additional data into L2. The above implementation already loads 16 * 32 = 512 bytes per batch, so a 256-byte prefetch attached to the loads of the first 16 threads would only cover data that is being fetched anyway.
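Concretely, I mean the .level::prefetch_size qualifier. As I read the ISA docs, something like the following would ask the 16-byte copy to additionally prefetch 256 bytes into L2 (inline-asm sketch, untested, function name is mine):

```cpp
// Sketch: a regular 16-byte cp.async that additionally hints the hardware to
// prefetch 256 bytes into L2 via the .L2::256B prefetch_size qualifier.
__device__ void copy16_with_l2_prefetch(void* smem_dst, const void* gmem_src) {
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile(
        "cp.async.cg.shared.global.L2::256B [%0], [%1], 16;\n"
        :: "r"(dst), "l"(gmem_src));
}
```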
Q1: Maybe this is where ignore-src comes into play: we could append two more async copies for the next-next batch after the existing one, each ignoring its source and prefetching 256 bytes into L2 (512 bytes in total, containing the next-next batch). This would require "victim buffers" in shared memory to absorb the zero values produced by the loads, which is not exactly elegant. Is that the intended use of ignore-src?
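For concreteness, my reading of the ISA is that ignore-src is an optional predicate operand: when it is true, the source read is suppressed and zeros are written to the destination. The victim-buffer idea would then look roughly like this (inline-asm sketch; `victim_smem` is a hypothetical scratch buffer, and whether the L2 prefetch is still performed when the source is ignored is exactly what I am unsure about):

```cpp
// Sketch only: a cp.async whose source read is suppressed (ignore-src set to
// true), while still carrying the .L2::256B prefetch qualifier. victim_smem is
// a hypothetical shared-memory scratch buffer that absorbs the zero-fill;
// gmem_src points into the next-next batch we want prefetched.
__device__ void prefetch_only(void* victim_smem, const void* gmem_src) {
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(victim_smem));
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, 1, 0;\n"  // p = true -> ignore the source
        "  cp.async.cg.shared.global.L2::256B [%0], [%1], 16, p;\n"
        "}\n"
        :: "r"(dst), "l"(gmem_src));
}
```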
Q2: Alternatively, we could access data in a strided fashion, such that the addresses touched within each batch are widely spread and subsequent batches load from the respective neighboring addresses. Then the L2 prefetch would do its magic. Is that idea right?
Apart from that, I am also interested in any other mechanisms that could help.