PTX cp.async: purpose of ignore-src and prefetch_size

I am working on a memory-bandwidth- and latency-limited kernel. Using the C++ async memcpy pipeline mechanism already yields quite a speedup.

Looking at the docs for the underlying PTX instruction here, the hardware seems to be even more capable, specifically regarding prefetching into L2. I am also wondering what an example use case for ignore-src would be. Is it meant to be used for prefetching without loading actual data?

For loading an array, the example pipelines from the Toolkit docs roughly do the following. This is a gross simplification, just to have a mental model:

// simplified sketch using the primitives from <cuda_pipeline.h>;
// compute() is an application-specific placeholder
#include <cuda_pipeline.h>

__device__ void compute(const int4* batch, int* result);  // placeholder

__global__ void example(const int4* data /* 1024 elements */, int* result) {
    __shared__ int4 s[2][32];      // 2-stage pipeline, one batch per stage
    const int tid = threadIdx.x;   // launched with 32 threads

    // kickstart the pipeline: batch 0 goes into stage 0
    __pipeline_memcpy_async(&s[0][tid], &data[tid], sizeof(int4));
    __pipeline_commit();

    for (int batch = 0; batch < 32; batch++) {
        // preload the next batch into the other stage
        if (batch + 1 < 32)
            __pipeline_memcpy_async(&s[(batch + 1) % 2][tid],
                                    &data[(batch + 1) * 32 + tid], sizeof(int4));
        __pipeline_commit();

        // wait until the batch preloaded one iteration ago has arrived
        __pipeline_wait_prior(1);
        __syncthreads();
        compute(s[batch % 2], result);
        __syncthreads();
    }
}

Now my naive interpretation of the PTX instruction cp.async.cg.shared.global would be that we can tell the instruction to also prefetch data into L2. The above implementation already loads 16 * 32 = 512 contiguous bytes per batch, so the regions prefetched by the first 16 threads would already be fetched in full anyway.
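
For reference, a minimal inline-PTX sketch of such a copy (the wrapper name is mine; per the PTX ISA, the .L2::256B qualifier is a hint to fetch an additional 256 bytes around src into L2):

__device__ void cp_async16_l2_prefetch(void* smem_dst, const void* gmem_src) {
    // 16-byte cp.async that bypasses L1 (.cg) and carries a 256B L2 prefetch hint
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global.L2::256B [%0], [%1], 16;"
                 :: "r"(dst), "l"(gmem_src) : "memory");
}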

Q1: Maybe this is where ignore-src comes into play: we could place two more async copies for the next-next batch after the existing one, which ignore the source and each prefetch 256 bytes into L2 (512 bytes in total, containing the next-next batch). This would require “victim buffers” to absorb the zero values produced by the loads, which is not very elegant. Is that the intended use for ignore-src?

Q2: Alternatively, we could access data in a strided fashion, such that the addresses accessed within each batch are spread widely and subsequent batches load from the respective neighboring addresses. Then the L2 prefetch could work its magic. Is that idea right?
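
To make Q2 concrete, a hypothetical index scheme (my own helper, not from the Toolkit docs): if thread t reads data[t * 32 + batch] instead of data[batch * 32 + t], each thread walks a contiguous run of addresses across batches, so a prefetch issued for one batch would cover the lines the following batches need:

__device__ const int4* strided_src(const int4* data, int batch, int tid) {
    // batch-major -> thread-major indexing: consecutive batches of a given
    // thread touch neighboring addresses, which is exactly the region an
    // L2 prefetch of the surrounding bytes would warm up
    return &data[tid * 32 + batch];
}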

Apart from that, I am also interested in any other mechanisms that could help.

The description given for ignore-src is as follows:

The optional and non-immediate predicate argument ignore-src specifies whether the data from the source location src should be ignored completely. If the source data is ignored then zeros will be copied to destination dst. If the argument ignore-src is not specified then it defaults to False.

Shared memory generally needs to be initialized before it can be used. The “allocation” of shared memory does not initialize it. This gives the CUDA programmer an option to initialize shared memory (to zero) without using foreground cycles to do so.
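
As an illustration of that reading, a minimal sketch (the wrapper name is mine; the predicate form follows the cp.async syntax quoted above). Called with ignore_src = true, this writes 16 zero bytes to shared memory without global memory ever being read:

__device__ void cp_async16(void* smem_dst, const void* gmem_src, bool ignore_src) {
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile(
        "{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.b32 p, %2, 0;\n\t"                        // p = (ignore_src != 0)
        "cp.async.cg.shared.global [%0], [%1], 16, p;\n\t"
        "}"
        :: "r"(dst), "l"(gmem_src), "r"((int)ignore_src) : "memory");
}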

Thank you for the clarification. I had misunderstood the wording, assuming that the data path was still taken, just with different values:
“If the source data is ignored then zeros will be copied to destination dst”

Would the following description be also accurate?
“If the source data is ignored then the destination dst will be set to zeros, and neither access from nor prefetching to L1 or L2 will occur.”

Since you did not further comment on the L2 prefetch logic I outlined in Q2, can I assume that it’s sensible?

I don’t really understand the remainder of your post.

I was curious about the L2 prefetching mechanism exposed in PTX. In my original post I wrongly assumed it was specific to memcpy_async. Since then, I have found this NVIDIA devblog entry, which mentions L2 prefetch just before the article summary. It turns out that L2 prefetching is also available for normal loads.
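
For example, a hypothetical wrapper for such a load (the name is mine; the PTX ISA lists .L2::64B, .L2::128B and .L2::256B as the allowed prefetch sizes for plain ld.global as well):

__device__ int load_int_l2_prefetch(const int* gmem_src) {
    int v;
    // ordinary 32-bit global load, plus a hint to prefetch 128 bytes into L2
    asm volatile("ld.global.L2::128B.b32 %0, [%1];"
                 : "=r"(v) : "l"(gmem_src));
    return v;
}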

Is there any further information available, such as example usage and the expected performance benefits of L2 prefetching? E.g. the performance benefits of memcpy_async are outlined in fig. 14 of the best practices guide.

The programming and best practices guides provide information about the persistent L2 set-aside cache, but not about L2 prefetching via PTX, as far as I can tell.

I would generally be interested in whether it is worth playing with these low-level constructs, or whether it is better to leave PTX generation to nvcc.