I am attempting to utilize the prefetch PTX instruction, and according to Nsight Compute, the compiled SASS code positions the prefetch(CCTL) before the LDG instruction. I have two questions regarding this:
Is the prefetch(CCTL) instruction asynchronous? Does the Streaming Multiprocessor (SM) wait for the prefetch operation to complete?
If the prefetch is placed before the LDG instruction and they are accessing different memory addresses, as illustrated in the diagram below, will the prefetch operation monopolize the memory bandwidth and consequently cause the LDG instruction to wait until the prefetch operation finishes before it starts reading from global memory?
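For context, here is a minimal sketch of the kind of code in question: a prefetch issued from CUDA C++ via inline PTX, ahead of an ordinary load. The kernel and the prefetch distance of 1024 elements are illustrative assumptions, not the actual code from my application.

```cuda
// Sketch: issuing prefetch.global.L2 via inline PTX (a hint to bring the
// line at ptr into the L2 cache), followed by a normal global load.
__device__ void prefetch_l2(const void *ptr)
{
    asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr));
}

__global__ void scale(const float *__restrict__ in,
                      float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1024 < n)
        prefetch_l2(&in[i + 1024]); // compiles to CCTL in SASS, before the LDG
    if (i < n)
        out[i] = in[i] * 2.0f;      // the LDG the question refers to
}
```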
I cannot speak to your direct questions. However, I have had rather mixed results with the prefetch instructions: in some cases they gave a small performance boost, but sometimes they made performance worse:
- You cannot fully control the positioning of the prefetch instruction in code.
- The compiler and assembler often hoist load operations to the beginning anyway (especially when unrolling loops and enough registers are available). IIRC quite a large number of load requests can be in flight simultaneously.
- With the scheduler switching among enough non-blocked warps, you often effectively get the same latency-hiding result as with manual prefetching.
Nevertheless, I am also waiting to hear answers to your questions.
I’m attempting to use prefetching to mitigate warp stalls caused by Long Scoreboard dependencies, but I’m puzzled. According to NVIDIA Nsight Compute, my L2 cache hit rate has improved by 50%, yet there is barely any reduction in warp stalls.
Pretty much all instructions in a CUDA GPU are asynchronous. The GPU does not “wait” for any of them to complete. However, an instruction can introduce a dependency in a following operation. The prefetch op doesn’t introduce any dependencies that I know of. A subsequent memory read (for example) should get issued independently of the status of a previous prefetch op.
A prefetch op is a hint. What the GPU will do exactly is undefined, AFAIK, intentionally. But if the prefetch op is to have any meaningful significance, it is effectively a global load that does not actually target a register. So the behavior of the prefetch op, because it is undefined, is anywhere from a no-op to the behavior you would get if you issued a global load.
The answer to your question then is that the behavior could vary from no impact to the behavior you would get if you issued a global load followed by another global load (different addresses). The behavior of the memory pipe is not specified to a level of detail to tell you precisely how those two back-to-back global loads to different addresses flow through the pipe, and on top of that it would certainly depend on what else is going on.
Perhaps the warp stalls were not due to global load dependencies. Or perhaps the warp stall reasons have shifted from a global load dependency to something else. Or perhaps the difference in latency (roughly 400 cycles, estimated, for a fetch from DRAM vs. roughly 200 cycles, estimated, for a fetch from the L2 cache; see e.g. table 3.1) is not enough to materially affect the warp stall reasons. Unhidden latency is unhidden latency, whether it is 400 cycles or 200 cycles. You might want to investigate exposing more work to the GPU, in addition to “optimizing” the utilization of the memory bus.
In addition to what @Curefab noted, one crucial issue with software-controlled prefetch is how far ahead to fetch. My experience with software-initiated prefetch is that the “optimal” prefetch distance is optimal for just one particular processor generation, or even just one particular processor, and can be sub-optimal or even a performance detractor on others.
In other words, software-initiated prefetch is brittle performance-wise and therefore best avoided unless one is focused on one particular platform. Most processor families have therefore ultimately adopted HW-controlled prefetching. I am not sure where GPUs stand in this regard.
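One way to cope with that brittleness is to make the prefetch distance a compile-time parameter and benchmark several values per target GPU. A sketch, where the kernel, its body, and the candidate distances are all hypothetical:

```cuda
// Sketch: prefetch distance as a template parameter, so each GPU
// generation can be tuned separately instead of hard-coding one value.
template <int PF_DIST>
__global__ void scale(const float *__restrict__ in,
                      float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + PF_DIST < n) // prefetch hint only; no register dependency
        asm volatile("prefetch.global.L2 [%0];" :: "l"(&in[i + PF_DIST]));
    if (i < n)
        out[i] = in[i] * 2.0f;
}

// Benchmark e.g. scale<256>, scale<512>, scale<1024> on each target device
// and pick the winner; the best distance may differ per architecture.
```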