I am attempting to utilize the prefetch PTX instruction, and according to Nsight Compute, the compiled SASS code positions the prefetch(CCTL) before the LDG instruction. I have two questions regarding this:
Is the prefetch(CCTL) instruction asynchronous? Does the Streaming Multiprocessor (SM) wait for the prefetch operation to complete?
If the prefetch is placed before the LDG instruction and they are accessing different memory addresses, as illustrated in the diagram below, will the prefetch operation monopolize the memory bandwidth and consequently cause the LDG instruction to wait until the prefetch operation finishes before it starts reading from global memory?
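For context, here is a minimal sketch of the kind of code in question: a prefetch issued from CUDA C++ via inline PTX, ahead of an ordinary load. The kernel and the prefetch distance of 1024 elements are illustrative assumptions, not the actual code from my application.

```cuda
// Sketch: issuing prefetch.global.L2 via inline PTX (a hint to bring the
// line at ptr into the L2 cache), followed by a normal global load.
__device__ void prefetch_l2(const void *ptr)
{
    asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr));
}

__global__ void scale(const float *__restrict__ in,
                      float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1024 < n)
        prefetch_l2(&in[i + 1024]); // compiles to CCTL in SASS, before the LDG
    if (i < n)
        out[i] = in[i] * 2.0f;      // the LDG the question refers to
}
```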
I cannot speak to your direct questions. However, I have had rather mixed results with the prefetch instructions: in some cases they gave a small performance boost, but sometimes they made performance worse:
- You cannot fully control the positioning of the prefetch instruction in code.
- The compiler and assembler often hoist load operations to the beginning anyway (especially when unrolling loops and enough registers are available). IIRC quite a large number of load requests can be in flight simultaneously.
- With the scheduler switching among enough non-blocked warps, you often effectively get the same latency-hiding result as with manual prefetching.
Nevertheless, I am also waiting to hear answers to your questions.
I’m attempting to use prefetching to mitigate warp stalls caused by Long Scoreboard dependencies, but I’m puzzled. According to NVIDIA Nsight Compute, my L2 cache hit rate has improved by 50%, yet there is barely any reduction in warp stalls.
Pretty much all instructions in a CUDA GPU are asynchronous. The GPU does not “wait” for any of them to complete. However, an instruction can introduce a dependency in a following operation. The prefetch op doesn’t introduce any dependencies that I know of. A subsequent memory read (for example) should get issued independently of the status of a previous prefetch op.
A prefetch op is a hint. What the GPU will do exactly is undefined, AFAIK, intentionally. But if the prefetch op is to have any meaningful significance, it is effectively a global load that does not actually target a register. So the behavior of the prefetch op, because it is undefined, is anywhere from a no-op to the behavior you would get if you issued a global load.
The answer to your question then is that the behavior could vary from no impact to the behavior you would get if you issued a global load followed by another global load (different addresses). The behavior of the memory pipe is not specified to a level of detail to tell you precisely how those two back-to-back global loads to different addresses flow through the pipe, and on top of that it would certainly depend on what else is going on.
Perhaps the warp stalls were not due to global load dependencies. Or perhaps the warp stall reasons have shifted from a global load dependency to something else. Or perhaps the difference in latency (roughly 400 cycles, estimated, for a fetch from DRAM vs. roughly 200 cycles, estimated, for a fetch from the L2 cache; see e.g. table 3.1) is not enough to materially affect the warp stall reasons. Unhidden latency is unhidden latency, whether it is 400 cycles or 200 cycles. You might want to investigate exposing more work to the GPU, in addition to “optimizing” the utilization of the memory bus.
In addition to what @Curefab noted, one crucial issue with software-controlled prefetch is how far ahead to fetch. My experience with software-initiated prefetch is that the “optimal” prefetch distance is optimal for just one particular processor generation, or even just one particular processor, and can be sub-optimal or even a performance detractor on others.
In other words, software-initiated prefetch is brittle performance-wise and therefore best avoided unless one is focused on one particular platform. Most processor families have therefore ultimately adopted HW-controlled prefetching. I am not sure where GPUs stand in this regard.
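One way to cope with that brittleness is to make the prefetch distance a compile-time parameter and benchmark several values per target GPU. A sketch, where the kernel, its body, and the candidate distances are all hypothetical:

```cuda
// Sketch: prefetch distance as a template parameter, so each GPU
// generation can be tuned separately instead of hard-coding one value.
template <int PF_DIST>
__global__ void scale(const float *__restrict__ in,
                      float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + PF_DIST < n) // prefetch hint only; no register dependency
        asm volatile("prefetch.global.L2 [%0];" :: "l"(&in[i + PF_DIST]));
    if (i < n)
        out[i] = in[i] * 2.0f;
}

// Benchmark e.g. scale<256>, scale<512>, scale<1024> on each target device
// and pick the winner; the best distance may differ per architecture.
```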