Some issues regarding the use of prefetch in a CUDA kernel

__device__ __forceinline__
void prefetch_64(const void *ptr, uint32_t ret_0) {

    asm volatile (
        "ld.global.L2::64B.b32 %0, [%1];"
        : "=r"(ret_0)
        : "l"(ptr)
    );

    return;
}
// Define a prefetch function using the prefetch.global.L2 instruction
__device__ __forceinline__
void prefetch_l2(const void *ptr) {
    // Prefetch data into the L2 cache; no eviction-priority qualifier
    // (such as ::evict_last) is specified here
    asm volatile (
        "prefetch.global.L2 [%0];"
        : // no output operands
        : "l"(ptr) // input operand: %0 corresponds to ptr, "l" denotes the memory address
    );
}
    int start = img_idx + tid/32 * W * C + tid%32 * C;
    uint32_t dummy;  // dummy variable to receive the loaded word
    for (int pre_num = 0; pre_num < BLOCK_H/(tid/32); pre_num++) {
        prefetch_64(&inp[start + pre_num * 4 * W * C], dummy);
    }
    //load GM to SM

    //img 16*16*32
    //thread 32*4
    uint32_t img_frag[16*2];
    int imgm;
    int shm_bank;


    int iter = BLOCK_H*BLOCK_W * C / (produce_num * 8);

    #pragma unroll
    for(int i = 0; i<iter; i++){
        // position within the image + within the block + within the warp + within the thread
        imgm = img_idx + i * 2 * W * C + warp_id/2 * W * C + tid%64 * 8;
        int bs_gm = imgm / (H*W*C);
        int h_gm = imgm % (H*W*C) / (W*C);
        int w_gm = imgm % (H*W*C) % (W*C) / C;
        if (bs_gm < BS && h_gm >= h_idx && h_gm < h_idx+block_h && w_gm >= w_idx && w_gm < w_idx+block_w) {
            *((float4*)&img_frag[0 + i*4]) = *((float4*)&inp[imgm]);
            //ldg_cg_v4(&inp[imgm],img_frag[0 + i*4],img_frag[1 + i*4],img_frag[2 + i*4],img_frag[3 + i*4]);
            shm_bank = swizzle(tid*8 + i * 32 * 32);
            *((float4*)&img_shm[shm_bank]) = *(float4*)(&img_frag[0 + i*4]);
        }

    }

In this data-loading code, I attempted to use a prefetch instruction to accelerate the loading, because the data access pattern is fixed and all the data loaded by the SMs in one wave fits in the L2 cache. I checked the PTX documentation and found the two prefetch instructions shown above, then tested each of them against a version without prefetching.

With the first instruction (the ld.global.L2::64B.b32 form), the L2 cache hit rate did not change and the data-loading time stayed the same. When I examined the PTX and SASS, the instruction was present at the PTX level, but no prefetch instruction appeared at the SASS level; its corresponding SASS code is LDG.E.U16.SYS R5, [R2];

With the second instruction (prefetch.global.L2), the L2 cache hit rate increased by 15% to 87%, yet the data-loading time became longer, which puzzles me a lot. Its SASS code really is a prefetch instruction, CCTL.E.PF2 [R2];

This behavior confuses me and I would like to know why. Please help me.

When posting code on these forums, please format it correctly. One method to do so is the following:

  • click the pencil icon below your post to edit your post
  • select the code
  • press the </> button at the top of the edit window
  • save your changes

Please do that now.

The function prototype you have for prefetch_64 (which takes ret_0 by value) does not permit modification of the ret_0 variable in the calling space. Therefore a sensible option for the compiler would be to omit that instruction altogether. I’ll also excerpt some text from the PTX manual:

The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.

The qualifier .level::prefetch_size may only be used with .global state space and with generic addressing where the address points to .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.

The .level::prefetch_size qualifier is treated as a performance hint only.

The use of “additional data” suggests to me that this instruction variant is not purely a prefetch operation. Just like any other ld instruction, it loads the designated register. Since (the data in) that designated register does not persist beyond the scope of that function, and that function does nothing with the fetched data, it does not surprise me that that PTX instruction would not survive SASS compilation.
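
For illustration, here is a sketch (untested, my own variant rather than code from the original post) of how the helper could be written so that the loaded value is observable by the caller, which removes the easiest justification for eliding the load:

__device__ __forceinline__
void prefetch_64_ref(const void *ptr, uint32_t &ret_0) {
    // ret_0 is now a reference, so the loaded word escapes the function;
    // the caller still has to actually use ret_0, or the load may again be
    // treated as dead code.
    asm volatile (
        "ld.global.L2::64B.b32 %0, [%1];"
        : "=r"(ret_0)
        : "l"(ptr)
    );
}

Even with this change, the .L2::64B part remains only a hint, so the resulting SASS may or may not contain an actual prefetch.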

With respect to the “additional” prefetch behavior, I note the use of the word “hint” multiple times in the description. By my read of things, the use of the word “hint” in compilation flow is less stringent than language features which must be respected. “hint” strongly suggests that the described behavior is optional by the compiler, and may be discarded without violating correctness or programmer intent. So the compiler would appear to have latitude to simply forgo the prefetching behavior, and it may have decided that the ability to optimize-out the instruction per the previous reasoning I provided may be important enough to forgo the hint.

There is of course no way I can reason out compiler behavior like this to a certitude, I’m simply offering what I consider to be a plausible explanation (a “guess”) for why the instruction may have “disappeared” at the SASS level. (And, as an aside, the proposed SASS instruction you found does not look to me like an actual instantiation of this PTX instruction, but I may be mistaken there, and the provided code is not compilable, so I cannot inspect it that way.)

With respect to the second line of inquiry (the higher L2 hit rate but slower loads with prefetch.global.L2): it would be rather difficult, in my opinion, to sort out code behavior based on the presence or absence of a single instruction. Moreover, things like machine architecture may matter, and I certainly cannot make much headway on performance problems at this level without compilable code. Even with compilable code, my own time to spend on such an item might be quite limited; however, this is a community, of course (not just me; I’m speaking for myself). The profilers (e.g. Nsight Compute) may be useful tools to judge the origin of performance observations.

I do not want to discourage experimentation with software-defined prefetch, but here are some general observations from personal historical experience.

(1) Software-defined prefetch is weak sauce. In the best case it usually helps a little bit on the processor architecture where it was introduced and for which it was intended, but it often becomes meaningless or even counter-productive on future architectures (where such instructions may be actively ignored by hardware for that reason). One aspect is that the optimal prefetch distance varies with the specifics of the memory subsystem, and there is no good way to determine it statically at compile time and across different architectures. Hardware prefetching mechanisms can do much better, as they can incorporate runtime feedback (access patterns, prefetch efficacy) and adjust accordingly.

(2) When something is labelled as a user-provided hint in a software specification, there is a 90% chance that it is simply ignored by implementations. When it comes to compilation specifically: from conversations I have had with compiler engineers in the past, there is usually a good reason for ignoring hints: the sophistication of modern compilers (available code transformations in conjunction with heuristics) largely exceeds the sophistication of most programmers, so in most situations better performance is achieved by just letting the compiler do its thing. Exceptions are possible, of course. Compilers can usually eke out a few more percent if they support profile-guided optimization, i.e. a form of deferred run-time information, but to my knowledge this is not a feature presently offered by the CUDA compiler. Some recent data I came across (sorry, cannot find the source any more) suggests that the benefit of profile-guided optimization tends to be small (low single-digit percentages) these days.

Thank you for answering my question; I understand what you said. Regarding the first point, I also suspect that the compiler ignored my hint. Regarding the second point, I used Nsight Compute to analyze the kernel with and without the prefetch instruction (that is the only difference between the two experiments). Looking at the memory analysis results, adding the prefetch instruction raised the L2 cache hit rate by 16%, while the program ran approximately 7% slower.

I understand what you said. Effective prefetching optimization is indeed very difficult; I have rarely seen it done well, and it is also hard to find performance tests about prefetching on the Internet. I have only seen one related article on NVIDIA’s Technical Blog. The GPU I am using is a T4, and I wanted to experiment with the prefetching optimization described in that article, so I went through the PTX documentation and found the two prefetch instructions this GPU can use. I wanted to experiment with the L1 and L2 cache hints because the data access pattern is fixed, but unfortunately my experiments did not give the desired result. Could you tell me where I can learn this kind of lower-level, close-to-hardware GPU optimization knowledge?

Join NVIDIA’s GPU architecture group. Learn how the hardware works internally. After a few years, switch to NVIDIA’s DevTech organization. Now you can apply your knowledge of the GPU’s internal workings (only small parts of which are publicly documented) to help customers tune their software.

I am not trying to be facetious. When I worked at NVIDIA (more than a decade ago), my go-to person for questions on memory optimization was a senior DevTech engineer whose professional career in part matched what I outlined above.

With sufficient time and energy for experiments, and by creating one’s own SASS-level development tools, one could presumably approximate this level of expertise outside of NVIDIA, and a few people have done this kind of work, but as I recall they were largely focused on optimizing the compute-bound portions rather than the memory-bound portions of CUDA programs.

Prefetching is a lot about the timing of the prefetch. Experiment with loop unrolling factors.
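
As a concrete example of what to sweep, here is a minimal sketch (a hypothetical streaming kernel of my own, not the kernel from this thread) that prefetches a fixed number of iterations ahead, reusing the prefetch_l2 helper from above; PF_DIST and the unroll factor are the knobs worth timing on the target GPU:

// Hypothetical kernel; PF_DIST and the unroll factor are assumptions to tune.
__global__ void scale_with_prefetch(const float *inp, float *out, int n) {
    constexpr int PF_DIST = 8;                        // iterations ahead; tune per GPU
    int stride = blockDim.x * gridDim.x;
    #pragma unroll 4                                  // unroll factor to experiment with
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (i + PF_DIST * stride < n)
            prefetch_l2(&inp[i + PF_DIST * stride]);  // hint only; may be ignored
        out[i] = inp[i] * 2.0f;                       // placeholder work on the current element
    }
}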

But keep in mind that the philosophy of Nvidia GPUs (and similar accelerators) is that it should not make much difference:

Nvidia GPUs are all about bandwidth and less about latency. Latency is hidden by having many threads waiting to be selected. So issuing the actual memory operation (without prefetching) and then waiting until the data arrives is often just as good as prefetching. Additionally, the compiler moves memory instructions far ahead of the instructions that depend on the read value, so the threads can still do useful work in the meantime, similar to prefetching. Memory operations are asynchronous.

One actual difference of prefetching is that it can span loop iterations. However, even that could be simulated in plain C/C++ by reading the data from the next iteration into local variables.
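
A minimal sketch of that idea (again a hypothetical streaming kernel, not the poster's code): the load for iteration i+1 is issued before the work on iteration i, so the long-latency read overlaps with the computation, much as a prefetch would.

__global__ void scale_pipelined(const float *inp, float *out, int n) {
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float cur = inp[i];                      // element for the first iteration
    for (; i < n; i += stride) {
        float nxt = 0.0f;
        if (i + stride < n)
            nxt = inp[i + stride];           // issue next iteration's load early
        out[i] = cur * 2.0f;                 // work on the current element
        cur = nxt;                           // rotate the "prefetched" register
    }
}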

So software prefetching is difficult to do well, the CUDA architecture is designed so that prefetching is rarely needed, normal memory operations typically already behave much like prefetches, and the compiler optimizes with all of that in mind.

It is seldom worth it even to try.

Reasons to try it nevertheless would be if you are short on threads that can be selected for issue and the existing ones spend a lot of time waiting for memory reads to arrive.

Also, if you want to optimize for latency or have some reasons not to unroll loops.

Sometimes also just to try out whether you can get another 0.5% or 1% on a specific GPU after optimizing everything else.

Thank you for your answer. I now have a much better grasp of the idea that the GPU cares more about bandwidth than about latency. Indeed, both the switching between blocks and the switching between warps reflect this; when writing CUDA we mainly need to provide enough work to keep the GPU well utilized.

Thank you for your suggestion. It sounds great, but it’s too difficult for me.
Yes, focusing on optimizing the compute-bound portion is quite right. I’m just trying to improve cache utilization, which I think can help me distribute data across the blocks more effectively.

There are some more architecture details, helping with it:

  • The SMs have a large register file
  • Memory operations are asynchronous and the wait is on the instruction using the result, not on the read (and definitely not on the write)
  • The compiler tries to put reads into the beginning of loops and the usage of the data at the end of loops

Together, those create prefetching-like behavior and help make bandwidth the (only) relevant parameter.

Regarding the second point, I would like to ask: is this asynchronous memory operation only available through cp.async on the SM80 architecture and above?

It is true that with compute capability 8.0 (Ampere) a new asynchronous engine was introduced for copying from global memory to shared memory.

But what I meant was that even on earlier compute capabilities each simple memory instruction does not block the thread waiting for its successful completion, but is executed asynchronously in the background. Only dependent instructions that use the result of a read have to wait.

I have a small doubt about this point. I once read the following: a CUDA GPU is not an out-of-order machine. All instructions are issued in the order the compiler emitted them; you only need to inspect the compiled code. The GPU scheduler will not reorder instructions with respect to a particular thread or warp. Instructions are issued per warp, and there is also no guarantee that the warp scheduler will issue the second CS2R instruction immediately after (i.e. in the next available issue slot) the LDG instruction; it may issue other instructions, from other eligible warps, in between the LDG and the CS2R for a given warp.
From the perspective of a single warp, its instructions are all issued in order, right? For example, while a load instruction is still in flight, the next instruction issued can be another instruction of the current warp that does not depend on that data, or a ready instruction from another active warp.

That is right. The instructions are issued in order. But that does not mean that an instruction has to be finished for the next one to be issued.

First, there are fixed-length pipelines, e.g. the arithmetic pipelines local to an SM partition. They can have several independent instructions from the same or different warps in various stages of completion at the same time. The compiler can predict when the result will be ready after issuing.

And second, there are variable-length pipelines like the MIO pipeline, which is shared by all 4 SM partitions (SMPs) of an SM. The SMP waits asynchronously for the results and can issue non-dependent instructions from the same warp in between. The short and long scoreboards keep track of those dependencies. The results are asynchronously stored into the register file.

Thank you for your reply. It has given me a better understanding of the GPU’s underlying principles. So from the perspective of a single warp, the compiler places independent instructions as early as possible, and then warp switching keeps instruction issue busy, right? And the pipelined execution of instructions is what proceeds asynchronously across all the warps within an SM.

Regarding what you said, that even for earlier compute capabilities each simple memory instruction does not block the thread waiting for its successful completion but is executed asynchronously in the background, and that only dependent instructions using the result of a read have to wait:

Is there any difference between this kind of asynchronous load and the cp.async asynchronous copy?

I think you got the understanding right.
What you call the transmission of instructions I would call the scheduling and issuing of instructions.

(There is also the actual fetching of the instructions themselves, through two levels of instruction caches, which could be called transmission.)

There are two differences with cp.async:

  • cp.async can do several copies (“in a loop”), not just a single 4/8/16 byte element
  • the data is stored into shared memory instead of into registers

You can get a similar result by dedicating one or a few warps of your kernel blocks to copying.
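
Here is a rough sketch of that pattern (my own illustration with an assumed tile size, not code from this thread): one warp per block stages the next tile into shared memory while the remaining warps work on the current one, using double buffering.

// Assumes blockDim.x > 32 and a fixed TILE size; purely illustrative.
#define TILE 256

__global__ void copy_warp_kernel(const float *inp, float *out, int ntiles) {
    __shared__ float buf[2][TILE];
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    size_t base = (size_t)blockIdx.x * ntiles * TILE;

    // prime the first buffer with tile 0
    if (warp == 0)
        for (int i = lane; i < TILE; i += 32)
            buf[0][i] = inp[base + i];
    __syncthreads();

    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        if (warp == 0 && t + 1 < ntiles) {
            // copy warp: stage the NEXT tile while the other warps compute
            for (int i = lane; i < TILE; i += 32)
                buf[nxt][i] = inp[base + (size_t)(t + 1) * TILE + i];
        } else if (warp != 0) {
            // compute warps: work on the CURRENT tile
            for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                out[base + (size_t)t * TILE + i] = buf[cur][i] * 2.0f;
        }
        __syncthreads();  // next tile staged and current tile consumed
    }
}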

Thank you for your answer. Yes, what I meant earlier was the issuing (“launching”) of instructions.

Regarding the cp.async features you mentioned: the first one is cp.async.bulk, or potentially merging a copy into 16 bytes when the address is aligned.

The second is that it removes the intermediate staging registers by merging the LDG and STS into a single LDGSTS, decreasing the instruction count. Both come down to using larger, merged instructions to reduce the number of instructions, and the same trend shows up in the later bulk data copies and the larger-tile MMA computations. It seems GPU architectures are increasingly built around loading and computing on large blocks of data, and the programming abstractions are moving to a correspondingly higher level.

What I want to ask is whether there is any difference in asynchrony between this implicit asynchronous load and the cp.async asynchronous copy. Because cp.async has an explicit async modifier while the ld instruction does not, I am very curious whether its asynchrony differs.

(You are right, only the bulk variant copies several elements.)

One difference between cp.async and normal loads is the means of “re-synchronization” of the asynchronous operations.

For normal loads from global memory, the re-synchronization works with the help of the long scoreboard, as soon as the result registers are used in instructions dependent on the result.

For cp.async, the results are transferred into shared memory. You have to use cp.async.wait_group or cp.async.wait_all for re-synchronization.

Also, the ordering requirements of the memory transfers (not of the instructions themselves) are weaker for the cp.async instruction.
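
As a concrete sketch of that re-synchronization (my own illustration of the mechanism, assuming an sm_80+ target; not code from this thread), a cp.async copy into shared memory followed by cp.async.wait_group might look like this:

#include <cstdint>

// One 16-byte element per thread is copied global -> shared with cp.async,
// then cp.async.wait_group 0 waits for the copy before the data is used.
// Assumes blockDim.x <= 256.
__global__ void cp_async_demo(const float4 *inp, float4 *out) {
    __shared__ float4 tile[256];
    int i = threadIdx.x;

    // cp.async requires a shared-state-space address for the destination
    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(&tile[i]));

    asm volatile(
        "cp.async.ca.shared.global [%0], [%1], 16;\n"  // start the async copy
        "cp.async.commit_group;\n"                     // close the current group
        "cp.async.wait_group 0;\n"                     // wait for it to land in shared memory
        :: "r"(smem), "l"(&inp[blockIdx.x * blockDim.x + i])
        : "memory");

    __syncthreads();  // make the whole tile visible to every warp of the block
    out[blockIdx.x * blockDim.x + i] = tile[(i + 1) % blockDim.x];
}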

Thank you for your answer. I think I understand now: one relies on automatic hardware synchronization (via the scoreboard), while the other requires explicit software synchronization.