How does cuda global memory's L1 caching work

shawda.ssik600015 · July 11, 2024, 10:08am

In the programming guide, it says:

Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified
L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is,
by default, not cached in the unified L1/texture cache, but caching may be enabled using the following
mechanisms:
▶ Perform the read using inline assembly with the appropriate modifier as described in the PTX
reference manual;
▶ Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except
reads that are performed using inline assembly with a modifier that disables caching;
▶ Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.

I am not sure whether I correctly understand the expression above. It seems to me that a cuda program will be compiled into a sequence full of strong operations if no flag is specified, because the compiler will not cache most of the variables in L1 cache, so at least those operations will be strong operations in the scope of CTA.

But in the pxt isa, it says:

The default load instruction cache operation is ld.ca, which allocates cache lines in
all levels (L1 and L2) with normal eviction policy. Global data is coherent at the L2
level, but multiple L1 caches are not coherent for global data.

Again, I am not sure if I misunderstood it, but it seems contradictory to what the programming guide says.

I write a simple demo program as below:

__global__ void test_kernel(int* data_input,int *out){
    out[0]=data_input[0]+1;   
}

int main(){
    test_kernel<<<1,1>>>(new int,new int);
}

First, I use the following command line arguments to compile the program and obtain the sass assembly:

nvcc -arch=compute_80 -code=sm_80 demo.cu -o demo.out --keep
nvdisasm demo.cubin > demo_default.sass

Then, I tried the ca flag argument:

nvcc -arch=compute_80 -code=sm_80 demo.cu -o demo.out -Xptxas -dlcm=ca --keep
nvdisasm demo.cubin > demo_ca.sass

Similarly, I created demo_cg.sass. And then I compare the three files generated, the only difference is as below:

LDG.E R2, [R2.64] ; //demo_default.sass
LDG.E.STRONG.SM R2, [R2.64] ; //demo_ca.sass
LDG.E.STRONG.GPU R2, [R2.64] ; //demo_cg.sass

This result is very confusing to me. It works as the very opposition to my first thought. These are my questions:

What is the real usage of “dlcm=ca”? I thought it should be used when I want my data to be cahced in L1, but it seems it works totally the opposite, when this flag is used, the load instruction is becoming “stronger”, not “weaker”.
How to understand ptx isa’s expression that “.ca” is the default operator? Actually, given the different operators provided in the ptx isa, I have never seen any of them being used in the generated ptx code. Does that mean all operations are “.ca” operations?

Robert_Crovella · July 11, 2024, 2:58pm

-Xptxas -dlcm=cg disables caching in the L1
-Xptxas -dlcm=ca enables caching in the L1

caching at all levels (what ca hint means) is the default behavior, at least for cc 8.0. It also means that for a PTX instruction like ld, which could have a caching modifer, if the cache hint is omitted, then it is as if the instruction were written as ld...ca1 2

It might be instructive to start with an arch that doesn’t have the “new” mnemonics. If I compile your code for sm_52 (default) and use cuobjdump -sass, I get:

    /*0028*/                   LDG.E R2, [R2] ;         /* 0xeed4200000070202 */ // default
    /*0028*/                   LDG.E R2, [R2] ;         /* 0xeed4200000070202 */ // -dlcm=ca
    /*0028*/                   LDG.E.CG R2, [R2] ;      /* 0xeed4600000070202 */ // -dlcm=cg

So this lines up with expectation. The default and the -dlcm=ca case are the same (L1 caching is enabled by default, and is disabled in the cg case via an alternate opcode that has .CG in disassembly)

Based on that I would conclude in the sm_80 case:

    /*0040*/                   LDG.E R2, [R2.64] ;                       /* 0x0000000402027981 */
                                                                         /* 0x000ea2000c1e1900 */ // default
    /*0040*/                   LDG.E.STRONG.SM R2, [R2.64] ;             /* 0x0000000402027981 */
                                                                         /* 0x000ea2000c1eb900 */ // -dlcm=ca
    /*0040*/                   LDG.E.STRONG.GPU R2, [R2.64] ;            /* 0x0000000402027981 */
                                                                         /* 0x000ea2000c1ef900 */  // -dlcm=cg

that from an L1 caching perspective, LDG.E and LDG.E.STRONG.SM are equivalent. Both indicate L1 caching enabled for global loads.

That’s because the -Xptxas -dlcm=... switch applies to code generation after the creation of PTX. It is an override that affects the ptxas tool behavior, which converts PTX to SASS. Regarding the creation/generation of PTX, that is done before that switch is applied. You can certainly write your own PTX that uses the .ca or .cg hints, and you should expect to see that result in the generated SASS code. If you want to see the ptx generated by CUDA C++ compiler (i.e. nvcc) modified, then use the offered options for that.

If your question is or becomes “what do strong and weak mean?” then I would refer you to the relevant section in the PTX guide. It’s rather involved material, and I wouldn’t assume that a statement like

adequately captures the meaning. I’m not suggesting I intend to try to demonstrate comprehension or explain it all, either.

If your question is or becomes “what is the exact difference between these two opcodes, then”:

    /*0040*/                   LDG.E R2, [R2.64] ;                       /* 0x0000000402027981 */
                                                                         /* 0x000ea2000c1e1900 */ // default
    /*0040*/                   LDG.E.STRONG.SM R2, [R2.64] ;             /* 0x0000000402027981 */
                                                                         /* 0x000ea2000c1eb900 */ // -dlcm=ca

I wouldn’t be able to answer that. SASS is not documented to that level.

Curefab · July 11, 2024, 9:30pm

Probably from an L1 caching perspective. Apart from that according to the PTX manual: *the .weak qualifier is assumed by default. The same seems to be true for SASS code. The .WEAK modifier is implicitly assumed.

shawda.ssik600015 · July 12, 2024, 2:35am

Thanks for your explanation. I still have one question, if ca is the default, then why did the programming guide say that? It says ca must be explicitly passed in the compilation argument, but actually it is not necessary at all?

Robert_Crovella · July 12, 2024, 3:33am

devices in the kepler, maxwell, and pascal generations had a complicated history with L1 caching. You can read about it in the tuning guides (e.g. here for maxwell and here for Pascal). The default behavior of the devices varied. For some of these devices, you had to explicitly opt-in to L1 caching for global loads (and some kepler device architectures basically did not allow it at all).

That’s the reason I said this:

(emphasis added)

Since cc7.0, AFAIK and can remember, L1 caching is enabled by default for global loads.

system · July 26, 2024, 3:33am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Maxwell (sm_50) instruction: LDG.E ? CUDA Programming and Performance	25	8554	August 15, 2015
Disabling cache on Fermi architectures Try to disable L1 and L2 CUDA Programming and Performance	11	9259	August 30, 2013
L1 Cache, L2 Cache and Shared memory in Fermi CUDA Programming and Performance	5	23533	March 21, 2011
L2 cache (.cg) memory load performance CUDA Programming and Performance	6	1688	January 5, 2017
Do 7.x devices have a readonly constant cache? CUDA Programming and Performance	4	1579	July 30, 2022
Understanding the functioning of nvprof and .cv data load option CUDA Programming and Performance	8	3071	December 11, 2014
Memory transaction size CUDA Programming and Performance	1	1731	February 12, 2017
Why L1 cache hit ratio become zero on K20? CUDA Programming and Performance	10	5625	January 17, 2013
__constant__ memory in function scope CUDA Programming and Performance	13	4605	June 1, 2015
Disabling L1 cache in visual studio CUDA Programming and Performance	8	837	February 11, 2021

How does cuda global memory's L1 caching work

Related topics