In the programming guide, it says:
Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified
L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is,
by default, not cached in the unified L1/texture cache, but caching may be enabled using the following
mechanisms:
▶ Perform the read using inline assembly with the appropriate modifier as described in the PTX
reference manual;
▶ Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except
reads that are performed using inline assembly with a modifier that disables caching;
▶ Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.
I am not sure whether I correctly understand the expression above. It seems to me that a cuda program will be compiled into a sequence full of strong operations if no flag is specified, because the compiler will not cache most of the variables in L1 cache, so at least those operations will be strong operations in the scope of CTA.
But in the pxt isa, it says:
The default load instruction cache operation is ld.ca, which allocates cache lines in
all levels (L1 and L2) with normal eviction policy. Global data is coherent at the L2
level, but multiple L1 caches are not coherent for global data.
Again, I am not sure if I misunderstood it, but it seems contradictory to what the programming guide says.
I write a simple demo program as below:
__global__ void test_kernel(int* data_input,int *out){
out[0]=data_input[0]+1;
}
int main(){
test_kernel<<<1,1>>>(new int,new int);
}
First, I use the following command line arguments to compile the program and obtain the sass assembly:
nvcc -arch=compute_80 -code=sm_80 demo.cu -o demo.out --keep
nvdisasm demo.cubin > demo_default.sass
Then, I tried the ca
flag argument:
nvcc -arch=compute_80 -code=sm_80 demo.cu -o demo.out -Xptxas -dlcm=ca --keep
nvdisasm demo.cubin > demo_ca.sass
Similarly, I created demo_cg.sass
. And then I compare the three files generated, the only difference is as below:
LDG.E R2, [R2.64] ; //demo_default.sass
LDG.E.STRONG.SM R2, [R2.64] ; //demo_ca.sass
LDG.E.STRONG.GPU R2, [R2.64] ; //demo_cg.sass
This result is very confusing to me. It works as the very opposition to my first thought. These are my questions:
- What is the real usage of “dlcm=ca”? I thought it should be used when I want my data to be cahced in L1, but it seems it works totally the opposite, when this flag is used, the load instruction is becoming “stronger”, not “weaker”.
- How to understand ptx isa’s expression that “.ca” is the default operator? Actually, given the different operators provided in the ptx isa, I have never seen any of them being used in the generated ptx code. Does that mean all operations are “.ca” operations?