Disabling cache and positive L1 throughput

I tested disabling and enabling the L1 cache with the following nvcc options:

Disable: -Xptxas -O3,-v,-dlcm=cg
Default: -Xptxas -O3,-v

However, in the profiling summary for the “disable” run, I still see some positive L1/tex throughput.

Disable:
image

Default:
image

That 50% utilization for -dlcm=cg is somewhat confusing here. Any thoughts on that?

Does your program write a lot? AFAIK dlcm only changes loads.

Alternatively, L1 may still transfer the loaded data without caching it.

It’s a simple matrix multiplication kernel:

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < width && col < width) {
    float sum = 0.0f;
    for (int k = 0; k < width; k++) {
      sum += v1[row*width+k] * v2[k*width+col];
    }
    v3[row*width+col] = sum;
  }

Perhaps someone more knowledgeable from Nvidia will answer.

Until then, you could try it with only a few rows and cols and a high width, so that input >> output.
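A rough sketch of that experiment, in case it helps. The original kernel is square, so this assumes a rectangular variant; the name matmulRect, the rows/cols parameters, and the sizes are all made up for illustration and are not from the original post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical rectangular variant of the kernel above:
// a is rows x width, b is width x cols, c is rows x cols.
__global__ void matmulRect(const float *a, const float *b, float *c,
                           int rows, int cols, int width) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < rows && col < cols) {
    float sum = 0.0f;
    for (int k = 0; k < width; k++) {
      sum += a[row * width + k] * b[k * cols + col];
    }
    c[row * cols + col] = sum;
  }
}

int main() {
  // Assumed sizes: only 32 x 32 outputs, but each output reads 2 * 65536 inputs,
  // so loads dominate stores by a wide margin.
  const int rows = 32, cols = 32, width = 1 << 16;
  std::vector<float> ha((size_t)rows * width, 1.0f);
  std::vector<float> hb((size_t)width * cols, 1.0f);
  std::vector<float> hc((size_t)rows * cols, 0.0f);

  float *da, *db, *dc;
  cudaMalloc(&da, ha.size() * sizeof(float));
  cudaMalloc(&db, hb.size() * sizeof(float));
  cudaMalloc(&dc, hc.size() * sizeof(float));
  cudaMemcpy(da, ha.data(), ha.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb.data(), hb.size() * sizeof(float), cudaMemcpyHostToDevice);

  dim3 block(16, 16);
  dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
  matmulRect<<<grid, block>>>(da, db, dc, rows, cols, width);
  cudaError_t err = cudaDeviceSynchronize();

  cudaMemcpy(hc.data(), dc, hc.size() * sizeof(float), cudaMemcpyDeviceToHost);
  printf("status: %s, c[0] = %.1f\n", cudaGetErrorString(err), hc[0]);

  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);
  return err == cudaSuccess ? 0 : 1;
}
```

Building this once with -Xptxas -dlcm=cg and once without, then comparing the L1/TEX numbers, should show whether the remaining throughput comes from loads or from stores.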

What does the memory section show? (You only pasted the summary page.)

I think one possibility is that the compiler is discovering/deciding that some of your input data is constant/read-only, and is loading it through the read-only cache mechanism. This would show up as something like an LDG.E.CONSTANT instruction in SASS. (That is not the same as __constant__, i.e. the constant memory space, which uses LDC/ULDC instructions.)

This instruction/path uses the L1/TEX hardware but is unaffected by -Xptxas -dlcm=cg.
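To check this, here is a minimal sketch with two variants of the kernel. The kernel names and the test harness are made up for illustration. The first variant marks the inputs const __restrict__, which permits (but does not force) the compiler to use the read-only path. The second uses the __ldcg() intrinsic to request a cache-global load on each access, independent of -dlcm. Which SASS instruction each one actually produces depends on the compiler version and target architecture, so it is worth dumping the SASS (for example with cuobjdump -sass) rather than assuming.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant A (hypothetical name): const + __restrict__ tells the compiler the
// inputs are read-only and not aliased, so it MAY emit read-only-path loads.
__global__ void matmulReadOnly(const float *__restrict__ v1,
                               const float *__restrict__ v2,
                               float *__restrict__ v3, int width) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < width && col < width) {
    float sum = 0.0f;
    for (int k = 0; k < width; k++) {
      sum += v1[row * width + k] * v2[k * width + col];
    }
    v3[row * width + col] = sum;
  }
}

// Variant B (hypothetical name): __ldcg() requests a cache-global load for
// each access explicitly, regardless of the -dlcm setting.
__global__ void matmulLdcg(const float *v1, const float *v2, float *v3,
                           int width) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < width && col < width) {
    float sum = 0.0f;
    for (int k = 0; k < width; k++) {
      sum += __ldcg(&v1[row * width + k]) * __ldcg(&v2[k * width + col]);
    }
    v3[row * width + col] = sum;
  }
}

int main() {
  const int width = 256;  // assumed size, small enough to run quickly
  const size_t n = (size_t)width * width;
  std::vector<float> h(n, 1.0f), outA(n), outB(n);

  float *d1, *d2, *dA, *dB;
  cudaMalloc(&d1, n * sizeof(float));
  cudaMalloc(&d2, n * sizeof(float));
  cudaMalloc(&dA, n * sizeof(float));
  cudaMalloc(&dB, n * sizeof(float));
  cudaMemcpy(d1, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d2, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  dim3 block(16, 16);
  dim3 grid((width + block.x - 1) / block.x, (width + block.y - 1) / block.y);
  matmulReadOnly<<<grid, block>>>(d1, d2, dA, width);
  matmulLdcg<<<grid, block>>>(d1, d2, dB, width);
  cudaError_t err = cudaDeviceSynchronize();

  cudaMemcpy(outA.data(), dA, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(outB.data(), dB, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Both variants compute the same product; only the load path may differ.
  bool same = (err == cudaSuccess);
  for (size_t i = 0; i < n && same; i++) same = (outA[i] == outB[i]);
  printf("%s\n", same ? "results match" : "mismatch or CUDA error");

  cudaFree(d1);
  cudaFree(d2);
  cudaFree(dA);
  cudaFree(dB);
  return same ? 0 : 1;
}
```

Profiling the two kernels separately and looking at the memory section (not just the summary page) should make it clearer which load path each one takes. Note that even the explicit cache-global loads may still show some L1/TEX traffic if the data passes through that unit without being cached there, as suggested earlier in the thread.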
