CUDA L2 residency control

I am exploring the performance improvements of L2 residency control in CUDA. I came across the following blog post, which I tested on my own device (an RTX 3070 Mobile with a 4 MB L2 cache). After making the lut_persistent array L2 resident, I observed similar improvements in runtime and also saw reduced DRAM read traffic in Nsight Compute (using the command ncu --cache-control none --metrics dram__bytes_read.sum,dram__bytes_write.sum <program>). For reference, I used the following kernel:

__global__ void reset_data(int* data_streaming, int const* lut_persistent,
                           size_t data_streaming_size,
                           size_t lut_persistent_size)
{
    size_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{idx}; i < data_streaming_size; i += stride)
    {
        data_streaming[i] = lut_persistent[i % lut_persistent_size];
    }
}
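
For completeness, I set up the persistence window on the stream roughly as in the blog post. The sketch below is a reconstruction rather than my exact code; lut_persistent_bytes and the hitRatio value are placeholders:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Set aside L2 for persisting accesses (the driver clamps this to
// cudaDevAttrMaxPersistingL2CacheSize).
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, lut_persistent_bytes);

cudaStreamAttrValue attr{};
attr.accessPolicyWindow.base_ptr  = lut_persistent;
attr.accessPolicyWindow.num_bytes = lut_persistent_bytes;
attr.accessPolicyWindow.hitRatio  = 1.0f; // placeholder value
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);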

However, the accessPolicyWindow method doesn’t offer fine-grained control over L2 persistence windows (which I require for my actual use case). I therefore tried cuda::apply_access_property as a substitute, but I no longer see any performance improvement, neither in runtime nor in DRAM read traffic. I used the following kernel for pinning to the L2 cache:

#include <cooperative_groups.h>
#include <cuda/annotated_ptr> // cuda::apply_access_property, cuda::access_property

__global__ void pin(int* ptr, size_t N) {
    auto g = cooperative_groups::this_grid();
    // Grid-stride loop: each thread marks one int at a time as "persisting".
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        cuda::apply_access_property(ptr + idx, sizeof(int),
                                    cuda::access_property::persisting{});
    }
}
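
I launch pin once over the LUT before the workload kernel, along these lines (the launch configuration shown is illustrative, not my exact code):

size_t const threads{256};
size_t const blocks{(lut_persistent_size + threads - 1) / threads};
pin<<<blocks, threads>>>(lut_persistent, lut_persistent_size);
cudaDeviceSynchronize(); // ensure the hints are applied before the workload runs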

As an alternative, I also tested the prefetch PTX instruction that cuda::apply_access_property uses internally as follows:

#include <cstdint>

#define L2_LINE_SIZE 128

__global__ void set_l2_persistence_ptx(void* base_ptr, size_t num_bytes) {
    // One prefetch per 128-byte L2 line. Note: as written, every launched
    // thread walks the full range, so a single-thread launch suffices.
    for (size_t i = 0; i < num_bytes; i += L2_LINE_SIZE) {
        asm volatile("prefetch.global.L2::evict_last [%0];"
                     :: "l"(static_cast<uint8_t*>(base_ptr) + i));
    }
}
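
A minimal invocation would be a single-thread launch, since the loop above is serial within each thread (names are placeholders):

set_l2_persistence_ptx<<<1, 1>>>(lut_persistent, lut_persistent_bytes);
cudaDeviceSynchronize();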

However, I still don’t see any improvements.
I would really appreciate it if someone with more experience with L2 residency control could explain these observations.

What is your use case?

I am trying to see whether I can get performance improvements in DL inference applications. This is in part inspired by AutoScratch from MLSys 2023, where the authors cache parts of the activation buffer in L2. I am exploring the possible performance improvements from caching the weights as well (e.g., the weights of fully connected or convolution layers).
I also have a follow-up question about L2 residency control with the accessPolicyWindow approach for the following kernel. It isn’t really a matmul kernel but a modified version of one:

__global__ void kernel(int const* weightMatrix, int* outputMatrix) {
    // N (matrix dimension) and batch_size are compile-time constants
    // defined elsewhere.
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;

    size_t idx = row * N + col;
    size_t stride = N * N;

    for (size_t i{idx}; i < stride * batch_size; i += stride) {
        int val = 0;
        for (int k = 0; k < N; k++) {
            val += weightMatrix[row * N + k] * weightMatrix[k * N + col];
        }
        outputMatrix[i] = val;
    }
}

I made weightMatrix L2 resident expecting to see reduced DRAM read traffic, but there was virtually no difference. Another strange thing I observed is that as I reduce the range of k (i.e., the bound in for (int k = 0; k < N; k++)), I start seeing improvements in both runtime and DRAM read traffic compared to no L2 residency control. For example, for (int k = 0; k < 1; k++) shows the largest improvement, while for (int k = 0; k < N; k++) shows no difference, and the DRAM bytes read grow proportionally to the range. I didn’t expect this, since weightMatrix is supposed to be L2 resident, so the reads inside the inner loop shouldn’t contribute to DRAM read traffic.
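(For a sense of scale: the persisting footprint here is N * N * sizeof(int) bytes, so at N = 1024, for example, the weight matrix alone is already 4 MB, i.e. my entire L2.)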