I am exploring the performance improvements of L2 residency control in CUDA. I came across the following blog post, which I tested on my own device (an RTX 3070 Mobile with a 4 MB L2 cache). When making the lut_persistent array L2-resident, I observed similar improvements in runtime and also saw reduced DRAM read traffic in Nsight Compute (using the command ncu --cache-control none --metrics dram__bytes_read.sum,dram__bytes_write.sum <program>). For reference, I used the following kernel:
__global__ void reset_data(int* data_streaming, int const* lut_persistent,
                           size_t data_streaming_size,
                           size_t lut_persistent_size)
{
    size_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
    size_t const stride{blockDim.x * gridDim.x};
    // Grid-stride loop: fill data_streaming from repeated reads of lut_persistent.
    for (size_t i{idx}; i < data_streaming_size; i += stride)
    {
        data_streaming[i] = lut_persistent[i % lut_persistent_size];
    }
}
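For context, the L2 residency setup that produced these improvements follows the blog post's accessPolicyWindow approach. The sketch below shows the general shape; the helper name, stream, and carve-out size are illustrative rather than my exact benchmark code:

#include <cuda_runtime.h>

// Sketch of the accessPolicyWindow setup (host side). Names and the carve-out
// size are illustrative placeholders.
void set_lut_persistence(cudaStream_t stream, int* lut_persistent,
                         size_t lut_persistent_size)
{
    size_t const lut_bytes{lut_persistent_size * sizeof(int)};

    // Reserve part of the L2 cache for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, lut_bytes);

    // Mark the lut_persistent range as persisting for kernels launched on `stream`.
    cudaStreamAttrValue stream_attr{};
    stream_attr.accessPolicyWindow.base_ptr  = lut_persistent;
    stream_attr.accessPolicyWindow.num_bytes = lut_bytes;  // capped by accessPolicyMaxWindowSize
    stream_attr.accessPolicyWindow.hitRatio  = 1.0f;
    stream_attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    stream_attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                           &stream_attr);
}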
However, the accessPolicyWindow method does not offer fine-grained control over L2 persistence windows (which I require for my actual use case). I therefore tried cuda::apply_access_property as a substitute, but with it I no longer see any performance improvement, neither in runtime nor in DRAM read traffic. I used the following kernel to pin the array to the L2 cache:
// Requires <cooperative_groups.h> and libcu++'s <cuda/annotated_ptr>.
__global__ void pin(int* ptr, size_t N) {
    auto g = cooperative_groups::this_grid();
    // Grid-stride loop: request the persisting property for each int in [ptr, ptr + N).
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        cuda::apply_access_property(ptr + idx, sizeof(int),
                                    cuda::access_property::persisting{});
    }
}
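The pin kernel is launched once before the timed reset_data runs, roughly like this (the grid and block dimensions here are placeholders, not my exact configuration):

// Illustrative launch order; dimensions are placeholders.
int const block_size{256};
int const pin_grid{static_cast<int>((lut_persistent_size + block_size - 1) / block_size)};

pin<<<pin_grid, block_size>>>(lut_persistent, lut_persistent_size);
cudaDeviceSynchronize();  // ensure pinning has completed before timing

// ... then launch reset_data exactly as in the accessPolicyWindow experiment.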
As an alternative, I also tested the prefetch PTX instruction that cuda::apply_access_property uses internally as follows:
#define L2_LINE_SIZE 128

// Requires <cstdint> for uint8_t. As written, every thread prefetches the whole range.
__global__ void set_l2_persistence_ptx(void* base_ptr, size_t num_bytes) {
    for (size_t i = 0; i < num_bytes; i += L2_LINE_SIZE) {
        // Hint the L2 to keep this cache line with the evict_last (persisting) policy.
        asm volatile("prefetch.global.L2::evict_last [%0];"
                     :: "l"((uint8_t*)base_ptr + i));
    }
}
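A minimal launch sketch for this kernel (since each thread walks the whole range, a single thread is enough; variable names follow the earlier snippets):

// Single-thread launch: one thread issues prefetches for the whole lut_persistent range.
set_l2_persistence_ptx<<<1, 1>>>(lut_persistent,
                                 lut_persistent_size * sizeof(int));
cudaDeviceSynchronize();  // make sure the prefetches are issued before the timed kernel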
However, I still don't see any improvement.
I would really appreciate it if someone with more experience in L2 residency control could explain these observations.