Understanding Caching/Flushing Behavior/Performance in computeprof for Kepler

Hi everyone,

UPDATED: Attached actual .cu and .xls files. Removed partial computeprof results as a result, leaving code segment in case people find that easier to look through.

I’m running my experiments on a GTX 680 (Kepler-class) GPU with CUDA 5.0. Specifically, I’m running some small microbenchmarks to test the caching/flushing for Kepler between kernels. The kernel I’m running is listed below. All it does it increment a location in a global array per thread in each of the 2 kernels, which are called one after another.

Compilation/Runtime Info:
The code is compiled with -arch sm_35 and optimization level -O0 (to avoid anything being optimized and thus affecting the behavior). The computeprof experiment I ran uses 64 KB threads/data, since this should fit inside the L2 cache, which is 128 KB (and those avoid getting statistics skewed by have more data than can fit in the L2 cache).

Code Explanation:
Because Kepler doesn’t cache global accesses in L1, with the appropriate array size, all the requests (to arrA) should fit in L2, so there should be a large amounts of hits in the second kernel if that the L2 cache is not flushed in between kernel invocations.

I’ve run into a few things I don’t understand when looking at the outputted metric data in computeprof (partially listed below). I was hoping someone might have some insight into these:

  1. Why is the hit rate for the first kernel 100%? Given that these arrays haven’t been accessed before, it seems like this kernel must experience misses … is it related to the fact that computeprof is running the kernel multiple times to collect each of the different metrics?

  2. Why are all of my loads listed as uncached (uncached global load transaction == gld request)? Based on the info from nvprof --query-events, the uncached label seems to mean that my loads aren’t being cached at all; if this is true, then why would columns in the computeprof output such as ‘l2 subp2 read l1 hit sectors’ (or queries) have any numbers … shouldn’t there be no hits or queries in the L2 if the requests are uncached?

  3. Similarly, why doesn’t ‘l2 subp2 read sector queries’ == ‘l2 subp2 read sector misses’ + ‘l2 subp2 read l1 hit sectors’ (+ texture counters, except those are 0 in this case)?


// kernel2 is exactly same code, not pasted for brevity
global void kernel1(unsigned int * arrA)
const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
// small computation on A to avoid getting optimized out
arrA[index] += COUNT;

// …

// relevant host code
N = atoi(argv[1]); // set to 65536 (64K) for this run
unsigned int * h_a, * d_a;
// …
dim3 threadsPerBlock(THREADS_PER_BLOCK, 1, 1);
dim3 numBlocks(N/THREADS_PER_BLOCK, 1, 1);

kernel1<<<numBlocks, threadsPerBlock>>>(d_a);

kernel2<<<numBlocks, threadsPerBlock>>>(d_a);
flushTest.cu (3.45 KB)
flushTest-kepler.xls (12.5 KB)

Note that Kepler never flushes L2 (apart from device initialization), because it is integrated directly into the memory controllers and always coherent with the off-chip memory (even CPU memory accesses go via the L2 cache).

Thanks tera! Can you point me to documentation (etc.) that talks about this?


Unfortunately there’s not much documentation around on there details.

Paulius Micikevicius’ GPU Performance Analysis and Optimization talk has some info, particularly that “all accesses to global memory go through L2, including CPU and peer GPU”.

Hmm, interesting. So would that mean that when the data is initially copied (cudaMemcpy) to the device, that it goes directly into the L2? And then, when we’re memcpy’ing back to the CPU, that also goes through the L2? If so, that seems really strange, since the memcpy’s are usually very large (larger than the L2).


Actually, going though the L2 makes a lot of sense when you consider that the PCIe controller is located directly on the GPU processor. Hence, any transfer goes through the standard GPU memory subsystem, and will go through the L2 just like any other memory transaction.

I came across this conversation and the issues brought up here were almost the same as what I had. Although the l2 cache and memory controller architecture you described is coherent with my observations and explains my issues, it seems a bit inefficient. Assuming we are transferring large amount of data from Host memory to GPU dram. According to what you said l2 cache works as buffer and cache all chunks of input data at least once and probably what remains in l2 cache at the end is the last chunks of data. Now the question is, doesn’t this increase the memory system power consumption for nothing?