Understanding Caching/Flushing Behavior/Performance in computeprof for Kepler

msinclair · May 23, 2013, 7:20pm

Hi everyone,

UPDATED: Attached actual .cu and .xls files. Removed partial computeprof results as a result, leaving code segment in case people find that easier to look through.

I’m running my experiments on a GTX 680 (Kepler-class) GPU with CUDA 5.0. Specifically, I’m running some small microbenchmarks to test the caching/flushing for Kepler between kernels. The kernel I’m running is listed below. All it does it increment a location in a global array per thread in each of the 2 kernels, which are called one after another.

Compilation/Runtime Info:
The code is compiled with -arch sm_35 and optimization level -O0 (to avoid anything being optimized and thus affecting the behavior). The computeprof experiment I ran uses 64 KB threads/data, since this should fit inside the L2 cache, which is 128 KB (and those avoid getting statistics skewed by have more data than can fit in the L2 cache).

Code Explanation:
Because Kepler doesn’t cache global accesses in L1, with the appropriate array size, all the requests (to arrA) should fit in L2, so there should be a large amounts of hits in the second kernel if that the L2 cache is not flushed in between kernel invocations.

I’ve run into a few things I don’t understand when looking at the outputted metric data in computeprof (partially listed below). I was hoping someone might have some insight into these:

Why is the hit rate for the first kernel 100%? Given that these arrays haven’t been accessed before, it seems like this kernel must experience misses … is it related to the fact that computeprof is running the kernel multiple times to collect each of the different metrics?
Why are all of my loads listed as uncached (uncached global load transaction == gld request)? Based on the info from nvprof --query-events, the uncached label seems to mean that my loads aren’t being cached at all; if this is true, then why would columns in the computeprof output such as ‘l2 subp2 read l1 hit sectors’ (or queries) have any numbers … shouldn’t there be no hits or queries in the L2 if the requests are uncached?
Similarly, why doesn’t ‘l2 subp2 read sector queries’ == ‘l2 subp2 read sector misses’ + ‘l2 subp2 read l1 hit sectors’ (+ texture counters, except those are 0 in this case)?

Thanks,
Matt

// kernel2 is exactly same code, not pasted for brevity
global void kernel1(unsigned int * arrA)
{
const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
// small computation on A to avoid getting optimized out
arrA[index] += COUNT;
}

// …

// relevant host code
N = atoi(argv[1]); // set to 65536 (64K) for this run
unsigned int * h_a, * d_a;
// …
#define THREADS_PER_BLOCK 128
dim3 threadsPerBlock(THREADS_PER_BLOCK, 1, 1);
dim3 numBlocks(N/THREADS_PER_BLOCK, 1, 1);

kernel1<<<numBlocks, threadsPerBlock>>>(d_a);
cudaThreadSynchronize();

kernel2<<<numBlocks, threadsPerBlock>>>(d_a);
cudaThreadSynchronize();
flushTest.cu (3.45 KB)
flushTest-kepler.xls (12.5 KB)

tera · May 24, 2013, 8:11pm

Note that Kepler never flushes L2 (apart from device initialization), because it is integrated directly into the memory controllers and always coherent with the off-chip memory (even CPU memory accesses go via the L2 cache).

msinclair · May 24, 2013, 8:26pm

Thanks tera! Can you point me to documentation (etc.) that talks about this?

Matt

tera · May 24, 2013, 8:33pm

Unfortunately there’s not much documentation around on there details.

Paulius Micikevicius’ GPU Performance Analysis and Optimization talk has some info, particularly that “all accesses to global memory go through L2, including CPU and peer GPU”.

msinclair · May 24, 2013, 10:27pm

Hmm, interesting. So would that mean that when the data is initially copied (cudaMemcpy) to the device, that it goes directly into the L2? And then, when we’re memcpy’ing back to the CPU, that also goes through the L2? If so, that seems really strange, since the memcpy’s are usually very large (larger than the L2).

Thanks,
Matt

Keldor314 · May 27, 2013, 1:44am

Actually, going though the L2 makes a lot of sense when you consider that the PCIe controller is located directly on the GPU processor. Hence, any transfer goes through the standard GPU memory subsystem, and will go through the L2 just like any other memory transaction.

_EBAD · September 19, 2014, 12:13am

Hi,
I came across this conversation and the issues brought up here were almost the same as what I had. Although the l2 cache and memory controller architecture you described is coherent with my observations and explains my issues, it seems a bit inefficient. Assuming we are transferring large amount of data from Host memory to GPU dram. According to what you said l2 cache works as buffer and cache all chunks of input data at least once and probably what remains in l2 cache at the end is the last chunks of data. Now the question is, doesn’t this increase the memory system power consumption for nothing?

Topic		Replies	Views
L1 and L2 cache hit rate CUDA Programming and Performance	8	6587	February 3, 2016
Difference between L2 read/write transactions and L2_L1 read/write transactions ? CUDA Programming and Performance	3	1467	August 28, 2019
Cache L1 and L2 Architecture Kepler CUDA Programming and Performance	2	3185	December 30, 2019
cudaMemcpy() and L2 cache. CUDA Programming and Performance	9	3486	May 6, 2023
Tesla K40 L2 bandwidth CUDA Programming and Performance	12	4041	December 23, 2015
Understanding the functioning of nvprof and .cv data load option CUDA Programming and Performance	8	3079	December 11, 2014
Question on the L1 caching of the GK 110 CUDA Programming and Performance	17	7157	April 17, 2013
L2 cache (.cg) memory load performance CUDA Programming and Performance	6	1702	January 5, 2017
Memory transaction size CUDA Programming and Performance	1	1736	February 12, 2017
K20 vs C2075 L1 cache CUDA Programming and Performance	4	1378	July 11, 2013

Understanding Caching/Flushing Behavior/Performance in computeprof for Kepler

Related topics