UPDATED: Attached the actual .cu and .xls files and removed the partial computeprof results, leaving the code segment below in case people find that easier to look through.
I’m running my experiments on a GTX 680 (Kepler-class) GPU with CUDA 5.0. Specifically, I’m running some small microbenchmarks to test the caching/flushing behavior on Kepler between kernels. The kernel I’m running is listed below. All it does is increment a location in a global array per thread, in each of the two kernels, which are called one after the other.
The code is compiled with -arch=sm_35 and optimization level -O0 (to keep anything from being optimized away and thus affecting the behavior). The computeprof experiment I ran uses 64K threads/data elements, since this data (64K × 4 B = 256 KB) should fit inside the L2 cache, which is 512 KB on the GTX 680 (and this avoids the statistics getting skewed by having more data than can fit in the L2 cache).
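For reference, the build invocation is simply the following (the file name here is made up; the flags are the ones described above):

```
nvcc -arch=sm_35 -O0 -o l2_microbench l2_microbench.cu
```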
Because Kepler doesn’t cache global accesses in L1, with an appropriately sized array all the requests (to arrA) should fit in L2, so the second kernel should see a large number of hits if the L2 cache is not flushed between kernel invocations.
I’ve run into a few things I don’t understand when looking at the metric data computeprof outputs (partially listed below). I was hoping someone might have some insight into these:
Why is the hit rate for the first kernel 100%? Given that this array hasn’t been accessed before, the first kernel should experience misses. Is it related to the fact that computeprof runs the kernel multiple times to collect the different metrics?
Why are all of my loads listed as uncached (uncached global load transaction == gld request)? Based on the info from nvprof --query-events, the ‘uncached’ label seems to mean that my loads aren’t being cached at all. If that’s true, why do columns in the computeprof output such as ‘l2 subp2 read l1 hit sectors’ (or queries) have any numbers? Shouldn’t there be no hits or queries in the L2 if the requests are uncached?
Similarly, why doesn’t ‘l2 subp2 read sector queries’ equal ‘l2 subp2 read sector misses’ + ‘l2 subp2 read l1 hit sectors’ (+ the texture counters, except those are 0 in this case)?
// kernel2 is exactly the same code; not pasted for brevity
__global__ void kernel1(unsigned int * arrA)
{
    const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    // small computation on A to avoid getting optimized out
    arrA[index] += COUNT;
}

// relevant host code
N = atoi(argv[1]); // set to 65536 (64K) for this run
unsigned int * h_a, * d_a;
#define THREADS_PER_BLOCK 128
dim3 threadsPerBlock(THREADS_PER_BLOCK, 1, 1);
dim3 numBlocks(N/THREADS_PER_BLOCK, 1, 1);
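For completeness, here is roughly how the elided host code ties together, as a minimal sketch (error checking omitted, COUNT chosen arbitrarily; the profiling itself is driven externally by computeprof, so there is no timing code here):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128

__global__ void kernel1(unsigned int * arrA);  // as defined above
__global__ void kernel2(unsigned int * arrA);  // identical body

int main(int argc, char ** argv) {
    int N = atoi(argv[1]);  // 65536 for the run described above
    size_t bytes = N * sizeof(unsigned int);

    unsigned int * h_a = (unsigned int *)calloc(N, sizeof(unsigned int));
    unsigned int * d_a;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(THREADS_PER_BLOCK, 1, 1);
    dim3 numBlocks(N / THREADS_PER_BLOCK, 1, 1);

    // back-to-back launches: kernel2's loads should hit in L2
    // if the cache is not flushed between kernel invocations
    kernel1<<<numBlocks, threadsPerBlock>>>(d_a);
    kernel2<<<numBlocks, threadsPerBlock>>>(d_a);
    cudaDeviceSynchronize();

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```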