Computing global memory coalescing using nvprof events


I have been playing around with Coalesced memory the last couple of days. I am trying to implement the CUDA OffsetCopy kernel from the Best Practice guide and using the nvprof events to compute if my access pattern is coalesced.
From Paulius’ talk at GTC (GPU Performance Analysis and Optimization) he states that a good coalesced access pattern for doubles should result in a ratio of 2 in the following equation

(l1_global_load_miss + l1_global_load_hit)/gld_requested = 2 (if using double and access is coalesced)

This is also stated in a lot of CUDA documentation.
A different number should mean that the access is not optimal. So for the simple OffsetCopy test from the programming guide when setting the offset to zero the above ratio should be 2 (as I understand the lecture and docs).

I have shared my own version of the kernel at my github account:

The copy kernel:
global void cuLoadStoreElement(real *M_in, real *M_out, int StoreMat, int offset) {

int tx = threadIdx.x; int ty = threadIdx.y;
int bx = blockIdx.x; int by = blockIdx.y;

int Ix = bx * blockDim.x + tx;

// Create linear index
int Iin = Ix + offset;
int Iout = Ix + offset;

// Load value from global and store it
M_out[Iout] = M_in[Iin];

When I run a 1D test (i.e. a linear array of lenght 4096 elements) with a zero offset, 128 threads pr. block and 1 blocks pr. grid the following results are reported by nvprof using the command line: nvprof --devices 0 --profile-from-start-off --events **** main

l1_global_load_miss 28
l1_global_load_hit 0
gld_requested 4
ratio (28 + 0)/4 = 7.0 (Which I have computed)

Here I am only profiling the copy kernel and nothing else. So I have 128 threads pr. block resulting in four warps of 32 threads each requesting to transfer 8 byte resulting in a total transfer of 1024 byte or 8 cache lines (when a cache line is 128 byte on the M2070 I use). I expect a zero hit rate because there should be no reuse of values and therefore nothing to cache (which is also what I get).

My question: I expect to transfer 8 cache lines from four warps each giving one gld_request (giving a 2 ratio) but this is not what the profiler is telling me. Am I understanding the events wrong or why is more data being transferred then the theory dictates? Is this a too literal test of the theory and how accurate it he ratio then to access pattern?

Note: If I run my code through NVVP with a zero offset I am told there is no problems with the kernel (and getting the same numbers as with nvprof) but if I change the offset to 1, NVVP will tell me that the memory is not coalesced. So NVVP is doing the right thing but how does the profiler figure out that the memory is not coalesced? It should compute the same ratio of 7.0 as I get with nvprof.

I had some time today to look a bit more into the problem and if you extend the test from copying only 8 cache lines (Using 128 threads in 4 warps) to launching 1000 blocks pr. grid of 128 threads pr. blocks the ratio changes to

Global load request: 4000
L1 Hit: 0 L1 Miss: 8148
Actual loads: 8148 L1 hit rate 0%
Ratio: 2.03

Which is now much closer to the “perfect” ratio of 2 for doubles.

It seems that the GPU is not able to give accurate profiling events if working on very small problems (only a couple of cache lines) but the method will work when running several thousand cache lines.