Hi,
I have been playing around with coalesced memory access for the last couple of days. I am trying to implement the OffsetCopy kernel from the CUDA Best Practices Guide and to use nvprof events to determine whether my access pattern is coalesced.
In Paulius’ GTC talk (GPU Performance Analysis and Optimization) he states that a well-coalesced access pattern for doubles should give a ratio of 2 in the following equation:
(l1_global_load_miss + l1_global_load_hit)/gld_requested = 2 (if using double and access is coalesced)
This is also stated in much of the CUDA documentation.
A different value should mean that the access is not optimal. So for the simple OffsetCopy test from the programming guide, with the offset set to zero, the above ratio should be 2 (as I understand the lecture and the docs).
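To spell out why I expect exactly 2 for doubles (this is just my own per-warp arithmetic, not something quoted from the talk):

[code]
32 threads/warp * 8 bytes/double   = 256 bytes requested per warp
256 bytes / 128 bytes per L1 line  = 2 transactions (misses + hits) per warp
2 transactions / 1 gld_requested   = ratio of 2
[/code]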
I have shared my own version of the kernel at my github account: GitHub - LinuxChristian/simpleMemorySetup: A simple test setup to run different memory alignments on CUDA devices.
The copy kernel:
[code]
__global__ void cuLoadStoreElement(real *M_in, real *M_out, int StoreMat, int offset) {
    int tx = threadIdx.x; int ty = threadIdx.y;
    int bx = blockIdx.x; int by = blockIdx.y;

    // Create linear index (1D test, so ty/by are unused here)
    int Ix   = bx * blockDim.x + tx;
    int Iin  = Ix + offset;
    int Iout = Ix + offset;

    // Load value from global memory and store it straight back
    M_out[Iout] = M_in[Iin];
}
[/code]
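For context, this is roughly how I drive the kernel for the 1D test described below (a minimal sketch, not the exact code from my repo; it assumes real is typedef'd to double, with the typedef of course appearing before the kernel definition in the same file):

[code]
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

typedef double real;   // assumption: real is a typedef for double in my setup

int main(void) {
    const int N      = 4096;  // length of the linear array
    const int offset = 0;     // 0 = aligned test, 1 = shifted by one element

    real *d_in, *d_out;
    cudaMalloc((void **)&d_in,  (N + offset) * sizeof(real));
    cudaMalloc((void **)&d_out, (N + offset) * sizeof(real));

    // Profile only the copy kernel (matches --profile-from-start-off)
    cudaProfilerStart();
    // 128 threads per block, 1 block per grid -> 4 warps, 128 elements copied
    cuLoadStoreElement<<<1, 128>>>(d_in, d_out, 0, offset);
    cudaDeviceSynchronize();
    cudaProfilerStop();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
[/code]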
When I run a 1D test (i.e. a linear array of length 4096 elements) with a zero offset, 128 threads per block and 1 block per grid, nvprof reports the following results, using the command line: nvprof --devices 0 --profile-from-start-off --events **** main
l1_global_load_miss 28
l1_global_load_hit 0
gld_requested 4
ratio: (28 + 0)/4 = 7.0 (the ratio itself is my own calculation from the counters)
Here I am only profiling the copy kernel and nothing else. So I have 128 threads per block, i.e. four warps of 32 threads, each thread requesting an 8-byte transfer, for a total of 1024 bytes or 8 cache lines (a cache line being 128 bytes on the M2070 I use). I expect a zero hit rate because there should be no reuse of values and therefore nothing to cache (which is also what I get).
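To put my expectation next to what nvprof reports (the "expected" row is just my reasoning above, not anything the tools print):

[code]
             gld_requested   l1 miss + hit   ratio
expected           4               8          2.0   (two 128-byte lines per warp)
measured           4              28          7.0
[/code]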
My question: I expect four warps to transfer 8 cache lines, with each warp issuing one gld_request (giving a ratio of 2), but this is not what the profiler is telling me. Am I misunderstanding the events, or why is more data being transferred than the theory dictates? Is this too literal a test of the theory, and how accurately does the ratio actually reflect the access pattern?
Note: If I run my code through NVVP with a zero offset I am told there are no problems with the kernel (and I get the same numbers as with nvprof), but if I change the offset to 1, NVVP tells me that the memory access is not coalesced. So NVVP is doing the right thing, but how does it figure out that the access is not coalesced? For offset zero it should compute the same ratio of 7.0 that I get with nvprof.