that seems like a reversal of your previous position:
I don’t understand this statement at all:
All data traffic to main memory (global device memory) flows through the L2 cache. If the L2 cache reports higher or lower efficiency (e.g. read throughput), that will most certainly show up as a performance effect in a memory-bound code. Rather than trying to come up with reasons why we should discount the profiler data, I’d rather use the profiler data to inform possible theories that might explain it.
So the profiler output is not a complete answer, but I believe it is a useful starting point to develop hypotheses.
I don’t have a solidly tested theory, but I would start with an additional assumption that the L2 cache is not fully associative (e.g. perhaps it is 4-way or 8-way set associative, or something like that). I do not know this to be true, but I think microbenchmarking can support or disprove the assumption, and in my experience fully associative cache designs are rare. In addition, we know that covering a very large address space (large data set) will involve the TLB. The memory footprint for your test is ~1GB, which is probably on the “edge” of where TLB inefficiency may come into play for strided access patterns.
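If you want to test the associativity/TLB assumption directly, a pointer-chase microbenchmark is the usual tool. Here is a rough sketch of what I have in mind (entirely my own construction, not something I have run on a K80; the kernel name, the 256MB chain, the 128-byte stride, and the hop count are arbitrary choices you would want to vary):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Single thread chases a chain of indices; each load depends on the previous one,
// so the measured time is pure latency, not bandwidth.
__global__ void chase(const unsigned int *next, size_t hops,
                      unsigned long long *cycles, unsigned int *sink)
{
    unsigned int p = 0;
    unsigned long long t0 = clock64();
    for (size_t h = 0; h < hops; h++)
        p = next[p];                          // serially dependent loads
    unsigned long long t1 = clock64();
    *cycles = t1 - t0;
    *sink = p;                                // keep the chain from being optimized away
}

int main()
{
    const size_t n = 64ull * 1024 * 1024;     // 64M chain entries (256MB footprint)
    const unsigned int strideElems = 32;      // 128-byte hops; vary this and n
    unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
    for (size_t i = 0; i < n; i++)            // ring: i -> (i + stride) mod n
        h[i] = (unsigned int)((i + strideElems) % n);

    unsigned int *next, *sink;
    unsigned long long *cycles;
    cudaMalloc(&next, n * sizeof(unsigned int));
    cudaMalloc(&sink, sizeof(unsigned int));
    cudaMalloc(&cycles, sizeof(unsigned long long));
    cudaMemcpy(next, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const size_t hops = 1 << 18;              // enough hops to average out noise
    chase<<<1, 1>>>(next, hops, cycles, sink);
    unsigned long long c = 0;
    cudaMemcpy(&c, cycles, sizeof(c), cudaMemcpyDeviceToHost);
    printf("%.1f cycles per hop\n", (double)c / hops);
    free(h);
    return 0;
}
```

Sweeping the stride and the footprint and watching where the cycles-per-hop number jumps is the standard way to tease out cache capacity, set conflicts, and TLB reach.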
So what are the actual access patterns, and do they differ for the different grid size configurations?
Each thread is following a grid-striding loop. For the largest grid, the grid width is 12582912 threads, so each thread in the grid will stride 10 times, taking hops that are 1/10 of the data set size. For the smallest grid, each thread will take approximately 6400 hops. Furthermore, we must consider that warps across the grid will inevitably get widely out of sync with each other, and also that threadblocks will retire and allow new threadblocks to start in the largest-grid case, but maybe not so much, or at all, in the smallest-grid case. In the smallest-grid case, a K80 GPU with 13 SMs and its extra-large register file might be able to support 8 of your threadblocks at once per SM (8*192 = 1536 < 2048 threads per SM), which would account for the entire grid being resident for most of the duration of the kernel execution (8*13 = 104 blocks, which is greater than 102).
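For reference, this is the shape of grid-stride saxpy kernel I am assuming (your actual code may differ; the 192-thread block size and the 65536/102 block counts are inferred from the numbers above, not taken from your code):

```
__global__ void saxpy(size_t n, float a, const float *x, float *y)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;           // hop size = total grid width
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// With 192-thread blocks (my inference from the arithmetic above):
//   saxpy<<<65536, 192>>>(n, a, x, y);   // 12582912 threads -> ~10 hops per thread
//   saxpy<<<  102, 192>>>(n, a, x, y);   // 19584 threads    -> ~6400 hops per thread
```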
My conjecture is that when you roll this chaotic access pattern together, the larger grid results in more efficient use of the L2 cache, whether due to cache set associativity, TLB patterns, or both. Again, I have not connected all the dots, but this theory is at least consistent with the profiler data. (Normally I wouldn’t expect TLB patterns to have much of an effect for a ~1GB footprint, but you are asking for an explanation of a 7% performance difference here, so I wouldn’t rule it out as a possible small contributor.)
It may seem counterintuitive, but I suspect that the block retirement that occurs regularly with the larger grid size (coupled with a coarse grid-striding loop) may actually lead to a more organized access pattern, on average, than the case where the entire grid is resident on the GPU and the pattern can become almost completely chaotic.
Or feel free to advance your own theory. If your theory discounts the L2 profiler data, which lines up nicely with the observed perf difference, I will be skeptical.
Regarding your “dummy” case, I discount that data. It’s not reflective of what to expect in a memory-bound code. The GPU is a latency-hiding machine, and this includes nearly all machine latencies you can imagine, even the latency associated with launching a large number of blocks. Your dummy case is not memory bound. Your saxpy case is. Even if the differences evident in the dummy case manifested in the saxpy case, that difference is on the order of 0.2 ms, whereas the 7% difference you’re chasing is on the order of 0.7 ms, so something else must be at work, even if the threadblock launch latency is a factor.
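If you want to bound the threadblock launch latency yourself, something along these lines would time the empty kernel and the saxpy kernel back to back at both grid sizes (again, the array size and launch configurations are my inferences from the numbers above, not your actual code):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}                    // "dummy" case: launch cost only, no memory traffic

__global__ void saxpy(size_t n, float a, const float *x, float *y)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const size_t n = 125829120;                // ~1GB total footprint across the two arrays
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int grids[2] = {102, 65536};               // smallest and largest grid, 192-thread blocks

    for (int g = 0; g < 2; g++) {
        float ms;
        cudaEventRecord(t0);
        dummy<<<grids[g], 192>>>();
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%6d blocks   dummy: %.3f ms", grids[g], ms);

        cudaEventRecord(t0);
        saxpy<<<grids[g], 192>>>(n, 2.0f, x, y);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("   saxpy: %.3f ms\n", ms);
    }
    return 0;
}
```

My expectation is that the dummy-kernel difference between the two grid sizes will be small compared to the saxpy difference, which is the point above about launch latency not explaining the 7%.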