blockDim.x of 2 and blockDim.x of 6 are both probably lousy choices.
An authoritative answer cannot be given IMO without understanding how you compute x and y indexes in kernel code. However I will assume the “canonical”:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
Let’s also suppose you have a “two-dimensional” data set, and perhaps you are doing an operation like:
out[idy*width+idx] = in[idy*width+idx];
A blockDim.x of 2 and blockDim.y of 6 means you have 2 threads in the x direction and 6 in the y direction. (A total of 12 threads per block is bad choice performance-wise). The only warp in that block will have
2 threads that read from row q
2 threads that read from row q+1
2 threads that read from row q+2
2 threads that read from row q+5
The GPU cache lines are organized horizontally or row-wise in memory. Those first two threads above will force a cacheline load of at least 32 quantities and maybe 128 or more quantities, yet that warp only needs 2 of them. The next two threads will force another, different cacheline load, but only need 2 of the loaded quantities. And so on.
Now work the same description with the other case.
Then answer these 2 questions:
- How many cachelines had to be loaded per block in each case?
- Per block, what was the efficiency (bytes used/bytes loaded) in each case?
Hopefully that does it for you.
Yes, huge caches with zero inefficiencies (evictions, etc.) should mostly calm this down. But GPUs provide no guarantee of thread or threadblock execution order, so for large data sets (that exceed the cache size, and remember the L1 cache is “pretty small”) could easily thrash with poor block sizing choices like these.
It’s a lousy way to design GPU code (either one of those choices). Design as if caches did not exist. (memory still gets loaded segment by segment, as if there were 32-byte cachelines, even when no caching occurs). Ideally the caches are there to fix “small” problems that are hard to design out of your code. Don’t depend on them for the problems that are easy to design out.