With the concept of 32B sectors (the unit of data moved) and 128B cache lines, it seems that coalesced memory access leverages the GPU’s ability to gather four sector-sized regions of data into one cache line.
I was wondering whether coalesced memory access performs better when the four sectors of a cache line are also located sequentially in global memory, compared to accessing four sectors that are scattered across global memory.
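To make the two patterns concrete, here is a minimal CUDA sketch (kernel and array names are just placeholders): in the first kernel a warp’s 32 consecutive 4-byte loads fall into the four consecutive 32-byte sectors of one 128-byte line, while in the second an index array can send each thread to an unrelated location.

```
// Coalesced: thread i reads element i, so a warp's loads map onto
// four consecutive 32-byte sectors of the same 128-byte line.
__global__ void coalesced(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Scattered: the index array may send each thread to an unrelated address,
// so the warp can touch up to 32 different sectors in 32 different lines.
__global__ void scattered(const float* __restrict__ in, const int* __restrict__ idx,
                          float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}
```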
There is potentially a benefit, because although each cache line consists of 4 sectors, there is only one cache tag per line. Once a sector is assigned to a particular place in memory, the other sectors in that line cannot be assigned to “arbitrary/random” places in memory.
So if your cacheable “footprint” is less than 1/4 of the cache size, this is unlikely to matter (I think), and “random”/“scattered” 32-byte cached sectors should be OK. But if your cacheable footprint exceeds 1/4 of the cache size (4 sectors/line), then I think it’s quite possible that adjacency of access will improve cache utilization, with a corresponding possibility of performance improvement.
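A rough way to see where that 1/4 boundary falls on a particular device is to query the L2 size. This is just a sketch, assuming the 4-sectors-per-128-byte-line organization discussed above:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;
    int l2Bytes = 0;
    cudaGetDevice(&device);
    // L2 cache size in bytes for the current device
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, device);

    // With 128-byte lines holding 4 x 32-byte sectors, a fully "scattered"
    // pattern that touches only one sector per line can effectively use
    // at most ~1/4 of the cache's data capacity.
    printf("L2 size: %d bytes\n", l2Bytes);
    printf("Rough footprint threshold for scattered 32B sectors: %d bytes\n",
           l2Bytes / 4);
    return 0;
}
```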
The main benefits of a sectored cache are:
(1) Reduction in tag storage (useful when tags are on-chip but data is stored off-chip)
(2) Reduction of load bandwidth and/or improvement in latency in case of a miss
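To put a rough number on (1): with 128-byte lines split into four 32-byte sectors, one tag covers 128 bytes of data, whereas a non-sectored cache built from 32-byte lines would need one tag per 32 bytes, i.e. four times as many tags for the same capacity. A tiny sketch of that arithmetic (the cache size here is arbitrary, purely for illustration):

```
constexpr int cacheBytes  = 4 * 1024 * 1024;  // hypothetical 4 MiB cache
constexpr int lineBytes   = 128;              // 4 sectors x 32 bytes
constexpr int sectorBytes = 32;

constexpr int tagsSectored    = cacheBytes / lineBytes;    // one tag per 128B line
constexpr int tagsNonSectored = cacheBytes / sectorBytes;  // one tag per 32B line

static_assert(tagsNonSectored == 4 * tagsSectored,
              "sectoring cuts the tag count by 4x for the same data capacity");
```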
As I understand it, the sectored caches used in GPUs give back some of benefit (2) by implementing “adjacent sector prefetch” (there is probably a better term for this that escapes me at the moment).
That certainly would have to be one of the top-level benefits. Since the early days of CUDA (since the advent of the Fermi architecture) this idea has been part of CUDA training materials. You can find references (e.g. slides 23-24) suggesting the use of -Xptxas -dlcm=cg to “skip” the L1 and load via the L2 only, because the L2 had the sectored arrangement that Fermi (and Kepler, and I think Maxwell) did not have in the L1, so it results in better bus utilization under heavy load in the “scattered” case.
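For what it’s worth, a per-load way to express the same idea in source, rather than compiling the whole file with -Xptxas -dlcm=cg, is the __ldcg() intrinsic (cache at the global/L2 level). A minimal sketch with placeholder kernel and array names:

```
// Compile-time alternative: nvcc -Xptxas -dlcm=cg <file>.cu makes *all*
// global loads in the file bypass L1 and be cached in L2 only (ld.global.cg).
__global__ void gatherCG(const float* __restrict__ in, const int* __restrict__ idx,
                         float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // __ldcg: load cached at the global (L2) level, skipping L1,
        // which is what -dlcm=cg does for every global load in the file.
        out[i] = __ldcg(&in[idx[i]]);
}
```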