Clarification on the width of Coalesced Memory access for Ampere arch

Given 32 B sectors (the unit of data moved) and 128 B cache lines, it seems that coalesced memory access leverages the GPU’s ability to gather four sector-sized regions of data into one cache line.

I was wondering whether coalesced memory access performs better when the four sectors of a cache line are also located sequentially in global memory, as compared to accessing four sectors that are scattered across global memory.

There is potentially a benefit, because although each cacheline consists of 4 sectors, there is only one cache tag per line. Once a sector is assigned to a particular place in memory, the other sectors in that line cannot be assigned to “arbitrary/random” places in memory.

So if your cacheable “footprint” is less than 1/4 of the cache size, this is unlikely to matter (I think) and “random”/“scattered” 32-byte cached sectors should be OK. But if your cacheable footprint exceeds 1/4 of the cache size (4 sectors/line), then I think it’s quite possible that adjacency of access will improve cache utilization, with the corresponding possibility of performance improvement.
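To make the 1/4 argument concrete, here is a minimal host-side sketch (plain C++, not a model of any real GPU cache). It assumes only what is stated above: 32-byte sectors, 128-byte lines, and one tag per line. The names `lines_needed` and `make_sectors` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

constexpr std::uint64_t kSectorBytes = 32;   // sector size discussed in this thread
constexpr std::uint64_t kLineBytes   = 128;  // cache line size (4 sectors, 1 tag)

// Count the distinct 128-byte lines (i.e. tags) needed to hold a set of
// sector start addresses. Each line can only hold sectors from its own
// 128-byte aligned region, so scattered sectors each consume a full tag.
std::size_t lines_needed(const std::vector<std::uint64_t>& sector_addrs) {
    std::unordered_set<std::uint64_t> lines;
    for (std::uint64_t addr : sector_addrs) {
        lines.insert(addr / kLineBytes);
    }
    return lines.size();
}

// Hypothetical helper: `count` sector addresses spaced `stride_bytes` apart.
// A stride of 32 gives adjacent sectors; a stride of 128 or more puts every
// sector in a different line.
std::vector<std::uint64_t> make_sectors(std::size_t count,
                                        std::uint64_t stride_bytes) {
    assert(stride_bytes % kSectorBytes == 0);  // keep addresses sector-aligned
    std::vector<std::uint64_t> addrs;
    addrs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        addrs.push_back(i * stride_bytes);
    }
    return addrs;
}
```

Under these assumptions, 1024 adjacent sectors (32 KB of data) occupy 256 lines, while 1024 sectors spaced 128 bytes apart occupy 1024 lines. The same 32 KB of useful data therefore ties up four times as many tags in the scattered case, which is where the 1/4 threshold comes from.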

Thank you for the prompt response.

If this is the case, I wonder what the motivation is behind adopting a sub-cache-line granularity.

The only benefit I can imagine is reducing the amount of data transferred (from 128B to 32, 64 or 96B) if the needed data is smaller than 128B.

Generally, the benefits of sectored caches are:

(1) Reduction in tag storage (useful when tags are on-chip but data is stored off-chip)
(2) Reduction of load bandwidth and / or improvement in latency in case of miss

As I understand it, the sectored caches used in GPUs reduce benefit (2) by implementing “adjacent sector prefetch” (there is probably a better term for this that escapes me at the moment).
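As a rough illustration of benefit (2), the sketch below compares how many bytes a miss would move when the fetch granularity is a 32-byte sector versus a full 128-byte line. It is plain host-side arithmetic, it ignores any adjacent-sector prefetch, and `bytes_fetched` is a hypothetical helper name, not an API.

```cpp
#include <cstdint>

constexpr std::uint64_t kSectorBytes = 32;   // sector granularity
constexpr std::uint64_t kLineBytes   = 128;  // full cache line granularity

// Bytes moved to satisfy a miss on the byte range [addr, addr + size),
// assuming data is fetched in aligned units of `granule` bytes.
constexpr std::uint64_t bytes_fetched(std::uint64_t addr,
                                      std::uint64_t size,
                                      std::uint64_t granule) {
    if (size == 0) {
        return 0;
    }
    const std::uint64_t first = addr / granule;               // first unit touched
    const std::uint64_t last  = (addr + size - 1) / granule;  // last unit touched
    return (last - first + 1) * granule;
}
```

For a 4-byte load at address 0, this gives 32 bytes at sector granularity and 128 bytes at line granularity. A 96-byte request starting at address 0 moves 96 bytes with sectors and 128 bytes with whole lines, which matches the 32, 64 or 96 B cases mentioned in the question.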

That certainly would have to be one of the top-level benefits. Since the early days of CUDA (since the advent of the Fermi architecture), this idea has been part of CUDA training materials. You can find references (e.g. slides 23-24) to suggestions to use -Xptxas -dlcm=cg to “skip” the L1 and load via the L2 only, because the L2 had the sectored arrangement that Fermi (and Kepler and I think Maxwell) did not have in the L1, and so it results in better bus utilization under heavy load in the “scattered” case.
