Use of L2 cache

How can I make full use of the L2 cache? I profiled a program and found a metric called lts__t_sectors_srcunit_ltcfabric. Does the traffic measured by this metric cause a severe decrease in memory access speed?

Also, as I try to use the L2 cache better by increasing data locality, how many MB of data should I try to keep resident? For the A100’s 40MB L2, would 32MB be a good choice to avoid eviction?

Or, if I need to avoid data transfer between the two L2 cache partitions, should 16MB be the maximum?

The A100 has an L2 cache architecture that is partitioned into two halves. When an access is needed from one half to the other, that access flows over the “fabric” and is measured by the metric you mentioned (it is also visible in the Nsight Compute GUI memory breakdown section).

There is no reason to assume that a particular level of activity on that fabric/metric is somehow a problem. If your analysis leads to the conclusion that it is a limiter to performance, then you are effectively cache-bound (a form of memory-bound), and AFAIK NVIDIA publishes no documentation that defines how that “fabric” is used, how the traffic across it manifests, or what could be done at a code level to try to modify that fabric traffic.

If you have an optimization strategy for general L2-bound cases, that would be the path to follow. The usual memory-bound optimization suggestions apply: make more efficient use of memory, reduce memory usage, reduce the footprint to match L1, etc.

Thanks! So how about this question: how many MB of data should I build locality around on the A100? Is 32MB small enough to prevent serious eviction? What about 64MB?

With MIG partitioning it could be possible to access the two halves separately.

With MIG, each instance’s processors have separate and isolated paths through the entire memory system - the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user’s workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces.

I would expect that the two A100 halves are natural MIG boundaries.

It of course depends on the application whether it can run with advantage on half the L2 cache size, but with two instances in parallel.

The L2 cache size is queryable as a device property. If you are aiming for a particular footprint, I see no reason not to aim for the actual cache size. If you know of some other streaming data that your kernel will use, then of course you could subtract that from the size if you wish, but you’ve provided no such information here. You could also do profiler-guided tuning perhaps (try different sizes and study cache hit rate metrics, or something like that.)
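
A minimal sketch of that query (assuming device 0; error checking omitted):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Total L2 size (40 MB on A100) and, on cc 8.0+, the maximum amount
    // that can be set aside for persisting accesses.
    printf("L2 cache size:               %d bytes\n", prop.l2CacheSize);
    printf("Max persisting L2 set-aside: %d bytes\n", prop.persistingL2CacheMaxSize);
    return 0;
}
```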

The A100 cache can also be partitioned by the user if you want to reserve specific sizes for specific data sets.
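
For instance, a hedged sketch of the set-aside mechanism on cc 8.0+ devices (the 30 MB figure is just an example, clamped to what the device allows):

```cpp
#include <algorithm>
#include <cuda_runtime.h>

void reserve_persisting_l2()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Carve out part of L2 for "persisting" accesses; normal/streaming
    // accesses then compete only for the remainder.
    size_t setAside = std::min<size_t>(30u * 1024 * 1024,
                                       (size_t)prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside);
}
```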

@Robert_Crovella I vaguely remember having read that there could be duplicate entries in both halves of the L2 cache. Can you confirm or deny?

This is mentioned on page 7 of the Citadel “Dissecting Ampere” pdf.

MIG at 1/2 GPU or smaller will avoid partition crossing.
In full-GPU MIG or non-MIG configuration, the virtual-to-physical address mapping is interleaved across the partitions and is not controllable by the user today. This results in values being cached in both the Point of Coherence L2 slice and the Local Caching Node L2 slice, effectively halving the L2 cache capacity when SMs on both sides of the GPU are accessing the data.

So if I only access less than 20MB of data through the L2 cache, will it be much faster than accessing 21MB of data? Does data transfer between the L2 partitions cost a lot of extra time?

Also, suppose I have X MB of data to load in total, with X1 MB per threadblock, and Y MB of data to write back in total, with each threadblock writing Y1 MB back to gmem. What will the behavior of the L2 cache be? (Assume 4 threadblocks per SM.)
Will the L2 cache just hold 4 * 108 * X1 + 4 * 108 * Y1 MB of data, or do I need to take X and Y into account?

Even if you were to provide access patterns, a cache's behavior is temporal, so I recommend you write the code and profile. If your working set is larger than the cache, you will likely see a drop-off after 1/2 the cache size, but the slope of that drop-off depends on the access patterns over time. CUDA provides an API for controlling L2 access properties. These can help hold frequently accessed data in L2 while marking data that is read only once (aka “streamed”) to be preferentially evicted first.
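
A minimal sketch of that API, with hypothetical names (d_hot, hotBytes) for a buffer you want to keep resident; it pairs with the set-aside shown earlier in the thread:

```cpp
#include <cuda_runtime.h>

void mark_hot_data_persisting(cudaStream_t stream, void* d_hot, size_t hotBytes)
{
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_hot;
    attr.accessPolicyWindow.num_bytes = hotBytes;  // must not exceed cudaDevAttrMaxAccessPolicyWindowSize
    attr.accessPolicyWindow.hitRatio  = 1.0f;      // fraction of accesses in the window given hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // try to keep this data in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // evict-first for everything else

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // ... launch kernels on `stream` that repeatedly read d_hot ...

    // Reset so later kernels are not affected by the persisting lines.
    cudaCtxResetPersistingL2Cache();
}
```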

With cached loads we are talking about data that is read repeatedly.

If the data in your kernel is specific to each thread block (but larger than L1), then you can often split the work of that block across several blocks (instead of having those other blocks access different data), so that the overall working set across all blocks is smaller and fits into L2.
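
A rough sketch of that splitting idea (all names hypothetical, and the body is just a placeholder for whatever reuse your kernel actually does): blocks that share a slab_id work on the same slab, so fewer distinct slabs are resident at once and the combined L2 working set shrinks.

```cpp
__global__ void process_slab_split(const float* __restrict__ in,
                                   float*       __restrict__ out,
                                   int slab_elems, int splits_per_slab)
{
    // Blocks with the same slab_id cooperate on one slab of data.
    int slab_id  = blockIdx.x / splits_per_slab;
    int split_id = blockIdx.x % splits_per_slab;

    const float* slab_in  = in  + (size_t)slab_id * slab_elems;
    float*       slab_out = out + (size_t)slab_id * slab_elems;

    // Each of the splits_per_slab blocks handles an interleaved share of the slab.
    for (int i = split_id * blockDim.x + threadIdx.x;
         i < slab_elems;
         i += splits_per_slab * blockDim.x)
    {
        slab_out[i] = 2.0f * slab_in[i];  // placeholder computation
    }
}
// Launch with gridDim.x = num_slabs * splits_per_slab.
```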

Also consider the data of the concurrently running blocks: if you have more blocks than SMs, the set of resident blocks, and therefore their data, changes over time.