Use of L2 cache

How can I make full use of the L2 cache? I profiled a program and found a metric called lts__t_sectors_srcunit_ltcfabric. Does the traffic measured by this metric cause a severe decrease in memory access speed?

Also, as I try to use the L2 cache better by increasing data locality, how many MB of data should I try to keep resident? For the A100’s 40MB L2, would 32MB be a good choice to avoid eviction?

Or, if I need to avoid data transfer between the two L2 cache partitions, should 16MB be the maximum?

The A100 has an L2 cache architecture that is partitioned into two halves. When an access is needed from one half to the other, that access flows over the “fabric” and is measured by the metric you mentioned (it is also visible in the Nsight Compute GUI memory breakdown section).

There is no reason to assume that a particular level of activity on that fabric/metric is somehow a problem. If your analysis leads to the conclusion that it is a limiter to performance, then you are effectively cache-bound (a form of memory-bound), and AFAIK NVIDIA publishes no documentation that defines how that “fabric” is used, how the traffic across it manifests, or what could be done at a code level to try to modify that fabric traffic.

If you have an optimization strategy for general L2-bound cases, that would be the path to follow. The usual memory-bound optimization suggestions apply: make more efficient use of memory, reduce memory usage, reduce the footprint to match L1, etc.

Thanks! So how about this question: how many MB of data should I build locality around on the A100? Is 32MB small enough to prevent serious eviction? What about 64MB?

With MIG partitioning it could be possible to access the two halves separately.

With MIG, each instance’s processors have separate and isolated paths through the entire memory system - the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user’s workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces.

I would expect that the two A100 halves are natural MIG boundaries.

It of course depends on the application whether it can run with advantage on half the L2 cache size, but with two instances in parallel.

The L2 cache size is queryable as a device property. If you are aiming for a particular footprint, I see no reason not to aim for the actual cache size. If you know of some other streaming data that your kernel will use, then of course you could subtract that from the size if you wish, but you’ve provided no such information here. You could also do profiler-guided tuning perhaps (try different sizes and study cache hit rate metrics, or something like that.)
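
A minimal sketch of that query (assuming device 0; error checking omitted):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Total L2 size (40 MB on A100) and, on cc 8.0+, the maximum amount
    // that can be set aside for persisting accesses.
    printf("L2 cache size:               %d bytes\n", prop.l2CacheSize);
    printf("Max persisting L2 set-aside: %d bytes\n", prop.persistingL2CacheMaxSize);
    return 0;
}
```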

The A100 cache can also be partitioned by the user if you want to reserve specific sizes for specific data sets.
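
For instance, a hedged sketch of the set-aside mechanism on cc 8.0+ devices (the 30 MB figure is just an example, clamped to what the device allows):

```cpp
#include <algorithm>
#include <cuda_runtime.h>

void reserve_persisting_l2()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Carve out part of L2 for "persisting" accesses; normal/streaming
    // accesses then compete only for the remainder.
    size_t setAside = std::min<size_t>(30u * 1024 * 1024,
                                       (size_t)prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside);
}
```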

@Robert_Crovella I vaguely remember having read that there could be duplicate entries in both halves of the L2 cache. Can you confirm or deny?

This is mentioned on page 7 of the Citadel “Dissecting Ampere” pdf.

MIG at 1/2 GPU or smaller will avoid partition crossing.
In full-GPU MIG or non-MIG configuration, the virtual-to-physical address mapping is interleaved across the partitions and is not controllable by the user today. This results in values being cached in both the Point of Coherence L2 slice and the Local Caching Node L2 slice, effectively halving the L2 cache capacity when SMs on both sides of the GPU are accessing the data.

So if I only access less than 20MB of data through the L2 cache, will it be much faster than accessing 21MB of data? Does data transfer between the L2 partitions cost a lot of extra time?

Also, suppose I have X MB of data to load in total, with X1 MB per threadblock, and Y MB of data to write back in total, with each threadblock writing Y1 MB back to gmem. What will the behavior of the L2 cache be? (Assume 4 threadblocks per SM.)
Will the L2 cache just hold 4 * 108 * X1 + 4 * 108 * Y1 MB of data, or do I need to take X and Y into account?

Even if you were to provide access patterns, a cache's behavior is temporal, so I recommend you write the code and profile. If your working set is larger than the cache, you will likely see a drop-off after 1/2 the cache size, but the slope of that drop-off depends on the access patterns over time. CUDA provides an API for controlling L2 access properties. These can help hold frequently accessed data in L2 while marking data that is read only once (aka “streamed”) to be preferentially evicted first.
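
A minimal sketch of that API, with hypothetical names (d_hot, hotBytes) for a buffer you want to keep resident; it pairs with the set-aside shown earlier in the thread:

```cpp
#include <cuda_runtime.h>

void mark_hot_data_persisting(cudaStream_t stream, void* d_hot, size_t hotBytes)
{
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_hot;
    attr.accessPolicyWindow.num_bytes = hotBytes;  // must not exceed cudaDevAttrMaxAccessPolicyWindowSize
    attr.accessPolicyWindow.hitRatio  = 1.0f;      // fraction of accesses in the window given hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // try to keep this data in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // evict-first for everything else

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // ... launch kernels on `stream` that repeatedly read d_hot ...

    // Reset so later kernels are not affected by the persisting lines.
    cudaCtxResetPersistingL2Cache();
}
```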

With cached loads we are talking about data that is read repeatedly.

If the data in your kernel is specific to each thread block (but larger than L1), then you can often split the work of that block across several blocks (instead of having those other blocks access different data), so that the overall working set across all blocks is smaller and fits into L2.
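
A rough sketch of that splitting idea (all names hypothetical, and the body is just a placeholder for whatever reuse your kernel actually does): blocks that share a slab_id work on the same slab, so fewer distinct slabs are resident at once and the combined L2 working set shrinks.

```cpp
__global__ void process_slab_split(const float* __restrict__ in,
                                   float*       __restrict__ out,
                                   int slab_elems, int splits_per_slab)
{
    // Blocks with the same slab_id cooperate on one slab of data.
    int slab_id  = blockIdx.x / splits_per_slab;
    int split_id = blockIdx.x % splits_per_slab;

    const float* slab_in  = in  + (size_t)slab_id * slab_elems;
    float*       slab_out = out + (size_t)slab_id * slab_elems;

    // Each of the splits_per_slab blocks handles an interleaved share of the slab.
    for (int i = split_id * blockDim.x + threadIdx.x;
         i < slab_elems;
         i += splits_per_slab * blockDim.x)
    {
        slab_out[i] = 2.0f * slab_in[i];  // placeholder computation
    }
}
// Launch with gridDim.x = num_slabs * splits_per_slab.
```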

Also consider the data of the concurrently running blocks: if you have more blocks than SMs, the set of resident blocks, and therefore their data, changes over time.