How to reach peak bandwidth of L2 cache on A100

783161219 · December 21, 2021, 6:26am

Hi, I have a kernel whose bottleneck is the L2 cache. When I run it on an A100, the L2 bandwidth utilization is very low, only about 40% of peak.

So I tried to construct a kernel that reaches peak L2 utilization, but according to Nsight Compute I can only reach about 55%, which is about 4900 GB/s. I also found two different figures for L2 bandwidth on the web. One is 5120 B/clk, which equals 6723 GB/s in binary units (about 7219 GB/s in decimal units) if I multiply by the frequency, which is 1410 MHz. The second is 2.3x that of V100, which is 2.3 * 4100 GB/s (9430 GB/s).

My profiling result seems closer to the second one, so I want to know what the peak bandwidth of the L2 cache on A100 is. Is this all the bandwidth between L2 and L1, or does it also include the bandwidth between the two L2 partitions? Is the L2 frequency identical to the SM frequency? And is there any sample code that shows how to reach peak L2 bandwidth? Thanks.

Robert_Crovella · December 21, 2021, 10:29am

V100 L2 bandwidth is about 2.1 TB/s (measured).

Therefore I would expect the measured A100 bandwidth to be in the range of 2.3 x 2.1 = 4.83 TB/s. That link gives a possible starting point for code to measure it. Your 4900 GB/s number seems reasonable.
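
For illustration only, a minimal sketch of this kind of measurement might look like the following. It is not the code behind the link mentioned above. The buffer size (16 MiB, assumed to fit comfortably in the A100's 40 MB L2), the launch configuration, the repeat count, and the names `l2_read` and `gb_per_s` are all assumptions chosen for the example. The kernel reads with `__ldcg` so that loads are cached in L2 and not in L1, and the result is only an approximation of achieved read bandwidth.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Convert bytes moved in `ms` milliseconds to GB/s (10^9 bytes per second).
static double gb_per_s(double bytes, double ms) { return bytes / (ms * 1e-3) / 1e9; }

// The grid as a whole sweeps the buffer `iters` times with a grid-stride loop.
// __ldcg caches at the L2 level only, so after a warm-up pass the reads
// should be served from L2 as long as the buffer stays resident there.
__global__ void l2_read(const int4* buf, size_t n, int iters, int* sink)
{
    int acc = 0;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (int it = 0; it < iters; ++it) {
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            int4 v = __ldcg(buf + i);          // 16-byte load per thread
            acc += v.x + v.y + v.z + v.w;
        }
    }
    // Never true for the zero-filled buffer; it only keeps the loads alive.
    // Worth confirming in the profiler that the loads were not optimized away.
    if (acc == 0x7fffffff) *sink = acc;
}

int main()
{
    const size_t bytes = 16u << 20;            // 16 MiB working set (assumption)
    const size_t n = bytes / sizeof(int4);
    const int iters = 200;                     // illustrative repeat count

    int4* buf = nullptr;
    int* sink = nullptr;
    cudaMalloc(&buf, bytes);
    cudaMalloc(&sink, sizeof(int));
    cudaMemset(buf, 0, bytes);

    int device = 0, sms = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device);
    const int threads = 1024;
    const int blocks = sms * 2;                // 2 blocks per SM: a starting point to tune

    l2_read<<<blocks, threads>>>(buf, n, 1, sink);   // warm-up pass to populate L2

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    l2_read<<<blocks, threads>>>(buf, n, iters, sink);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("approx. L2 read bandwidth: %.0f GB/s\n", gb_per_s((double)bytes * iters, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    cudaFree(sink);
    return 0;
}
```

The working-set size and the number of blocks per SM are the obvious knobs to vary. The number printed here counts decimal gigabytes, so it is worth checking which unit a quoted peak figure uses before comparing.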

For Nsight Compute, the metric I would use is lts__t_sectors_srcunit_tex_op_read.per_second

I personally don’t ever expect to be able to write code that reaches peak bandwidth.

783161219 · December 21, 2021, 11:11am

This makes sense, thank you very much.

