How to reach peak bandwidth of L2 cache on A100

Hi, I have a kernel whose bottleneck is the L2 cache. When I run it on an A100, the L2 bandwidth utilization is very low, only about 40% of peak.

So I tried to construct a kernel that reaches peak L2 utilization, but according to Nsight Compute I can only get to about 55%, which is about 4900 GB/s. I also found two different figures for the A100 L2 bandwidth on the web. One is 5120 B/clk, which comes to about 6723 GiB/s (7219 GB/s) when multiplied by the 1410 MHz clock. The other is 2.3x that of V100, which is 2.3 * 4100 GB/s = 9430 GB/s.
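Spelling out the arithmetic behind these figures, here is a small host-only sketch (plain C++ that also compiles under nvcc). It reads the first figure as 5120 bytes per clock. The function names are made up for illustration, and the inputs are just the numbers quoted above, not verified specifications.

```cuda
#include <cassert>
#include <cstdio>

// Bytes per clock times clock rate (Hz) gives bytes per second.
inline double bytes_per_second(double bytes_per_clk, double clock_hz) {
    return bytes_per_clk * clock_hz;
}

// Decimal gigabytes (1e9 bytes).
inline double to_GB(double bytes) { return bytes / 1e9; }

// Binary gibibytes (2^30 bytes).
inline double to_GiB(double bytes) { return bytes / (1024.0 * 1024.0 * 1024.0); }

// Peak implied by a measured throughput and the utilization the profiler reports.
inline double implied_peak(double measured, double utilization) {
    assert(utilization > 0.0 && utilization <= 1.0);
    return measured / utilization;
}

// Prints the candidate peaks discussed in the post (inputs are the quoted numbers).
inline void print_candidates() {
    double per_clk = bytes_per_second(5120.0, 1410e6);
    printf("5120 B/clk @ 1410 MHz: %.0f GB/s (%.0f GiB/s)\n",
           to_GB(per_clk), to_GiB(per_clk));
    printf("2.3 x 4100 GB/s:       %.0f GB/s\n", 2.3 * 4100.0);
    printf("4900 GB/s at 55%%:      %.0f GB/s implied peak\n",
           implied_peak(4900.0, 0.55));
}
```

Calling `print_candidates()` from `main` shows which of the two published figures the profiler's utilization percentage is closer to.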

My profiling result seems closer to the second figure, so I'd like to know: what is the peak L2 cache bandwidth on A100? Is it only the bandwidth between L2 and L1, or does it also include the bandwidth between the two L2 partitions? Is the L2 frequency identical to the SM frequency? And is there any sample code that shows how to reach peak L2 bandwidth? Thanks.

V100 L2 bandwidth is about 2.1 TB/s (measured).

Therefore I would expect the measured A100 bandwidth to be around 2.3 x 2.1 = 4.83 TB/s. That link gives a possible starting point for code to measure it. Your 4900 GB/s number seems reasonable.
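Independently of that link, a minimal sketch of such a measurement might look like the following. It is only an illustration under assumptions, not a tuned benchmark: the kernel and function names are made up, the buffer is assumed to fit in L2 (the A100 has 40 MB), loads use `__ldcg` so they are served from L2 rather than L1, and the launch shape of 8 blocks of 256 threads per SM is a guess that may need tuning.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Every thread walks the buffer `iters` times using 16-byte loads.
__global__ void l2_read_kernel(const uint4* buf, size_t n_vec,
                               int iters, unsigned* sink) {
    size_t tid    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = gridDim.x * (size_t)blockDim.x;
    unsigned acc = 0;
    for (int it = 0; it < iters; ++it) {
        for (size_t i = tid; i < n_vec; i += stride) {
            uint4 v = __ldcg(&buf[i]);  // cache in L2 only, bypass L1
            acc += v.x ^ v.y ^ v.z ^ v.w;
        }
    }
    // Never true for a zero-filled buffer; only keeps the loads from being removed.
    if (acc == 0xFFFFFFFFu) *sink = acc;
}

// Returns the observed read throughput in GB/s (1e9 bytes per second).
double measure_l2_read_GBps(size_t bytes, int iters) {
    assert(bytes >= sizeof(uint4) && iters > 0);
    size_t n_vec = bytes / sizeof(uint4);

    uint4* buf = nullptr;
    unsigned* sink = nullptr;
    CUDA_CHECK(cudaMalloc(&buf, n_vec * sizeof(uint4)));
    CUDA_CHECK(cudaMalloc(&sink, sizeof(unsigned)));
    CUDA_CHECK(cudaMemset(buf, 0, n_vec * sizeof(uint4)));

    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    int threads = 256;
    int blocks  = prop.multiProcessorCount * 8;  // assumption: fills every SM

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up pass so the timed run starts with the buffer resident in L2.
    l2_read_kernel<<<blocks, threads>>>(buf, n_vec, 1, sink);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaEventRecord(start));
    l2_read_kernel<<<blocks, threads>>>(buf, n_vec, iters, sink);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    double total_bytes = (double)n_vec * sizeof(uint4) * iters;

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(buf));
    CUDA_CHECK(cudaFree(sink));
    return total_bytes / (ms * 1e-3) / 1e9;
}
```

Calling `measure_l2_read_GBps(16u << 20, 200)` from `main` and printing the result gives one data point. It is worth sweeping the buffer size and block count, and cross-checking the number against the profiler, since the compiler or the launch configuration can easily skew a microbenchmark like this.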

For Nsight Compute, the metric I would use is lts__t_sectors_srcunit_tex_op_read.per_second
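Note that this metric is reported in sectors per second rather than bytes. Assuming the usual 32-byte L2 sector size, a tiny host-side helper can turn a reading into GB/s (the function name is hypothetical, for illustration only):

```cuda
#include <cassert>

// Converts an L2 sector rate into GB/s (1e9 bytes per second).
// Assumption: one L2 sector is 32 bytes.
inline double sectors_per_s_to_GBps(double sectors_per_second) {
    assert(sectors_per_second >= 0.0);
    const double kSectorBytes = 32.0;
    return sectors_per_second * kSectorBytes / 1e9;
}
```

For example, a reading of 1.5e11 sectors per second corresponds to 4800 GB/s.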

I personally don’t ever expect to be able to write code that reaches the peak bandwidth.

This makes sense, thank you very much.
