Hi, I have a kernel whose bottleneck is the L2 cache. When I run it on an A100, the L2 bandwidth utilization is very low, only about 40% of peak.
So I tried to construct a kernel that reaches peak L2 utilization, but according to Nsight Compute I can only reach about 55%, which is about 4900 GB/s. I also found two different figures for L2 bandwidth on the web. One is 5120 B/clk, which gives about 6723 GiB/s (roughly 7219 GB/s) when multiplied by the 1410 MHz clock. The other is 2.3x that of the V100, i.e. 2.3 * 4100 GB/s = 9430 GB/s.
My profiling result seems closer to the second one (4900 GB/s at 55% implies a peak of roughly 8900 GB/s). So I want to know: what is the peak L2 cache bandwidth on the A100? Does that figure cover only the bandwidth between L2 and L1, or does it also include the bandwidth between the two L2 partitions? Is the L2 frequency identical to the SM frequency? And is there any sample code that shows how to reach peak L2 bandwidth? Thanks.
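
For context, here is a minimal sketch of an L2 read-bandwidth test of the kind described above. It is not the actual kernel from this post and is not claimed to reach peak. Everything in it is an assumption chosen for illustration: the 8 MiB buffer (small enough to sit inside the A100's 40 MB L2), the grid of 8 blocks per SM, the 4096 loads per thread, and the names `l2_read` and `bandwidth_gbps`. It uses `__ldcg` so that loads are cached at L2 rather than L1, and it reports GB/s with 1 GB = 1e9 bytes. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes moved and elapsed milliseconds -> GB/s (1 GB = 1e9 B).
static double bandwidth_gbps(double bytes, double ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Each thread issues `loads` 16-byte loads that wrap around a small buffer.
// __ldcg caches at the L2 level only, so repeated reads are not served from L1.
// `mask` must be (number of uint4 elements - 1), with the element count a power of two.
__global__ void l2_read(const uint4* buf, unsigned mask, int loads, uint4* sink) {
    unsigned stride = gridDim.x * blockDim.x;
    unsigned idx = (blockIdx.x * blockDim.x + threadIdx.x) & mask;
    uint4 acc = make_uint4(0, 0, 0, 0);
    #pragma unroll 8
    for (int k = 0; k < loads; ++k) {
        uint4 v = __ldcg(buf + idx);          // adjacent threads read adjacent 16 B
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
        idx = (idx + stride) & mask;          // wrap inside the buffer
    }
    // Data-dependent store so the compiler cannot discard the loads.
    if (acc.x == 0xFFFFFFFFu) *sink = acc;
}

int main() {
    int dev = 0, sms = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);

    const size_t bytes = 8u << 20;                      // assumed 8 MiB working set
    const unsigned n_vec = (unsigned)(bytes / sizeof(uint4));
    const int loads = 4096, threads = 256;              // assumed tuning values
    const int blocks = sms * 8;

    uint4 *buf = nullptr, *sink = nullptr;
    cudaMalloc(&buf, bytes);
    cudaMalloc(&sink, sizeof(uint4));
    cudaMemset(buf, 0, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    l2_read<<<blocks, threads>>>(buf, n_vec - 1, loads, sink);   // warm-up, fills L2
    cudaEventRecord(t0);
    l2_read<<<blocks, threads>>>(buf, n_vec - 1, loads, sink);   // timed run
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double moved = (double)blocks * threads * loads * sizeof(uint4);
    printf("L2 read bandwidth: %.1f GB/s\n", bandwidth_gbps(moved, ms));

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(buf);
    cudaFree(sink);
    return 0;
}
```

Built with something like `nvcc -O3 -arch=sm_80`, the printed figure can be compared against the L2 throughput that Nsight Compute reports for the timed launch. Varying the buffer size, blocks per SM, and unroll factor should show how sensitive the result is to those assumptions.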