Confused about the L1/SMEM BW reported by Nsight-Compute Hierarchical Roofline plots

When I profile GEMM kernels on an A100 GPU and plot the hierarchical roofline of the GPU together with the achieved values of the kernels, I see that the L1 peak traffic (BW) is around 58 TB/s. That seems too high and unrealistic compared to the numbers reported in the literature, for example here

Note that Nsight Compute calculates the L1 peak traffic using this formula: l1tex__t_bytes.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second
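For reference, the formula is just bytes per cycle multiplied by cycles per second. Here is a minimal Python sketch of that arithmetic. The function name and the input values are my own assumptions for illustration, not values read from Nsight Compute: I assume 108 SMs, 128 bytes per clock per SM for L1/shared memory, and a 1.41 GHz clock.

```python
# Sketch of the roofline peak formula:
#   l1tex__t_bytes.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second
# All numeric inputs below are ASSUMED for illustration, not profiler output.

def l1_peak_bandwidth(bytes_per_cycle: float, cycles_per_second: float) -> float:
    """Return peak L1 bandwidth in bytes/s from the two metric values."""
    return bytes_per_cycle * cycles_per_second

ASSUMED_SM_COUNT = 108                # assumed number of SMs on an A100
ASSUMED_BYTES_PER_CLK_PER_SM = 128    # assumed L1/SMEM rate per SM
ASSUMED_CLOCK_HZ = 1.41e9             # assumed SM clock (boost)

# ".sum" aggregates over all L1TEX units, so multiply the per-SM rate by the SM count.
bytes_per_cycle_all_sms = ASSUMED_SM_COUNT * ASSUMED_BYTES_PER_CLK_PER_SM
peak = l1_peak_bandwidth(bytes_per_cycle_all_sms, ASSUMED_CLOCK_HZ)

print(f"Estimated L1 peak: {peak / 1e12:.2f} TB/s")
```

Under these assumptions the estimate comes out at roughly 19.5 TB/s, so the two metric values reported by the tool would be worth checking individually against it.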

Thank you.

This number is the bandwidth from the L1 cache, which is much faster than device memory. Are you seeing numbers for L1 cache at that link? I see the DRAM bandwidths in the table at the link you provided.

Thank you for your response.

Yes, the link I shared lists the L1 cache BW as 19 TB/s, which is well below the number I see in the Nsight Compute roofline.
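To make the gap concrete, here is a rough sketch that works backwards from each figure to an implied per-SM rate. The SM count (108) and clock (1.41 GHz) are assumptions on my side, as is the helper name; only the 58 TB/s and 19 TB/s figures come from this thread.

```python
# Back-of-envelope: what per-SM bytes/clock does each bandwidth figure imply?
# SM count and clock are ASSUMED values, not taken from the profiler.

ASSUMED_SM_COUNT = 108
ASSUMED_CLOCK_HZ = 1.41e9

def implied_bytes_per_clk_per_sm(bandwidth_bytes_per_s: float) -> float:
    """Divide an aggregate bandwidth by (SM count * clock) to get bytes/clk/SM."""
    return bandwidth_bytes_per_s / (ASSUMED_SM_COUNT * ASSUMED_CLOCK_HZ)

nsight_figure = 58e12   # reported by the Nsight Compute roofline (this thread)
blog_figure = 19e12     # reported at the linked page (this thread)

nsight_rate = implied_bytes_per_clk_per_sm(nsight_figure)
blog_rate = implied_bytes_per_clk_per_sm(blog_figure)

print(f"Nsight roofline implies {nsight_rate:.0f} B/clk/SM")
print(f"Linked figure implies  {blog_rate:.0f} B/clk/SM")
print(f"Ratio: {nsight_figure / blog_figure:.2f}x")
```

With these assumed inputs, the linked figure corresponds to a bit under 128 bytes per clock per SM, while the roofline number implies about three times that.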

Thanks. I missed that somehow. Do you know the source of that number? This looks to be a chart that wasn’t created by NVIDIA. Having said that, I talked with the engineering team, and it sounds like there may be an issue with the number used in Nsight Compute. I will file a bug to get it investigated and will update here if there is more information.

As far as I understand, the numbers in this blog were derived from the official NVIDIA documentation and confirmed by running benchmarks.