Confused about the L1/SMEM BW reported by Nsight-Compute Hierarchical Roofline plots

When I profile GEMM kernels on an A100 GPU and plot the hierarchical roofline together with the kernels' achieved values, the L1 peak traffic (BW) comes out at around 58 TB/s. That seems unrealistically high compared to numbers reported in the literature, for example here

Note that Nsight Compute calculates the L1 peak traffic with this formula: l1tex__t_bytes.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second
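
For reference, the formula is just peak bytes per cycle multiplied by cycles per second. A minimal Python sketch, assuming a clock of 1.41 GHz (the A100 boost clock; the actual l1tex__cycles_elapsed.avg.per_second of a profiled run may differ), shows what per-cycle throughput a 58 TB/s peak would imply. The function names are hypothetical helpers, not part of Nsight Compute.

```python
# Assumption: 1.41 GHz is the A100 boost clock. The profiler uses the measured
# l1tex__cycles_elapsed.avg.per_second of the run, which may differ.
ASSUMED_CYCLES_PER_SECOND = 1.41e9


def peak_bandwidth(bytes_per_cycle, cycles_per_second):
    """Peak bandwidth in bytes/s: peak bytes per cycle times cycles per second."""
    return bytes_per_cycle * cycles_per_second


def implied_bytes_per_cycle(bandwidth_bytes_per_s, cycles_per_second):
    """Invert the formula to recover the per-cycle peak behind a reported bandwidth."""
    return bandwidth_bytes_per_s / cycles_per_second


reported_l1_peak = 58e12  # about 58 TB/s, as read from the roofline plot
implied = implied_bytes_per_cycle(reported_l1_peak, ASSUMED_CYCLES_PER_SECOND)
print(f"Implied L1 peak: {implied:.0f} bytes/cycle")
```

Working backwards like this makes it easier to compare the tool's number against per-clock figures quoted in vendor documentation.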

Thank you.

This number is the bandwidth from the L1 cache, which is much faster than device memory. Are you seeing numbers for L1 cache at that link? I see the DRAM bandwidths in the table at the link you provided.

Thank you for your response.

Yes, the link I shared lists the L1 cache BW as 19 TB/s, which is well below the number I see in the Nsight roofline.

Thanks. I missed that somehow. Do you know the source of that number? This looks to be a chart that wasn’t created by NVIDIA. Having said that - I talked with the engineering team and it sounds like there may be an issue with the number used in Nsight Compute. I will file a bug to get it investigated and can update here if there is more information.

As far as I understand, the numbers in this blog were derived from the official NVIDIA documentation and confirmed by running benchmarks.

Hi @jmarusarz, are there any updates on this?

We have made progress on this internally that should be available in the next version. We don’t provide specific information on release dates, but feel free to check back in if you try the new version and are not seeing any changes.

We’ve made changes in the 2023.2 version included in CTK 12.2. Can you please try this new version and let us know if you see different data? Thanks.

I encountered the same issue while profiling on an A100 GPU.
The L1 peak bandwidth seems to be unchanged, while L2 increased by a factor of 1.5. I am currently using version 2023.1.1, which, according to the documentation, is the version where the hierarchical roofline chart issue was resolved.

There were other changes in 2023.1.1 but more fixes came in 2023.2. Please try that version to see what the results show.

Apparently I was still querying the old metrics even though I was using the newer ncu version. Sorry for the confusion. I now get the expected L1 cache value of 13824 bytes/clock, but the L2 value differs from the 5120 bytes/clock bandwidth reported in the Ampere whitepaper here. The profiler returns 6912 bytes/clock instead.

Would you have any insight as to why it uses 6912 bytes/cycle instead of 5120 bytes/cycle? Much appreciated. The value comes from the “Theoretical L2 Cache Bytes Accessible” metric, l1tex__m_xbar2l1tex_read_bytes.sum.peak_sustained, in the roofline chart section files of version 2023.2.

A future version of the tool will change the definition in Nsight Compute to match the value of 5120 in the whitepaper. You can use that value manually until it is released. Thanks for submitting this issue.
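
Until then, applying the whitepaper value manually could look like the sketch below. It converts the per-clock values quoted in this thread to TB/s under an assumed 1.41 GHz clock (the A100 boost clock; substitute the clock rate of your own run). The helper name and dictionary keys are hypothetical and only serve the illustration.

```python
# Assumption: 1.41 GHz (A100 boost clock). Replace with the clock rate
# measured during your own profiling run.
ASSUMED_CLOCK_HZ = 1.41e9

# Per-clock peak values as quoted in this thread.
PEAK_BYTES_PER_CLOCK = {
    "L1 (ncu 2023.2)": 13824,
    "L2 (ncu 2023.2)": 6912,
    "L2 (Ampere whitepaper)": 5120,
}


def to_tb_per_s(bytes_per_clock, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a bytes/clock peak into TB/s at the given clock rate."""
    return bytes_per_clock * clock_hz / 1e12


for name, bytes_per_clock in PEAK_BYTES_PER_CLOCK.items():
    print(f"{name}: {bytes_per_clock} bytes/clock = {to_tb_per_s(bytes_per_clock):.2f} TB/s")

# How far the tool's L2 roof sits above the whitepaper value.
l2_ratio = PEAK_BYTES_PER_CLOCK["L2 (ncu 2023.2)"] / PEAK_BYTES_PER_CLOCK["L2 (Ampere whitepaper)"]
print(f"L2 tool/whitepaper ratio: {l2_ratio:.2f}")
```

With this assumed clock, 13824 bytes/clock corresponds to roughly 19.5 TB/s, in line with the 19 TB/s figure mentioned earlier in the thread, and the tool's L2 roof is 1.35 times the whitepaper value.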

Thanks for the response. I look forward to all the future advancements.