Confused about the L1/SMEM BW reported by Nsight-Compute Hierarchical Roofline plots

m_ali102 · November 1, 2022, 7:17pm

When I profile GEMM kernels on an A100 GPU and plot the hierarchical roofline together with the kernels' achieved values, the L1 peak traffic (bandwidth) comes out at around 58 TB/s. That seems unrealistically high compared to numbers reported in the literature, for example here

Note that the Nsight Compute tool calculates the L1 peak traffic using this formula: l1tex__t_bytes.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second
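For readers following along: the formula above is a per-cycle byte count multiplied by a clock rate. Below is a minimal Python sketch of that arithmetic. It uses the 13824 bytes/clock L1 value reported later in this thread and an assumed clock of 1.41 GHz (the A100 boost clock). The profiler uses the clock rate it actually measures, so real numbers will differ, and the function names here are purely illustrative.

```python
def peak_bandwidth_bytes_per_s(bytes_per_cycle, cycles_per_second):
    """Peak traffic = bytes moved per cycle * cycles per second."""
    return bytes_per_cycle * cycles_per_second


def implied_bytes_per_cycle(bandwidth_bytes_per_s, cycles_per_second):
    """Invert the formula: which per-cycle value would produce this bandwidth?"""
    return bandwidth_bytes_per_s / cycles_per_second


# Assumption: 1.41 GHz. Substitute the l1tex__cycles_elapsed.avg.per_second
# value from your own profile for a faithful result.
ASSUMED_CLOCK_HZ = 1.41e9

# L1 bytes/clock value reported later in this thread for the A100.
L1_BYTES_PER_CYCLE = 13824

l1_peak = peak_bandwidth_bytes_per_s(L1_BYTES_PER_CYCLE, ASSUMED_CLOCK_HZ)
print(f"L1 peak at assumed clock: {l1_peak / 1e12:.2f} TB/s")

# Per-cycle value that a 58 TB/s roofline would imply at the same assumed clock.
implied = implied_bytes_per_cycle(58e12, ASSUMED_CLOCK_HZ)
print(f"58 TB/s implies about {implied:.0f} bytes/cycle")
```

Under the assumed clock, 13824 bytes/clock works out to roughly 19.5 TB/s, which is in line with the 19 TB/s figure cited in the thread. A 58 TB/s peak would need a much larger per-cycle value.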

Thank you.

jmarusarz · November 2, 2022, 3:04pm

This number is the bandwidth from the L1 cache, which is much faster than device memory. Are you seeing numbers for L1 cache at that link? I see the DRAM bandwidths in the table at the link you provided.

m_ali102 · November 2, 2022, 3:24pm

Thank you for your response.

Yes, the link I shared lists the L1 cache bandwidth as 19 TB/s, which is less than the number I see in the Nsight Compute roofline.

jmarusarz · November 3, 2022, 8:39pm

Thanks. I missed that somehow. Do you know the source of that number? This looks to be a chart that wasn’t created by NVIDIA. Having said that - I talked with the engineering team and it sounds like there may be an issue with the number used in Nsight Compute. I will file a bug to get it investigated and can update here if there is more information.

m_ali102 · November 3, 2022, 9:27pm

As far as I understand, the numbers in this blog were derived from the official NVIDIA documentation and confirmed by running benchmarks.

tanvishr1197 · March 22, 2023, 5:07pm

Hi @jmarusarz, are there any updates on this?

jmarusarz · March 22, 2023, 8:54pm

We have made progress on this internally that should be available in the next version. We don’t provide specific information on release dates, but feel free to check back in if you try the new version and are not seeing any changes.

jmarusarz · June 29, 2023, 8:37pm

We’ve made changes in the 2023.2 version included in CTK 12.2. Can you please try this new version and let us know if you see different data? Thanks.

LOYy5UWATx · July 27, 2023, 10:24am

Encountered the same issue while profiling on A100 GPU.
L1 peak bandwidth seems to be unchanged, while L2 increased by a factor of 1.5. (I am currently using version 2023.1.1, which according to the documentation is where the hierarchical roofline chart issue was resolved.)

jmarusarz · July 27, 2023, 8:15pm

There were other changes in 2023.1.1 but more fixes came in 2023.2. Please try that version to see what the results show.

LOYy5UWATx · July 28, 2023, 9:47am

Apparently I was still querying the old metrics even though I was using the newer ncu version. Sorry for the confusion. The L1 cache now shows the expected value of 13824 bytes/clock, but the L2 value differs from the 5120 bytes/clock bandwidth reported in the Ampere whitepaper here. The profiler returns 6912 bytes/clock instead.

LOYy5UWATx · July 31, 2023, 8:25am

Would you have any insight as to why it uses 6912 bytes/cycle instead of 5120 bytes/cycle? Much appreciated. This is based on the “Theoretical L2 Cache Bytes Accessible” metric, l1tex__m_xbar2l1tex_read_bytes.sum.peak_sustained, from the roofline chart section files of version 2023.2.

jmarusarz · August 17, 2023, 8:42pm

A future version of the tool will change the definition in Nsight Compute to match the value of 5120 in the whitepaper. You can use that value manually until it is released. Thanks for submitting this issue.
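One way to apply the whitepaper value by hand, as suggested above, is to rescale the L2 roofline peak by the ratio of the two per-cycle figures. The sketch below is only an illustration: the helper name is hypothetical, and the 1.41 GHz clock is an assumption standing in for whatever clock rate your own profile reports.

```python
def rescale_peak(reported_peak, reported_bytes_per_cycle, corrected_bytes_per_cycle):
    """Scale a roofline peak from one per-cycle byte count to another."""
    return reported_peak * corrected_bytes_per_cycle / reported_bytes_per_cycle


REPORTED_L2_BYTES_PER_CYCLE = 6912    # value returned by the profiler (from the thread)
WHITEPAPER_L2_BYTES_PER_CYCLE = 5120  # value in the Ampere whitepaper (from the thread)

# Assumption: 1.41 GHz. Use the clock rate from your own profile instead.
ASSUMED_CLOCK_HZ = 1.41e9

reported_l2_peak = REPORTED_L2_BYTES_PER_CYCLE * ASSUMED_CLOCK_HZ
corrected_l2_peak = rescale_peak(
    reported_l2_peak,
    REPORTED_L2_BYTES_PER_CYCLE,
    WHITEPAPER_L2_BYTES_PER_CYCLE,
)

print(f"Reported L2 peak:  {reported_l2_peak / 1e12:.2f} TB/s")
print(f"Corrected L2 peak: {corrected_l2_peak / 1e12:.2f} TB/s")
print(f"Overestimate factor: {reported_l2_peak / corrected_l2_peak:.2f}")
```

The ratio 6912 / 5120 is 1.35, so any L2 roofline drawn from the profiler's value sits 35% above the whitepaper-based one, regardless of the clock used.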

LOYy5UWATx · August 17, 2023, 9:34pm

Thanks for the response. I look forward to all the future advancements.
