Incorrect Peak Performance Boundaries in Nsight Compute Roofline Charts

DiegoJimenez · June 23, 2022, 12:04pm

Hello

I’m having trouble understanding why, when plotting the roofline chart in Nsight Compute for a kernel running on an A100, the FP64 peak performance boundary is set at around 7.5 TFLOP/s (the V100 peak) and not at the roughly 9.5 TFLOP/s I would expect as the limit. I’m attaching a screenshot from an analysis of my kernel. You can see that the tool correctly identifies the A100 GPU, but the peak performance shown when I hover over the plateau is wrong. I’m using Nsight Compute version 2021.3. I also tried version 2022.2, but there the pop-up tooltip does not appear when I hover over the boundaries (Linux version).

Is there something wrong here, or does it set the peak performance to 7.5 TFLOP/s because that’s the nominal peak and not the theoretical one?

Thanks for your help.

Sanjiv.Satoor · June 28, 2022, 4:04pm

The roofline is constructed based on the clock rate at which the application was run (because that sets the upper limit).

What is the clock_rate for your profiling run? You can find this under Device Attributes on the Session page in the Nsight Compute UI.

DiegoJimenez · June 29, 2022, 3:41pm

Thanks for the answer @Sanjiv.Satoor, I’m really interested in correctly interpreting this chart. The reported clock_rate for my A100 was 1305000.

How is that clock_rate used to compute the FP boundary? Could you share that formula? I’m guessing the 9.5 TFLOP/s peak would then require some GPU Boost Clock?

DiegoJimenez · June 29, 2022, 5:31pm

I’ve figured it out. This has to do with the clock control option in Nsight Compute and the default application clock set on the A100 I’m running on (1305 MHz). I should set the clock rate through nvidia-smi and then profile the application with clock control disabled.

However, this still doesn’t add up to the 7.5 TFLOP/s boundary the plot is showing. How is that peak boundary computed from the clock rate?

Sanjiv.Satoor · July 5, 2022, 9:00am

Please refer to the reply posted here: About the flops in ncu report - #6 by Sanjiv.Satoor
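
For readers trying to reproduce the numbers discussed in this thread, here is a minimal sketch of the generic theoretical-peak arithmetic (units per SM × SM count × 2 FLOP per FMA × clock). It is not confirmed to be the exact formula Nsight Compute uses. The 108 SMs and 32 FP64 units per SM are assumptions taken from the public A100 specifications, the 1410 MHz boost clock is likewise an assumption, and clock_rate is assumed to be reported in kHz. The function names are hypothetical.

```python
# Sketch only: generic theoretical FP64 peak, not Nsight Compute's internal code.

A100_SM_COUNT = 108        # assumption: SM count from public A100 specs
A100_FP64_PER_SM = 32      # assumption: FP64 FMA units per SM
FLOP_PER_FMA = 2           # one fused multiply-add counts as two FLOP


def peak_fp64_tflops(sm_count, fp64_per_sm, clock_hz):
    """Theoretical FP64 peak in TFLOP/s at the given SM clock (Hz)."""
    return sm_count * fp64_per_sm * FLOP_PER_FMA * clock_hz / 1e12


def clock_mhz_for_peak(sm_count, fp64_per_sm, tflops):
    """Inverse: the SM clock (MHz) that would produce a given peak."""
    return tflops * 1e12 / (sm_count * fp64_per_sm * FLOP_PER_FMA) / 1e6


if __name__ == "__main__":
    reported_clock_khz = 1305000   # clock_rate value quoted in this thread
    assumed_boost_mhz = 1410       # assumption: A100 boost clock

    at_reported = peak_fp64_tflops(
        A100_SM_COUNT, A100_FP64_PER_SM, reported_clock_khz * 1e3)
    at_boost = peak_fp64_tflops(
        A100_SM_COUNT, A100_FP64_PER_SM, assumed_boost_mhz * 1e6)
    implied = clock_mhz_for_peak(A100_SM_COUNT, A100_FP64_PER_SM, 7.5)

    print(f"peak at reported clock_rate: {at_reported:.2f} TFLOP/s")
    print(f"peak at assumed boost clock: {at_boost:.2f} TFLOP/s")
    print(f"clock implied by a 7.5 TFLOP/s roof: {implied:.0f} MHz")
```

Under these assumptions, 1305 MHz gives about 9.0 TFLOP/s, while a 7.5 TFLOP/s roof corresponds to a clock of roughly 1085 MHz. That would be consistent with the profiler's clock control holding the GPU at a lower clock during collection than the application clock, but check the actual SM frequency recorded in your own report before drawing conclusions.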
