Hi, I profiled a simple matrix multiplication in PyTorch, and the FLOPS I get from the ncu report is lower than the theoretical peak FLOPS.
The code I tested on an A100 is the following:
import torch

n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")

# run the computation part
for i in range(200):
    if i % 100 == 0:
        print(i)
    torch.mm(x, y)
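For reference, a host-side timing check along these lines (just a sketch, assuming the same 4096x4096 FP32 matmul and the usual 2*n^3 FLOP count per matmul) gives the achieved TFLOP/s independently of the profilers:

import torch

n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")

# warm-up so kernel selection/caching does not skew the timing
for _ in range(10):
    torch.mm(x, y)
torch.cuda.synchronize()

iters = 200
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    torch.mm(x, y)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3  # elapsed_time() returns milliseconds
flops = 2 * n**3 * iters                 # 2*n^3 FLOPs per n x n matmul
print(f"achieved ~{flops / seconds / 1e12:.2f} TFLOP/s")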
When I hover over the roofline chart in the ncu report, it says the performance is about 14 TFLOPS, but the peak performance of the A100 should be about 19 TFLOPS. I checked the source code for the ncu roofline section in sections/SpeedOfLight_Roofline.py; the peak performance is 2 * action.metric_by_name("sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained").as_double().
Actually, I used dcgmi to test the benchmark code above and was able to get the 19 TFLOPS, so I'm confused by the results.
Could you explain the meaning of sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained and why it is less than the theoretical FLOPS?
You can find the expression used for calculating the "Peak Work" value shown in the single-precision roofline chart in "SpeedOfLight_RooflineChart.section":
Peak Work = derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 * sm__cycles_elapsed.avg.per_second
Where: derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2
(This is a derived metric defined in the same section file. Since an FMA instruction performs two floating-point operations, one multiplication and one addition, the FMA instruction count is multiplied by two.)
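As a rough sanity check of that formula (back-of-the-envelope numbers taken from the A100 datasheet, not from the section file), the per-cycle FFMA rate and the clock determine the reported peak:

# A100 datasheet values (assumption for this sketch): 108 SMs, 64 FP32 FMA lanes per SM
sms = 108
fp32_fma_lanes_per_sm = 64
ffma_per_cycle = sms * fp32_fma_lanes_per_sm  # ~ peak_sustained value = 6912 FMA/cycle
flop_per_cycle = 2 * ffma_per_cycle           # derived__..._x2 = 13824 FLOP/cycle

boost_clock_hz = 1.41e9   # A100 boost clock
lower_clock_hz = 1.0e9    # example of a lower clock observed during profiling

print(flop_per_cycle * boost_clock_hz / 1e12)  # ~19.5 TFLOP/s
print(flop_per_cycle * lower_clock_hz / 1e12)  # ~13.8 TFLOP/s

The reported peak therefore scales linearly with sm__cycles_elapsed.avg.per_second, so a lower clock during metric collection directly lowers the roofline ceiling.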
To get the peak performance, one will need to lock the clock frequency to the boost frequency using nvidia-smi and profile with the ncu --clock-control none option.
I have already used nvidia-smi to pin the clock rates, and I also tried the ncu --clock-control none option.
In my question description, I took the FLOPS computation from the Python file for the SpeedOfLight_Roofline section. But my question is:
According to ncu's results, 2 * sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained corresponds to 14 TFLOPS. But for the same application, the result from DCGM is 19 TFLOPS, and the test application is torch.mm, whose TFLOPS is supposed to be 19.
Is there something wrong with the sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained?
The clock_rate value which I suggested checking earlier will not show the clock frequency used during profiling.
Please check the "SM Frequency" shown on the Nsight Compute Details page (the metric is gpc__cycles_elapsed.avg.per_second).
Hi @Sanjiv.Satoor, after a long time, I did the test again.
On the A100, I profiled a long-running application that takes ncu several hours to process. During the execution, I used watch nvidia-smi -q -d CLOCK to monitor the frequency and found it was already at the maximum 1410 MHz, but in ncu's report the SM Frequency is still about 1.10 GHz. The Single Precision Roofline peak is still about 14 TFLOPS.
I did another test on an A5000: I profiled another application and monitored the frequency. The frequency in nvidia-smi is the maximum 2100 MHz and the SM Frequency in the ncu report is 1.67 GHz. However, the Single Precision Roofline peak is about 27.4 TFLOPS, which matches the A5000 specification of 27.7 TFLOPS.
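Plugging the reported SM Frequency back into the Peak Work formula (my own back-of-the-envelope numbers, using the FP32 lane counts from the datasheets) seems consistent with both charts:

# quick consistency check; the lane counts are datasheet values, not profiler output
def roofline_peak_tflops(fp32_lanes, sm_freq_ghz):
    # Peak Work = (FFMA per cycle) * 2 * (cycles per second)
    return fp32_lanes * 2 * sm_freq_ghz / 1e3

print(roofline_peak_tflops(8192, 1.67))  # A5000: ~27.4 TFLOP/s, matches the chart
print(roofline_peak_tflops(6912, 1.41))  # A100 at boost clock: ~19.5 TFLOP/s
print(roofline_peak_tflops(6912, 1.01))  # A100 at ~1.0 GHz: ~14.0 TFLOP/s

So the A5000 numbers line up, while the ~14 TFLOPS ceiling on the A100 would correspond to an effective clock of roughly 1.0 GHz in this formula, well below the 1410 MHz that nvidia-smi reported.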