About the FLOPS in the ncu report

Hi, I profiled a simple matrix multiplication in PyTorch, and the FLOPS shown in the ncu report is lower than the theoretical peak FLOPS.

The code I tested on an A100 is the following:

import torch
n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")
# run the computation part
for i in range(200):
    if i % 100 == 0:
        print(i)
    torch.mm(x, y)

When I hover the cursor over the roofline chart in the ncu report, it shows a performance of about 14 TFLOPS, but the peak performance of the A100 should be about 19 TFLOPS. I checked the source code for the ncu roofline section in sections/SpeedOfLight_Roofline.py. The peak performance is computed as 2 * action.metric_by_name("sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained").as_double().

I also used dcgmi to measure the benchmark code above and was able to get 19 TFLOPS, so I am confused by the results.

Could you explain the meaning of sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained and why it gives less than the theoretical FLOPS?

Could anybody help me confirm this? I think it might be a bug in ncu.

Are you seeing the 14TFLOPS value for the “Single Precision Roofline”?
(when hovering the cursor over the “Single Precision” ridge point)

What is the clock_rate? You can find this under Device Attributes on the Session page in the Nsight Compute UI.

Hi @Sanjiv.Satoor ,

Yes.

The clock_rate is 1410000.

You can find the expression used for calculating the “Peak Work” value shown in the single precision roofline chart in “SpeedOfLight_RooflineChart.section”

Peak Work = derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 * sm__cycles_elapsed.avg.per_second

Where:
derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2
(This is a derived metric defined in the same section file. Since an FMA instruction performs two floating-point operations, one multiplication and one addition, the FMA instruction count is multiplied by two.)

Units:

  • Peak Work : FLOP/second
  • sm__cycles_elapsed.avg.per_second : cycles/second
  • sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained : instructions/cycle
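As a rough illustration of the expression above, the arithmetic can be sketched in a few lines of Python. The function name peak_work_flops is hypothetical, and the A100 figures used here (108 SMs with 64 FP32 lanes each) are assumptions taken from public specifications rather than from this thread. The sketch also assumes that the peak_sustained value equals the number of FP32 lanes across the whole GPU.

```python
def peak_work_flops(ffma_per_cycle: float, sm_cycles_per_second: float) -> float:
    """Peak Work in FLOP/s, following the expression quoted from the section file."""
    # Each FFMA counts as two floating-point operations (multiply + add).
    flop_per_cycle = 2.0 * ffma_per_cycle
    return flop_per_cycle * sm_cycles_per_second


# Assumed A100 figures (not stated in this thread): 108 SMs, 64 FP32 lanes per SM.
A100_SM_COUNT = 108
A100_FP32_LANES_PER_SM = 64
a100_ffma_per_cycle = A100_SM_COUNT * A100_FP32_LANES_PER_SM

# 1410 MHz corresponds to the clock_rate of 1410000 reported earlier in the thread.
a100_peak_at_1410mhz = peak_work_flops(a100_ffma_per_cycle, 1.41e9)
print(f"{a100_peak_at_1410mhz / 1e12:.2f} TFLOP/s")
```

Under these assumptions the result lands close to the 19 TFLOPS figure quoted for the A100, which suggests the value in the chart depends mainly on which clock frequency goes into the product.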

Note that for consistent profiling results, Nsight Compute attempts to limit GPU clock frequencies to their base value. Refer to the Clock Control section in the Kernel Profiling Guide (Nsight Compute Documentation).

To get the peak performance, fix the clock frequency to the boost frequency using nvidia-smi and profile with the ncu --clock-control none option.

@Sanjiv.Satoor Thanks for your reply.

I have already used nvidia-smi to pin the clock rates and also tried the ncu --clock-control none option.

In my question, I took the FLOPS computation from the Python file of the SpeedOfLight_Roofline section. My question is this:

According to ncu, the peak derived from 2 * sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained is 14 TFLOPS. But for the same application, DCGM reports 19 TFLOPS, and the test application is torch.mm, which is supposed to reach 19 TFLOPS.

Is there something wrong with sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained?

The clock_rate value that I suggested checking earlier does not show the clock frequency used during profiling.
Please check the “SM Frequency” shown on the Nsight Compute Details page (the metric is gpc__cycles_elapsed.avg.per_second).

@Sanjiv.Satoor Sorry for the late reply.

The SM Frequency is 1.08. If that is the correct frequency, the clock rate during profiling was 1.08 GHz, right? That would explain why the peak is not 19 TFLOPS.
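A quick sanity check of that reasoning, using only the numbers quoted in this thread: since Peak Work is proportional to the SM clock, the 19 TFLOPS figure at 1.41 GHz can be scaled down to 1.08 GHz. The helper name scale_peak_by_clock is hypothetical, and this is only a sketch of the proportionality, not something taken from the ncu section files.

```python
def scale_peak_by_clock(peak_tflops: float, peak_clock_ghz: float,
                        actual_clock_ghz: float) -> float:
    """Scale a peak FLOPS figure linearly with the SM clock frequency."""
    return peak_tflops * actual_clock_ghz / peak_clock_ghz


# Numbers from this thread: ~19 TFLOPS at 1.41 GHz, SM Frequency 1.08 GHz in the report.
expected_tflops = scale_peak_by_clock(19.0, 1.41, 1.08)
print(f"Expected roofline peak at 1.08 GHz: {expected_tflops:.2f} TFLOPS")
```

The scaled value comes out between 14 and 15 TFLOPS, in the same range as the roofline peak shown in the report.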

Interesting. I used nvidia-smi to pin the clock rate and added --clock-control none to ncu. I am not sure what is wrong with the configuration.

Hi @Sanjiv.Satoor , after a long time, I ran the test again.

On the A100, I profiled a long-running application that ncu runs for several hours. During the execution, I used watch nvidia-smi -q -d CLOCK to monitor the frequency and found that it was already at the maximum of 1410 MHz, but in the ncu report the SM Frequency is still about 1.10. The Single Precision Roofline is still about 14 TFLOPS.

I did another test on an A5000, where I profiled another application and monitored the frequency. The frequency in nvidia-smi is the maximum of 2100 MHz and the SM Frequency in the ncu report is 1.67. However, the Single Precision Roofline is about 27.4 TFLOPS, which is close to the A5000 specification of 27.7 TFLOPS.

It is a weird result. Do you have any ideas?
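To make the comparison between the two GPUs explicit, the figures reported above can be put side by side. This is only a sketch over the numbers quoted in this thread (with "about 14" and "about 19" taken as 14.0 and 19.0); the dictionary layout and the helper name ratios are hypothetical, and the snippet does not identify the cause of the difference.

```python
# Figures as reported in this thread (approximate where the posts say "about").
observations = {
    "A100": {"smi_mhz": 1410, "ncu_ghz": 1.10,
             "roofline_tflops": 14.0, "spec_tflops": 19.0},
    "A5000": {"smi_mhz": 2100, "ncu_ghz": 1.67,
              "roofline_tflops": 27.4, "spec_tflops": 27.7},
}


def ratios(obs: dict) -> tuple[float, float]:
    """Return (ncu clock / nvidia-smi clock, roofline peak / spec peak)."""
    clock_ratio = obs["ncu_ghz"] * 1000.0 / obs["smi_mhz"]
    roofline_ratio = obs["roofline_tflops"] / obs["spec_tflops"]
    return clock_ratio, roofline_ratio


for gpu, obs in observations.items():
    clock_ratio, roofline_ratio = ratios(obs)
    print(f"{gpu}: ncu/smi clock = {clock_ratio:.2f}, "
          f"roofline/spec = {roofline_ratio:.2f}")
```

On both GPUs the SM Frequency in the ncu report is roughly 80% of the clock shown by nvidia-smi, yet only the A100 roofline falls well below the specification. One possible reading, stated here as an assumption, is that the two specification figures are quoted at different reference clocks relative to the maximum clock that nvidia-smi reports.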