Hi, I profiled a simple matrix multiplication in PyTorch, and the FLOPS I get from the ncu report is lower than the theoretical peak FLOPS.
The code I tested on an A100 is the following:
import torch

n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")

# run the computation part
for i in range(200):
    if i % 100 == 0:
        print(i)
    torch.mm(x, y)
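For reference, a host-side timing check along these lines (just a sketch, assuming the same 4096x4096 FP32 matmul and the usual 2*n^3 FLOP count per matmul) gives the achieved TFLOP/s independently of the profilers:

import torch

n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")

# warm-up so kernel selection/caching does not skew the timing
for _ in range(10):
    torch.mm(x, y)
torch.cuda.synchronize()

iters = 200
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    torch.mm(x, y)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3  # elapsed_time() returns milliseconds
flops = 2 * n**3 * iters                 # 2*n^3 FLOPs per n x n matmul
print(f"achieved ~{flops / seconds / 1e12:.2f} TFLOP/s")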
When I hover over the roofline chart in the ncu report, it says the performance is about 14 TFLOPS, but the peak performance of the A100 should be about 19 TFLOPS. I checked the source code for the ncu roofline section in sections/SpeedOfLight_Roofline.py; the peak performance is 2 * action.metric_by_name("sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained").as_double().
Actually, I used dcgmi to test the benchmark code above and was able to get the 19 TFLOPS, so I'm confused by the results.
Could you explain the meaning of sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained and why it is less than the theoretical FLOPS?
You can find the expression used for calculating the "Peak Work" value shown in the single-precision roofline chart in "SpeedOfLight_RooflineChart.section":
Peak Work = derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 * sm__cycles_elapsed.avg.per_second
Where: derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2 = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2
(This is a derived metric defined in the same section file. Since an FMA instruction performs two floating-point operations, one multiplication and one addition, the FMA instruction count is multiplied by two.)
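As a rough sanity check of that formula (back-of-the-envelope numbers taken from the A100 datasheet, not from the section file), the per-cycle FFMA rate and the clock determine the reported peak:

# A100 datasheet values (assumption for this sketch): 108 SMs, 64 FP32 FMA lanes per SM
sms = 108
fp32_fma_lanes_per_sm = 64
ffma_per_cycle = sms * fp32_fma_lanes_per_sm  # ~ peak_sustained value = 6912 FMA/cycle
flop_per_cycle = 2 * ffma_per_cycle           # derived__..._x2 = 13824 FLOP/cycle

boost_clock_hz = 1.41e9   # A100 boost clock
lower_clock_hz = 1.0e9    # example of a lower clock observed during profiling

print(flop_per_cycle * boost_clock_hz / 1e12)  # ~19.5 TFLOP/s
print(flop_per_cycle * lower_clock_hz / 1e12)  # ~13.8 TFLOP/s

The reported peak therefore scales linearly with sm__cycles_elapsed.avg.per_second, so a lower clock during metric collection directly lowers the roofline ceiling.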
To get the peak performance, one will need to lock the clock frequency to the boost frequency using nvidia-smi and profile with the ncu --clock-control none option.
I have already used nvidia-smi to pin the clock rates, and I also tried the ncu --clock-control none option.
In my question description, I took the FLOPS computation from the Python file for the SpeedOfLight_Roofline section. But my question is:
According to ncu's results, 2 * sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained corresponds to 14 TFLOPS. But for the same application, the result from DCGM is 19 TFLOPS, and the test application is torch.mm, whose TFLOPS is supposed to be 19.
Is there something wrong with the sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained?
The clock_rate value which I suggested checking earlier will not show the clock frequency used during profiling.
Please check the "SM Frequency" shown on the Nsight Compute Details page (the metric is gpc__cycles_elapsed.avg.per_second).
Hi @Sanjiv.Satoor, after a long time, I did the test again.
On the A100, I profiled a long-running application that takes ncu several hours to process. During the execution, I used watch nvidia-smi -q -d CLOCK to monitor the frequency and found it was already at the maximum 1410 MHz, but in ncu's report the SM Frequency is still about 1.10 GHz. The Single Precision Roofline peak is still about 14 TFLOPS.
I did another test on an A5000: I profiled another application and monitored the frequency. The frequency in nvidia-smi is the maximum 2100 MHz and the SM Frequency in the ncu report is 1.67 GHz. However, the Single Precision Roofline peak is about 27.4 TFLOPS, which matches the A5000 specification of 27.7 TFLOPS.
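Plugging the reported SM Frequency back into the Peak Work formula (my own back-of-the-envelope numbers, using the FP32 lane counts from the datasheets) seems consistent with both charts:

# quick consistency check; the lane counts are datasheet values, not profiler output
def roofline_peak_tflops(fp32_lanes, sm_freq_ghz):
    # Peak Work = (FFMA per cycle) * 2 * (cycles per second)
    return fp32_lanes * 2 * sm_freq_ghz / 1e3

print(roofline_peak_tflops(8192, 1.67))  # A5000: ~27.4 TFLOP/s, matches the chart
print(roofline_peak_tflops(6912, 1.41))  # A100 at boost clock: ~19.5 TFLOP/s
print(roofline_peak_tflops(6912, 1.01))  # A100 at ~1.0 GHz: ~14.0 TFLOP/s

So the A5000 numbers line up, while the ~14 TFLOPS ceiling on the A100 would correspond to an effective clock of roughly 1.0 GHz in this formula, well below the 1410 MHz that nvidia-smi reported.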