Hi, I profiled a simple matrix multiplication in PyTorch, and the FLOPS reported by ncu are lower than the theoretical peak FLOPS.
The code I tested on an A100 is the following:
    import torch

    n = 4096
    x = torch.ones((n, n), dtype=torch.float32, device="cuda")
    y = torch.ones((n, n), dtype=torch.float32, device="cuda")

    # run the computation part
    for i in range(200):
        if i % 100 == 0:
            print(i)
        torch.mm(x, y)
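For reference, here is a small sketch of how much work one of these multiplications represents, assuming the usual convention of counting each fused multiply-add as 2 floating-point operations. The helper name `matmul_flops` is only for illustration and is not part of PyTorch or ncu.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for an (n x n) @ (n x n) product, counting one FMA as 2 FLOPs."""
    # Each of the n*n output elements needs n multiply-adds.
    return 2 * n * n * n


if __name__ == "__main__":
    flops = matmul_flops(4096)
    print(f"{flops} FLOPs per torch.mm call")  # 137438953472
    # Time one call would take if the GPU sustained a given rate (in FLOP/s).
    for rate in (14e12, 19e12):
        print(f"at {rate / 1e12:.0f} TFLOPS: {flops / rate * 1e3:.2f} ms per call")
```

Dividing this count by the measured kernel duration gives an achieved FLOPS figure that can be compared against whatever peak the tool reports.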
When I hover over the roofline chart in the ncu report, it says the performance is about 14 TFLOPS. But the peak FP32 performance of the A100 should be about 19 TFLOPS. I checked the source code of the ncu roofline section in
sections/SpeedOfLight_Roofline.py. The peak performance there is computed as
2 * action.metric_by_name("sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained").as_double().
Actually, when I use dcgmi to measure the same benchmark code, I am able to get about 19 TFLOPS. So I’m confused by the ncu results.
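To show where my 19 TFLOPS expectation comes from, here is a rough sketch of the arithmetic. It assumes the published A100 figures (108 SMs, 64 FP32 lanes per SM, 1410 MHz boost clock) and assumes that a per-cycle FFMA peak is turned into FLOPS by multiplying by 2 and by the SM clock. That last step is my own reading, not something I have confirmed in the ncu source, and the function name is made up for illustration.

```python
# Published A100 figures (assumed here, not read from the device).
NUM_SMS = 108
FP32_LANES_PER_SM = 64
BOOST_CLOCK_HZ = 1410e6


def peak_fp32_tflops(sm_clock_hz: float) -> float:
    """Peak FP32 TFLOPS at a given SM clock, counting one FFMA as 2 FLOPs."""
    # Thread-level FFMA instructions the whole GPU can issue per cycle.
    ffma_per_cycle = NUM_SMS * FP32_LANES_PER_SM
    return 2 * ffma_per_cycle * sm_clock_hz / 1e12


if __name__ == "__main__":
    print(f"{peak_fp32_tflops(BOOST_CLOCK_HZ):.2f} TFLOPS at boost clock")  # 19.49
```

Under this model the peak scales linearly with the SM clock, so any tool that evaluates it at a clock below 1410 MHz would report a proportionally lower number.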
Could you explain the meaning of
sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained and why the resulting peak is lower than the theoretical FLOPS?