About the FLOPS in the ncu report

Hi, I profiled a simple matrix multiplication in PyTorch, and the FLOPS shown in the ncu report are lower than the theoretical peak FLOPS of the GPU.

The code I tested on an A100 is the following:

import torch

n = 4096
x = torch.ones((n, n), dtype=torch.float32, device="cuda")
y = torch.ones((n, n), dtype=torch.float32, device="cuda")

# run the computation part (results are discarded; we only want the kernels)
for i in range(200):
    if i % 100 == 0:
        print(i)
    torch.mm(x, y)
torch.cuda.synchronize()  # wait for all queued kernels to finish
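For reference, here is the FLOP accounting I am assuming for the benchmark above: one n x n FP32 matmul performs 2 * n**3 FLOPs (one multiply and one add per inner-product term). A quick back-of-the-envelope check in plain Python (no GPU needed) of what the two throughput figures imply per torch.mm call:

```python
# Assumption: 2 * n**3 FLOPs per n x n matmul (multiply + add per term).
n = 4096
flops_per_mm = 2 * n**3  # 137,438,953,472 FLOPs per torch.mm call

# Time one matmul would take at the two throughput figures in question:
t_at_14tflops = flops_per_mm / 14e12      # ~9.8 ms per matmul
t_at_19_5tflops = flops_per_mm / 19.5e12  # ~7.0 ms per matmul
print(flops_per_mm, t_at_14tflops, t_at_19_5tflops)
```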

When I hover over the roofline chart in the ncu report, it says the achieved performance is about 14 TFLOPS, but the peak FP32 performance of the A100 should be about 19 TFLOPS. I checked the source code for the ncu roofline section in sections/SpeedOfLight_Roofline.py: the peak performance is computed as 2 * action.metric_by_name("sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained").as_double().
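If I understand the metric correctly, the .peak_sustained rollup is the maximum number of FFMA thread instructions the SMs can retire per cycle, so 2 * metric gives FLOPs per cycle and still has to be multiplied by a clock rate to become FLOPS. A rough sanity check, assuming the published A100 SXM figures (108 SMs, 64 FP32 FMA lanes per SM, 1410 MHz boost clock); the 1.0 GHz "locked" clock below is purely illustrative, not a value I measured:

```python
# Assumed A100 SXM figures (from the public datasheet, not read from ncu):
sms = 108                    # streaming multiprocessors
ffma_per_cycle_per_sm = 64   # FP32 FMA lanes per SM
peak_sustained = sms * ffma_per_cycle_per_sm  # FFMA instructions/cycle -> 6912

# 2 FLOPs per FMA, times instructions/cycle, times cycles/second:
boost_clock_hz = 1.41e9
peak_at_boost = 2 * peak_sustained * boost_clock_hz  # ~19.5e12 FLOPS

# ncu controls GPU clocks while profiling (clock control defaults to the
# base clock, not boost). At a hypothetical 1.0 GHz the same per-cycle
# peak corresponds to only ~13.8 TFLOPS:
locked_clock_hz = 1.0e9  # illustrative value only
peak_at_locked = 2 * peak_sustained * locked_clock_hz

print(peak_at_boost / 1e12, peak_at_locked / 1e12)
```

So a peak (or achieved) figure below 19.5 TFLOPS could simply reflect a lower clock during profiling rather than a counting error.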

I also ran the same benchmark under dcgmi and was able to reach about 19 TFLOPS there, so I'm confused by the discrepancy between the two tools.

Could you explain what sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained measures, and why the resulting peak is lower than the theoretical FLOPS?

Could anybody help me confirm this? I suspect it might be a bug in NCU.