Lots of further links:

I once had a similar issue, but never found a solution:

On my RTX 2060.
When I call “mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16” 3,200,000 times (32 Threads/Warp) in a loop on one SM, I get a utilization of 49.56% inside Nsight Compute 2021.3.0.0. The program runs 6,467,014 cycles.
As I understand it, this is the maximum speed: 2 Tensor Cores/Partition x 4 Partitions x 64 F16 FMA Operations/Cycle = 512 Operations/cycle, and one mma needs 16x8x8 = 1024 multiplications. So 1024 / 512 = 2 cycles per SM per warp-wide mma instruction.
Could it be that the defined maximum pip…
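The arithmetic in that quote can be checked with a small Python sketch. It only uses the figures quoted above; reading the loop count as 3,200,000 warp-wide mma instructions issued back to back is my assumption, and the variable and function names are made up for illustration.

```python
# Figures taken from the quoted post (RTX 2060, one SM).
TENSOR_CORES_PER_PARTITION = 2
PARTITIONS_PER_SM = 4
FMA_PER_TENSOR_CORE_PER_CYCLE = 64

MMA_M, MMA_N, MMA_K = 16, 8, 8      # shape of mma.sync.aligned.m16n8k8
NUM_MMA_INSTRUCTIONS = 3_200_000    # assumption: one warp-wide mma per loop iteration
MEASURED_CYCLES = 6_467_014


def cycles_per_mma(m, n, k, fma_per_cycle):
    """Minimum cycles for one warp-wide mma at the given FMA rate per SM."""
    return (m * n * k) / fma_per_cycle


fma_per_sm_per_cycle = (
    TENSOR_CORES_PER_PARTITION * PARTITIONS_PER_SM * FMA_PER_TENSOR_CORE_PER_CYCLE
)
min_cycles_per_mma = cycles_per_mma(MMA_M, MMA_N, MMA_K, fma_per_sm_per_cycle)
expected_cycles = NUM_MMA_INSTRUCTIONS * min_cycles_per_mma

# Fraction of the computed peak that the measured run achieves.
fraction_of_peak = expected_cycles / MEASURED_CYCLES

print(f"FMA/cycle/SM:      {fma_per_sm_per_cycle}")
print(f"cycles per mma:    {min_cycles_per_mma}")
print(f"expected cycles:   {expected_cycles:,.0f}")
print(f"fraction of peak:  {fraction_of_peak:.2%}")
```

Under these assumptions the loop already runs at roughly 99% of the computed peak, so the reported 49.56% looks about a factor of two too low.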
And this question is also related:

Hi.
I have profiled a nn.linear(1408, 1408) layer in nsight compute.
(input shape: (256, 6, 1408), output shape: (256, 6, 1408))
I used fp32 for the first profiling run and it gave 73.32% SM throughput and 63.44% FMA pipe utilization (which seems to utilize the compute units well…).
But when I used tf32 for the same kernel (added torch.backends.cuda.matmul.allow_tf32 = True and
torch.backends.cudnn.allow_tf32 = True), SM throughput went down to 48.22%. Also, tensor core utilization in Comput…
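For a sense of scale, the amount of work in that layer can be estimated from the shapes in the quote. This is a rough sketch, assuming the usual convention of 2 FLOPs per multiply-add and ignoring the bias add; the function name is hypothetical.

```python
from math import prod


def linear_flops(input_shape, out_features):
    """Approximate FLOPs of a dense linear layer (bias ignored).

    All leading dimensions of the input are treated as batch rows,
    and each multiply-add counts as 2 FLOPs.
    """
    *batch_dims, in_features = input_shape
    rows = prod(batch_dims)
    return 2 * rows * in_features * out_features


# Shapes from the quoted post: nn.linear(1408, 1408), input (256, 6, 1408).
flops = linear_flops((256, 6, 1408), 1408)
print(f"{flops:,} FLOPs per forward pass")
```

Dividing this number by the kernel duration reported by the profiler gives achieved FLOP/s, which can be compared between the fp32 and tf32 runs independently of the percentage-of-peak figures.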
There was an issue (possibly fixed by now) with the stored Tensor Core rooflines for different GPU architectures; basically, the rooflines were only correct for GV100 (Volta) GPUs:

or

I want to build a roofline model for my kernels. So I launch the ncu with the command
ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile
The roofline set collects enough data to build the roofline model. But I can’t figure out the meaning of each metric clearly.
The Compute (SM) Throughput is collected by the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which is 0.64%. And I think it is the percentage of …
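For reference, the classic roofline model boils down to two numbers per kernel: the arithmetic intensity (FLOPs per byte moved) and the attainable performance, min(peak compute, intensity x peak bandwidth). A minimal sketch follows; the peak values and kernel counts are made-up placeholders, not measurements from any GPU or from the quoted run.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops_per_s(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Roofline bound: capped by either the compute peak or memory bandwidth."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


# Placeholder values for illustration only (assumptions, not measured data).
PEAK_FLOPS_PER_S = 10e12   # hypothetical compute peak
PEAK_BYTES_PER_S = 500e9   # hypothetical memory bandwidth
kernel_flops = 4e9         # hypothetical FLOPs executed by the kernel
kernel_bytes = 2e9         # hypothetical bytes moved by the kernel

ai = arithmetic_intensity(kernel_flops, kernel_bytes)
bound = attainable_flops_per_s(ai, PEAK_FLOPS_PER_S, PEAK_BYTES_PER_S)

# Intensity at which the memory roof meets the compute roof.
ridge_point = PEAK_FLOPS_PER_S / PEAK_BYTES_PER_S
is_memory_bound = ai < ridge_point

print(f"arithmetic intensity: {ai} FLOP/byte")
print(f"attainable:           {bound:.3e} FLOP/s")
print(f"memory bound:         {is_memory_bound}")
```

A kernel whose intensity lies left of the ridge point is limited by bandwidth, one to the right by compute; the percentage-of-peak metrics only tell you how close you are to one of those roofs.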
Detailed counters like the ones discussed in these threads can help you calculate exact FLOPs, but only if you know some details of your instructions; for completely unknown code they are not enough to deduce FLOPs:

I’m trying to get the FLOPs of a DNN model using Nsight Compute. If I don’t use Tensor Cores, I can count the ffma, fmul, and fadd instructions to get the FLOPs. But if I use Tensor Cores, can I use the counter to calculate the FLOPs of the model?
or

I am using H100 GPU 80GB DRAM. My Matrix-Matrix multiplication operation is using HGMMA to do that as seen in Nsight Compute Instruction stats section.
Is there a way to get total Floating point operations using any counter? E.g. sm__sass_inst_executed_op_shared_gmma.sum [inst].
or

Hi, @cuic3
Sorry for the late response.
There are too many variants of the MMA instruction and the answer differs per variant and per architecture.
There are metrics for calculating the FLOPs.
ncu --query-metrics | grep sm__ops_
sm__ops_path_tensor_src_bf16_dst_fp32 Counter # of math ops executed in Tensor path with source BF16 and
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off Counter …
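To illustrate why instruction details matter when turning counts into FLOPs: a scalar FFMA is 2 FLOPs per thread, while one warp-wide mma of shape m x n x k performs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs. The sketch below is hedged accordingly: the counts are hypothetical inputs you would read from the profiler yourself, dense (non-sparse) mma is assumed, and the function names are invented for illustration.

```python
def scalar_fp32_flops(ffma, fmul, fadd):
    """FLOPs from thread-level scalar instruction counts.

    An FFMA is a fused multiply-add and counts as 2 FLOPs;
    FMUL and FADD count as 1 FLOP each.
    """
    return 2 * ffma + fmul + fadd


def mma_flops(m, n, k):
    """FLOPs of one warp-wide dense mma of shape m x n x k."""
    return 2 * m * n * k


def tensor_flops(mma_counts):
    """Total FLOPs given {(m, n, k): warp-level instruction count}."""
    return sum(count * mma_flops(*shape) for shape, count in mma_counts.items())


# Hypothetical example: 3,200,000 warp-wide m16n8k8 instructions.
total = tensor_flops({(16, 8, 8): 3_200_000})
print(f"{total:,} FLOPs")
```

This only works when the shape of every mma variant in the kernel is known, which is why a raw instruction count alone is not enough for unknown code.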