On my RTX 2060, when I call “mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16” 3,200,000 times (32 threads/warp) in a loop on one SM, Nsight Compute 2021.3.0.0 reports a utilization of 49.56%. The program runs for 6,467,014 cycles.
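As I understand it, this is already the maximum speed: 2 Tensor Cores/partition x 4 partitions x 64 F16 FMA operations/cycle = 512 operations/cycle per SM. One warp-wide mma instruction needs 16 x 8 x 8 = 1024 multiplications, so 1024 / 512 = 2 cycles per SM per instruction. For 3,200,000 instructions that gives 6,400,000 cycles, which is close to the 6,467,014 cycles I measure.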
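To double-check that arithmetic, here is a minimal Python sketch. The variable names are my own, and the per-Tensor-Core rate (64 F16 FMAs per cycle) and the Tensor Core counts are the assumptions stated above, not values read from the profiler.

```python
# Back-of-envelope check of the expected Tensor Core throughput on one Turing SM.
# Assumptions (taken from the post, not measured): 2 Tensor Cores per partition,
# 4 partitions per SM, 64 F16 FMAs per Tensor Core per cycle.
TENSOR_CORES_PER_PARTITION = 2
PARTITIONS_PER_SM = 4
FMA_PER_TENSOR_CORE_PER_CYCLE = 64

M, N, K = 16, 8, 8            # shape of mma.sync.aligned.m16n8k8
MMA_COUNT = 3_200_000         # warp-wide mma instructions issued in the loop
MEASURED_CYCLES = 6_467_014   # cycle count reported for the run

# Peak F16 FMAs per cycle on one SM under the assumptions above.
fma_per_cycle_per_sm = (
    TENSOR_CORES_PER_PARTITION * PARTITIONS_PER_SM * FMA_PER_TENSOR_CORE_PER_CYCLE
)

# FMAs needed by one warp-wide m16n8k8 instruction.
fma_per_mma = M * N * K

# Minimum cycles per instruction and for the whole loop.
cycles_per_mma = fma_per_mma / fma_per_cycle_per_sm
expected_cycles = MMA_COUNT * cycles_per_mma

# How close the measured run is to that computed minimum.
fraction_of_peak = expected_cycles / MEASURED_CYCLES

print(fma_per_cycle_per_sm, fma_per_mma, cycles_per_mma, expected_cycles)
print(f"{fraction_of_peak:.2%}")
```

Under these assumptions the run comes out at roughly 99% of the computed peak, which is about twice the 49.56% that the profiler reports.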
Could it be that Nsight Compute defines the maximum pipeline utilization incorrectly for Turing?
The Tensor (Int) pipeline, by contrast, does go up to 100%.
Or have I chosen an operation that is too slow, and there are floating-point ones that run at 1 op/cycle?