I have a CTA running on a single SM with 4 warps. Each warp needs to execute 64 MMA instructions. Assuming each MMA instruction takes 16 cycles, how many cycles in total will be required to complete these MMA instructions?
Additionally, how does tensor core parallelism work? I know that an SM has four tensor cores, and instructions like m16n16k8, f16f16f16 are used. How many MMA instructions can run simultaneously on the tensor cores?
In my understanding, the four tensor cores in an SM can run in parallel, so up to 4 MMA instructions can run simultaneously per SM. You can verify this by collecting the metric smsp__pipe_tensor_cycles_active.sum and comparing its value with sm__pipe_tensor_cycles_active.sum. These values should be equivalent.
Either metric tells you the total number of cycles the tensor cores are active, summed over all SMs.
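For the cycle count in the first question, here is a rough back-of-envelope sketch, not a hardware-accurate model. It takes the 16-cycle figure from the question at face value and assumes that each of the 4 warps gets its own tensor core, that a warp's MMAs run strictly back to back with no pipelining or overlap, and that there are no stalls on operands or memory. The function name `estimate_mma_cycles` and its parameters are hypothetical, made up for this illustration.

```python
import math


def estimate_mma_cycles(warps, mma_per_warp, cycles_per_mma, tensor_cores):
    """Idealized cycle estimate for MMA work on one SM.

    Assumptions (not verified against real hardware): each tensor core
    serves one warp at a time, MMAs within a warp are strictly serialized,
    and nothing else stalls the pipeline.
    """
    # If there are more warps than tensor cores, the extra warps wait
    # for a later "round" on an already-used tensor core.
    rounds = math.ceil(warps / tensor_cores)
    return rounds * mma_per_warp * cycles_per_mma


# Numbers taken from the question: 4 warps, 64 MMAs each, 16 cycles per MMA.
parallel = estimate_mma_cycles(4, 64, 16, tensor_cores=4)
serial = estimate_mma_cycles(4, 64, 16, tensor_cores=1)
print(parallel, serial)
```

Under these assumptions the four warps proceed side by side, so the estimate is 64 * 16 = 1024 cycles; if all MMAs had to share a single tensor core it would be 4 * 64 * 16 = 4096 cycles. Real kernels will deviate from both figures, which is why checking the tensor pipe metrics above is the more reliable approach.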