Hi,
How is it that the same tensor cores are able to support MMA at different precisions?
Why does a single MMA instruction take twice as many clock cycles when a larger dtype is used?
Is there a circuit diagram for tensor cores? They are shown as just a solid block in the GPU architecture whitepapers.