Tensor core circuitry and operation

Hi,

How is it that the same tensor cores are able to support MMA at different precisions?

Why does a single MMA instruction take twice as many clock cycles when a larger dtype is used?

Is there a circuit diagram for tensor cores? They are shown as just a solid block in the GPU architecture whitepapers.