The tensor core (TC) unit of cc7.0 devices was designed to behave that way, for at least one of the supported hardware paths. “Why is it that way?” questions can be difficult to answer. With a bit of searching, you will find that the performance characteristics of that particular tensor op form vary significantly depending on GPU architecture, so it seems clear to me that the GPU designers' ideas changed as GPU development and TC development progressed.
The calculation produces an m8n8k4 matrix-matrix multiply (four of them, actually). There is only one correct behavior for that statement. Beyond that, I don’t know of any detailed specifications for TC unit behavior.
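To make "only one correct behavior" concrete, here is a rough host-side reference for the arithmetic. It is a sketch under stated assumptions, not a description of how the hardware does it: I am assuming the usual multiply-accumulate form D = A*B + C with A being 8x4, B being 4x8, and C and D being 8x8. The names `mma_m8n8k4_ref` and `four_mma_ref` are mine, for illustration only. It uses plain `float` throughout, so it does not model fp16 inputs, rounding order, or how matrix elements are distributed across the threads of a warp.

```cpp
#include <array>
#include <cstddef>

// Shape of one m8n8k4 operation: D (8x8) = A (8x4) * B (4x8) + C (8x8).
constexpr std::size_t M = 8, N = 8, K = 4;
// The text above mentions four such multiplies; treated here as independent.
constexpr std::size_t kNumOps = 4;

using MatA  = std::array<std::array<float, K>, M>;  // 8 rows, 4 columns
using MatB  = std::array<std::array<float, N>, K>;  // 4 rows, 8 columns
using MatCD = std::array<std::array<float, N>, M>;  // 8 rows, 8 columns

// Reference result for a single m8n8k4 multiply-accumulate.
MatCD mma_m8n8k4_ref(const MatA& a, const MatB& b, const MatCD& c) {
    MatCD d{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = c[i][j];  // start from the accumulator input
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[i][k] * b[k][j];
            }
            d[i][j] = acc;
        }
    }
    return d;
}

// Hypothetical helper: apply the reference to four independent operand sets.
std::array<MatCD, kNumOps> four_mma_ref(const std::array<MatA, kNumOps>& a,
                                        const std::array<MatB, kNumOps>& b,
                                        const std::array<MatCD, kNumOps>& c) {
    std::array<MatCD, kNumOps> d{};
    for (std::size_t q = 0; q < kNumOps; ++q) {
        d[q] = mma_m8n8k4_ref(a[q], b[q], c[q]);
    }
    return d;
}
```

Whatever the TC unit does internally, the values it returns should agree with a reference like this, up to the precision of the data types involved. That is about as far as a behavioral specification goes, as far as I know.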