There is potentially a huge amount of information to cover here; I won’t be able to cover all of it, and some research on your part may be necessary.
Furthermore, I don’t believe NVIDIA documents tensor core (TC) specifics to the level of detail that might be needed to answer every possible question.
The GPU is broken into SMs (streaming multiprocessors). In modern GPUs, each SM has multiple sub-partitions (SMSPs). Warps are statically distributed to the SMSPs, and each SMSP has its own collection of execution resources, including functional units, warp schedulers, a register file, etc.
As already discussed, the CUDA C++ intrinsic TC op here (wmma::mma_sync...) seems to be decomposed by the compiler into two HMMA SASS instructions. Those instructions appear one after the other in the SASS code, and therefore in the instruction stream.
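For reference, a minimal kernel using that intrinsic might look like the following. This is a sketch: the m16n16k16 tile shape with `half` inputs and `float` accumulation is just one common instantiation, and the exact HMMA decomposition you see will depend on the architecture and compiler version.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matrix-multiply-accumulate on tensor cores.
// The single wmma::mma_sync call below is what the compiler lowers into
// (in the case discussed) two HMMA SASS instructions.
__global__ void wmma_example(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Inspecting the SASS (e.g. with `cuobjdump -sass`) is how you would confirm the two-HMMA decomposition on your particular GPU.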
The warp schedulers in a modern SMSP are not dual-issue capable; therefore, in a single clock cycle, a scheduler can issue only a single instruction. Furthermore, this is not an out-of-order machine: in any particular cycle, only the “topmost” instruction in a warp’s stream is eligible to issue.
As a mental model, all functional units in an SM are pipelined: in any given clock cycle, each can accept a new instruction of the type/category that it services. (Yes, there are exceptions to this. It’s a mental model, and it is widely true, not universally true. For example, some combinations of GPU/functional unit/instruction type may only be able to accept an instruction every other clock cycle.)
So when we have two HMMA instructions at different points in the instruction stream, it stands to reason that they do not necessarily require 2 TC units in order to issue. They must be issued in separate clock cycles, and based on the pipeline description already given, in those separate cycles the 2 HMMA instructions could be issued to the same TC unit: one gets issued in one clock cycle, and the other in some subsequent cycle.
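One way to probe this empirically is to time a long unrolled run of `mma_sync` ops from a single warp using `clock64()`. This is a sketch, not a verified benchmark; `clock64()` granularity and scheduler behavior add noise. Note the use of several independent accumulators: a single dependent accumulator chain would measure pipeline latency rather than issue/dispatch rate.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Sketch: time REPS mma_sync ops issued by one warp. Rotating through 4
// independent accumulators means successive HMMAs need not wait on each
// other's results, so the measurement approaches the issue rate.
__global__ void hmma_issue_probe(const half *a, const half *b,
                                 float *c, long long *elapsed)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc[4];

    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    for (int i = 0; i < 4; ++i) wmma::fill_fragment(acc[i], 0.0f);

    const int REPS = 256;
    long long t0 = clock64();
    #pragma unroll
    for (int i = 0; i < REPS; ++i)
        wmma::mma_sync(acc[i % 4], a_frag, b_frag, acc[i % 4]);
    long long t1 = clock64();

    // Store results so the compiler cannot eliminate the loop.
    for (int i = 0; i < 4; ++i)
        wmma::store_matrix_sync(c + i * 256, acc[i], 16, wmma::mem_row_major);
    if (threadIdx.x == 0) *elapsed = t1 - t0;
}
```

Dividing `elapsed` by `REPS` approximates cycles per warp-level op; dividing further by the number of HMMA instructions per op (here, apparently 2) approximates the per-HMMA issue interval.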
And I am fairly convinced that is what happens here. I haven’t studied all the extant reverse-engineering papers, and I’m not suggesting that I can quote chapter and verse of NVIDIA documentation that explains exactly the mechanics by which TC instructions may be issued to TC functional units. If someone wants to correct me, please do so.
Your existing approach seems a reasonable path to me. I don’t have other suggestions. I have not studied your code in great detail, nor have I thought about this question in great detail. But with a back-to-back sequence of HMMA instructions, I would expect the dispatch rate for that sequence to be consistent with a calculation along the lines you have shown. I haven’t tried to verify your calculations, either. I would encourage you to take note of the two comments I gave already. In particular, although it comes pretty close, you have not achieved a long back-to-back sequence of HMMA instructions.
I haven’t studied the breakdown of flops to HMMA instructions carefully. If the TC unit can deliver 512 FMA ops/clk, and the given HMMA instruction (or the governing C++ intrinsic) implies more than 512 FMA ops per instruction, then clearly those instructions cannot be issued back-to-back, one per cycle. I haven’t worked through all those calculations, but this could be a factor that impacts the observed “issue rate,” which is what your code seems to determine. For example, if the observed issue rate is lower by a factor of 2, but the delivered throughput is still consistent with peak, that would indicate the instruction simply cannot be issued back-to-back. AFAIK these sorts of considerations are undocumented by NVIDIA.