Hi All,
I was hoping someone could clarify my understanding of whether there is a correlation between pipe throttles and tensor pipe throughput. I have a GEMM kernel, and in the summary section ncu reports math pipe throttles as follows:
On average, each warp of this workload spends 4.4 cycles being stalled waiting for the execution pipe to be available. This stall occurs when all active warps execute their next instruction on a specific, oversubscribed math pipeline. Try to increase the number of active warps to hide the existent latency or try changing the instruction mix to utilize all available pipelines in a more balanced way. This stall type represents about 47.6% of the total average of 9.2 cycles between issuing two instructions.
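As a quick sanity check on the figures in that message, here is a minimal Python sketch that recomputes the stall share from the two rounded values ncu displays. The variable names are my own, not ncu metric names. The result is about 48%, close to the reported 47.6%; I assume the small gap comes from ncu rounding the displayed cycle counts.

```python
# Values copied from the ncu summary quoted above (rounded as displayed by ncu).
math_throttle_cycles = 4.4   # avg cycles per warp stalled waiting for the math pipe
cycles_between_issues = 9.2  # total avg cycles between issuing two instructions

# Share of the issue-to-issue gap attributed to the math pipe throttle stall.
stall_share = math_throttle_cycles / cycles_between_issues

# Cycles of the gap that ncu attributes to everything else.
other_cycles = cycles_between_issues - math_throttle_cycles

print(f"math pipe throttle share: {stall_share:.1%}")
print(f"cycles not attributed to this stall: {other_cycles:.1f}")
```

So roughly half of the gap between two issued instructions is attributed to this one stall reason, and the other half (about 4.8 cycles) comes from other sources.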
In the source view, I switch to `stall_math` and `stall_math` (not issued), and as expected, ncu displays its yellow triangle in front of the HMMA instruction (as seen in the attached screenshot).
So I have 6 ldsm instructions in total followed by 64 HMMAM1688 instructions per warp (I have 8 warps in total), in first-in-first-out order; that is, I issue all my ldsm instructions and then issue all my mma instructions. After each pair of ldsm instructions, I can perform 8 MMAs, so I hope to overlap the mma and ldsm instructions reasonably well (i.e., while the MMAs enabled by the first pair of ldsm instructions run, the remaining ldsm instructions can complete in the background, and the loop continues).
I now understand that the mma instruction queue may not be deep enough to accommodate 64 mma instructions, and that the warp therefore does not issue the instruction.
However, I am not seeing tensor core utilization peak at 100% either. If my warps are waiting for the pipe to become available, I would expect the tensor cores to achieve peak FLOPs per cycle. Instead, they only reach around 80% (in contrast, cuBLASLt achieves 97% on the same hardware).
I was wondering whether my expectation is correct, what the correlation between the two is, and what I can do to achieve 100% tensor core utilization.
Thanks a lot!

