Hello, I captured a GEMM kernel (m=512, n=128256, k=16384) on an H100 machine using TensorRT. The kernel name is nvjet_hsh_384x8_64x4_2x1_v_bz_TNT, and it executed HGMMA.64x8x16.F32 262668288 times. Could you please explain how this number of executions was calculated?
My understanding is that the number of executions of HGMMA.64x8x16.F32 should be (m x n x k) / (64 x 8 x 16) = 131334144, but the measured count of 262668288 is exactly twice that. Could you please explain where the factor of two comes from?
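
For reference, here is a small sketch of the arithmetic behind my estimate. It assumes each HGMMA.64x8x16 instruction covers exactly one 64 x 8 x 16 block of the m x n x k problem, with no padding or redundant work. That is my assumption, not something I have confirmed, and the variable names are only illustrative.

```python
# Problem size and HGMMA tile shape, taken from the numbers in this post.
m, n, k = 512, 128256, 16384
tile_m, tile_n, tile_k = 64, 8, 16

# Assumption: one instruction per tile_m x tile_n x tile_k block, no padding.
expected = (m * n * k) // (tile_m * tile_n * tile_k)

# Instruction count reported for the captured kernel.
measured = 262668288

# Ratio between what was measured and what the simple formula predicts.
ratio = measured / expected

print(f"expected = {expected}")
print(f"measured = {measured}")
print(f"ratio    = {ratio}")
```

With these inputs the simple formula gives 131334144, so the measured count is larger by a factor of exactly 2.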