Number of floating-point operations in one HMMA instruction

Hi, I’ve been playing with Tensor Cores recently and found something quite confusing.

How many floating-point operations does one HMMA instruction count as, and how does Nsight Compute collect that number?

Here is my test code, which executes only one mma instruction and writes 128 bytes (32 threads × 4 B) to global memory.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void
test_mmatf32(float *result) {
  // Per-thread fragments for one warp-wide m16n8k8 mma:
  // A (16x8) uses 4 registers/thread, B (8x8) uses 2,
  // and the C/D accumulators (16x8) use 4.
  float a[4];
  float b[2];
  float c[4];

  a[0] = 1; a[1] = 2;
  a[2] = 3; a[3] = 4;

  b[0] = 5; b[1] = 6;

  c[0] = 0; c[1] = 0; c[2] = 0; c[3] = 0;

  __syncthreads();

  // The tf32 variant takes A and B as 32-bit registers holding tf32 bit
  // patterns; for small exact values like these, the raw f32 bits are fine.
  uint32_t const *pA = reinterpret_cast<uint32_t const *>(&a);
  uint32_t const *pB = reinterpret_cast<uint32_t const *>(&b);

  asm volatile("mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
               "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
               : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
               : "r"(pA[0]), "r"(pA[1]), "r"(pA[2]), "r"(pA[3]),
                 "r"(pB[0]), "r"(pB[1]),
                 "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
  __syncthreads();

  // One float per thread: 32 threads x 4 B = 128 B written to global memory.
  result[threadIdx.x] = c[0];
}

int main() {
  float *result;
  cudaMalloc(&result, 128 * sizeof(float));  // only the first 32 floats are written
  test_mmatf32<<<1, 32>>>(result);
  cudaDeviceSynchronize();  // make sure the kernel has finished before exiting
  cudaFree(result);
  return 0;
}
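
(To sanity-check the output on the host, one could append something like the sketch below to main, before the cudaFree. This is just an illustration I did not profile, since the extra copy adds memory traffic; it needs #include <cstdio>.)

// Hypothetical host-side check, placed before cudaFree(result):
// copy back each thread's c[0] and print it (requires <cstdio>).
float host[32];
cudaMemcpy(host, result, 32 * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < 32; ++i)
  printf("thread %2d: c[0] = %f\n", i, host[i]);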

The arithmetic intensity [flop/byte] I expected is 16 * 8 * 8 / 128 = 8, counting each multiply-accumulate as one operation (or 16 flop/byte if a multiply-accumulate counts as two).
However, Nsight Compute reports 2 (collected with ncu --set full or ncu --set roofline on an A100-SXM4-80GB, CUDA 11.7).
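
(To cross-check the counters, one could also ask ncu for the tensor-pipe instruction count directly. The metric name below is my guess for GA100 and may differ on other architectures — ncu --query-metrics lists what a given chip actually supports; test.cu is an assumed file name.)

# mma with tf32 requires sm_80; nvcc's default output is a.out
nvcc -arch=sm_80 test.cu
# Count executed HMMA (tensor-pipe) instructions for the kernel
ncu --metrics smsp__inst_executed_pipe_tensor_op_hmma.sum ./a.out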

So, is ncu counting four TF32 floating-point operations as one regular floating-point operation, or is some optimization happening inside the hardware?

How do I read those numbers?

Thanks.

Cu Cui