# Number of floating point operations in one HMMA instruction

Hi, I’ve been playing with Tensor Cores recently and found something quite confusing.

What is the number of floating-point operations in one HMMA instruction, and how does Nsight Compute count them?

Here is my test code, which contains only a single `mma` instruction and accesses 128 bytes of data from VRAM.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

__global__ void test_mmatf32(float *result) {
    float a[4];
    float b[2];
    float c[4];

    a[0] = 1; a[1] = 2;
    a[2] = 3; a[3] = 4;

    b[0] = 5; b[1] = 6;

    c[0] = 0; c[1] = 0; c[2] = 0; c[3] = 0;

    uint32_t const *pA = reinterpret_cast<uint32_t const *>(&a);
    uint32_t const *pB = reinterpret_cast<uint32_t const *>(&b);

    asm volatile("mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
                 "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
                 : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
                 : "r"(pA[0]), "r"(pA[1]), "r"(pA[2]), "r"(pA[3]),
                   "r"(pB[0]), "r"(pB[1]),
                   "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

int main() {
    float *result;
    cudaMalloc(&result, 128 * sizeof(float));
    test_mmatf32<<<1, 32>>>(result);
    return 0;
}
```

The arithmetic intensity [FLOP/byte] I expected is 16 × 8 × 8 / 128 = 8, counting each multiply-accumulate as one operation.
However, Nsight Compute reports 2 (collected with `ncu --set full` or `ncu --set roofline` on an A100-SXM4-80GB, CUDA version 11.7).

So, does `ncu` count four TF32 floating-point operations as one regular floating-point operation, or does some optimization happen inside the hardware?

How do I read those numbers?

Thanks.

Cu Cui