Hi, I’ve been playing with Tensor Cores recently and found something quite confusing. How many floating-point operations does one HMMA instruction count for, and how does Nsight Compute collect that number?
Here is my test code, which executes a single mma instruction and writes 128 bytes of data to global memory (VRAM).
#include <cstdint>
#include <cuda_runtime.h>

__global__ void
test_mmatf32(float *result) {
    // Per-thread fragments for one m16n8k8 TF32 MMA:
    // 4 registers for A, 2 for B, 4 for the accumulator C/D.
    float a[4];
    float b[2];
    float c[4];
    a[0] = 1; a[1] = 2;
    a[2] = 3; a[3] = 4;
    b[0] = 5; b[1] = 6;
    c[0] = 0; c[1] = 0; c[2] = 0; c[3] = 0;
    __syncthreads();
    // The inputs are raw FP32 bit patterns; these small integers are exactly
    // representable in TF32, so the explicit cvt.rna.tf32.f32 conversion is
    // skipped in this test.
    uint32_t const *pA = reinterpret_cast<uint32_t const *>(&a);
    uint32_t const *pB = reinterpret_cast<uint32_t const *>(&b);
    asm volatile("mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
                 "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
                 : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
                 : "r"(pA[0]), "r"(pA[1]), "r"(pA[2]), "r"(pA[3]),
                   "r"(pB[0]), "r"(pB[1]),
                   "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
    __syncthreads();
    // One float per thread: 32 threads * 4 B = 128 B written to global memory.
    result[threadIdx.x] = c[0];
}

int main() {
    float *result;
    cudaMalloc(&result, 128 * sizeof(float));
    test_mmatf32<<<1, 32>>>(result);
    cudaDeviceSynchronize();  // make sure the kernel finishes before exit
    cudaFree(result);
    return 0;
}
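For reference, I build and profile roughly like this (sm_80 for the A100; the file name is just my local one):

nvcc -arch=sm_80 -o test_mma test_mma.cu
ncu --set roofline ./test_mma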
The arithmetic intensity [flop/byte] I expected would be 16 * 8 * 8 / 128 = 8: one m16n8k8 MMA performs 16 * 8 * 8 = 1024 multiply-accumulates, and the kernel writes 32 threads * 4 B = 128 B to global memory, so 1024 / 128 = 8 flop/byte (counting each multiply-accumulate as one flop).
However, Nsight Compute suggests that it is 2 (collected with ncu --set full or ncu --set roofline on an A100-SXM4-80GB, CUDA 11.7).
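In case it helps narrow things down, the tensor-pipe counters can also be queried directly. I am not sure these are exactly the metrics the roofline section derives its flop count from, so treat the names as a guess taken from ncu --query-metrics (they may differ by GPU or ncu version):

ncu --metrics sm__inst_executed_pipe_tensor.sum,sm__ops_path_tensor_src_tf32_dst_fp32.sum ./test_mma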
So, did ncu count four TF32 floating-point operations as one regular floating-point operation, or did some optimization happen inside the hardware? How should I read those numbers?
Thanks.
Cu Cui