Number of floating point operations in one HMMA instruction

Hi, I’ve been playing with the Tensor Cores recently and found something quite confusing.

What is the number of floating-point operations in one HMMA instruction, and how does Nsight Compute collect that?

Here is my test code, which contains only one mma instruction and accesses 128 bytes of data in VRAM.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void
test_mmatf32(float *result) {
  // Per-thread fragments for mma.m16n8k8 (tf32): A holds 4 elements,
  // B holds 2, and the accumulator C/D holds 4.
  float a[4] = {1, 2, 3, 4};
  float b[2] = {5, 6};
  float c[4] = {0, 0, 0, 0};

  __syncthreads();

  // tf32 operands are passed in 32-bit registers; these small integer
  // values are exactly representable in tf32.
  uint32_t const *pA = reinterpret_cast<uint32_t const *>(&a);
  uint32_t const *pB = reinterpret_cast<uint32_t const *>(&b);

  asm volatile("mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
               "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
               : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
               : "r"(pA[0]), "r"(pA[1]), "r"(pA[2]), "r"(pA[3]),
                 "r"(pB[0]), "r"(pB[1]),
                 "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
  __syncthreads();

  // One 4-byte store per thread: 32 threads x 4 B = 128 B written to VRAM.
  result[threadIdx.x] = c[0];
}

int main() {
  float *result;
  cudaMalloc(&result, 128 * sizeof(float));
  test_mmatf32<<<1, 32>>>(result);
  cudaDeviceSynchronize();
  cudaFree(result);
  return 0;
}

The arithmetic intensity [flop/byte] I expected is 16 * 8 * 8 / 128 = 8 (counting each multiply-accumulate as one operation).
However, Nsight Compute reports it as 2 (collected with ncu --set full or ncu --set roofline on an A100-SXM4-80GB, CUDA version 11.7).

So, does ncu count four TF32 floating-point operations as one regular floating-point operation, or does some optimization happen inside the hardware?

How should I read those numbers?

Thanks.

Cu Cui



Hi, @cuic3

Sorry for the late response.

There are many variants of the MMA instruction, and the answer differs per variant and per architecture.

There are metrics for counting the FLOPs:

ncu --query-metrics | grep sm__ops_

sm__ops_path_tensor_src_bf16_dst_fp32               Counter  # of math ops executed in Tensor path with source BF16 and destination FP32
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off  Counter  # of math ops executed in Tensor path with source BF16 and destination FP32, sparsity disabled
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on   Counter  # of math ops executed in Tensor path with source BF16 and destination FP32, sparsity enabled
sm__ops_path_tensor_src_fp16_dst_fp16               Counter  # of math ops executed in Tensor path with source FP16 and destination FP16
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off  Counter  # of math ops executed in Tensor path with source FP16 and destination FP16, sparsity disabled
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on   Counter  # of math ops executed in Tensor path with source FP16 and destination FP16, sparsity enabled
sm__ops_path_tensor_src_fp16_dst_fp32               Counter  # of math ops executed in Tensor path with source FP16 and destination FP32
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off  Counter  # of math ops executed in Tensor path with source FP16 and destination FP32, sparsity disabled
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on   Counter  # of math ops executed in Tensor path with source FP16 and destination FP32, sparsity enabled
sm__ops_path_tensor_src_fp64                        Counter  # of math ops executed in Tensor path with source FP64
sm__ops_path_tensor_src_int1                        Counter  # of math ops executed in Tensor path with source INT1
sm__ops_path_tensor_src_int4                        Counter  # of math ops executed in Tensor path with source INT4
sm__ops_path_tensor_src_int4_sparsity_off           Counter  # of math ops executed in Tensor path with source INT4, sparsity disabled
sm__ops_path_tensor_src_int4_sparsity_on            Counter  # of math ops executed in Tensor path with source INT4, sparsity enabled
sm__ops_path_tensor_src_int8                        Counter  # of math ops executed in Tensor path with source INT8
sm__ops_path_tensor_src_int8_sparsity_off           Counter  # of math ops executed in Tensor path with source INT8, sparsity disabled
sm__ops_path_tensor_src_int8_sparsity_on            Counter  # of math ops executed in Tensor path with source INT8, sparsity enabled
sm__ops_path_tensor_src_tf32_dst_fp32               Counter  # of math ops executed in Tensor path with source TF32 and destination FP32
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off  Counter  # of math ops executed in Tensor path with source TF32 and destination FP32, sparsity disabled
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on   Counter  # of math ops executed in Tensor path with source TF32 and destination FP32, sparsity enabled
