How to measure the FLOPs of a CUDA kernel with Nsight Compute on an A100 GPU?

Hello, All.

I’m trying to get the real number of floating-point operations of some kernels on an A100 80GB GPU. I know that some new metrics have been added to Nsight Compute, such as sm__ops_path_tensor_src_fp16_dst_fp32.

For a matrix multiplication <M,K> * <K,N> = <M,N>, the theoretical FLOP count is 2*M*K*N.

So for the matrix multiplication <1,5120> * <5120,6912> = <1,6912>, I should get 70,778,880 = 2*1*5120*6912 FLOPs.

But in ncu the metric value (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) is 4,529,848,320 (64x the theoretical FLOPs), and the HMMA instruction count is 1,105,920 (the theoretical FLOPs / 64).

Even more confusing, when I calculate <2,5120> * <5120,6912> = <2,6912>, I still get the same metric and HMMA values as before in ncu, which is different from the theoretical doubling.
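
For reference, this is the arithmetic I use to compare the numbers (just a small sketch of the 2*M*K*N formula above against the value ncu reports):

```python
def theoretical_flops(m, k, n):
    # 2*M*K*N FLOPs for an <M,K> x <K,N> GEMM (one multiply + one add per MAC)
    return 2 * m * k * n

metric = 4_529_848_320  # smsp__ops_path_tensor_src_fp16_dst_fp32.sum reported by ncu

for m in (1, 2):
    t = theoretical_flops(m, 5120, 6912)
    print(m, t, metric / t)  # m=1 -> 64.0x, m=2 -> 32.0x
```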

I tested more cases, shown in the following table:

| Program | Theoretical FLOPs | HMMA Inst. Num. (warp-level) | Tensor OPs (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) |
| --- | --- | --- | --- |
| 1*5120*6912 GEMM | 70,778,880 = 2*1*5120*6912 | 1,105,920 | 4,529,848,320 (64x) |
| 2*5120*6912 GEMM | 141,557,760 = 2*2*5120*6912 | 1,105,920 | 4,529,848,320 (32x) |
| 4*5120*6912 GEMM | 283,115,520 = 2*4*5120*6912 | 1,105,920 | 4,529,848,320 (16x) |
| 8*5120*6912 GEMM | 566,231,040 = 2*8*5120*6912 | 1,105,920 | 4,529,848,320 (8x) |
| 16*5120*6912 GEMM | 1,132,462,080 = 2*16*5120*6912 | 1,105,920 | 4,529,848,320 (4x) |
| 32*5120*6912 GEMM | 2,264,924,160 = 2*32*5120*6912 | 1,105,920 | 4,529,848,320 (2x) |
| 64*5120*6912 GEMM | 4,529,848,320 = 2*64*5120*6912 | 1,105,920 | 4,529,848,320 (1x) |
| 128*5120*6912 GEMM | 9,059,696,640 = 2*128*5120*6912 | 3,317,760 | 13,588,544,960 (1.5x, other kernel function) |
| 256*5120*6912 GEMM | 18,119,393,280 = 2*256*5120*6912 | 4,423,680 | 18,119,393,280 (1x) |
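
For what it's worth, the metric and the HMMA count track each other exactly if each warp-level HMMA.16816 is counted as 16*8*16*2 = 4096 FLOPs (a quick sanity check of the table, not an official formula):

```python
# m16n8k16: 16*8*16 = 2048 multiply-adds = 4096 FLOPs per warp-level instruction
flops_per_hmma_16816 = 16 * 8 * 16 * 2

print(1_105_920 * flops_per_hmma_16816)  # 4,529,848,320 -> matches the 1..64 rows
print(4_423_680 * flops_per_hmma_16816)  # 18,119,393,280 -> matches the 256 row
```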

So I have the following questions:

  1. How does Nsight Compute obtain the metric value (sm__ops_path_tensor_…)? Is it calculated from other metrics or read directly from hardware counters?
  2. How should I understand the above data? It seems to be related to Nsight Compute's measurement method or to some feature of the Tensor Cores.
  3. If this difference is related to the Tensor Cores, which specifications might it be related to?

The following is my test code:

```python
import torch
import torch.nn.functional as F

def env_init():
    # Use the second GPU of my machine and default to fp16 tensors.
    torch.set_default_device('cuda:1')
    torch.cuda.set_device('cuda:1')  # so the events/synchronize below target the same GPU
    torch.set_default_dtype(torch.float16)

def main():
    # <1,5120> x <5120,6912> GEMM via F.linear (weight is stored as <N,K>).
    input_parallel = torch.randn(1, 1, 5120, dtype=torch.float16)
    weight = torch.randn(6912, 5120, dtype=torch.float16)

    # Warm up so the timed loop does not include cuBLAS heuristic/initialization cost.
    for _ in range(5):
        F.linear(input_parallel, weight)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(5):
        output_parallel = F.linear(input_parallel, weight)
    end_event.record()
    torch.cuda.synchronize()
    estimate_ms = start_event.elapsed_time(end_event) / 5
    print('The estimated time is:', estimate_ms, 'ms')

if __name__ == "__main__":
    env_init()
    main()
```
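
A simple way to scope the profiling to just these GEMM launches is to wrap them in an NVTX range (a minimal sketch, not part of my measurements above; the exact ncu NVTX-filter flags may differ between Nsight Compute versions):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5120, dtype=torch.float16, device='cuda')
w = torch.randn(6912, 5120, dtype=torch.float16, device='cuda')

# Mark the GEMM so the profiler can be restricted to it, e.g. something like:
#   ncu --nvtx --nvtx-include "gemm/" \
#       --metrics smsp__ops_path_tensor_src_fp16_dst_fp32.sum python test.py
# (check the flag spelling against your Nsight Compute version)
torch.cuda.nvtx.range_push("gemm")
y = F.linear(x, w)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```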

Any reply would be helpful to me.
Thanks.

Model FLOPs and hardware FLOPs can differ significantly because the software library or hardware supports only a small set of matrix shapes. If the input A, B, and C matrices do not match one of the supported shapes, the software must zero-extend the matrix dimensions. See Matrix Shape in PTX ISA 8.5 (nvidia.com) for the supported shapes.

The kernel name for the operation may provide additional information on the expanded size. Reviewing the SASS HMMA instruction details is also likely to provide useful information.
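
As a rough illustration of the zero-extension idea (a sketch only; tile_m here is a hypothetical padding granularity that you would have to infer from the kernel name or the SASS, not something the library documents):

```python
import math

def padded_gemm_flops(m, n, k, tile_m):
    # Hardware FLOPs if the M dimension is zero-extended up to a multiple of tile_m.
    m_padded = math.ceil(m / tile_m) * tile_m
    return 2 * m_padded * n * k

# Example: with a hypothetical tile_m = 64, every M from 1 to 64 gives
# 2*64*5120*6912 = 4,529,848,320, the constant value in the table in the question.
print(padded_gemm_flops(1, 6912, 5120, tile_m=64))
print(padded_gemm_flops(64, 6912, 5120, tile_m=64))
```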

Sorry, I forgot to include the kernel names:

| Program | CUDA Kernel Name |
| --- | --- |
| 1*5120*6912 GEMM | void cutlass::Kernel<cutlass_80_tensorop_f16_s16816gemm_relu_f16_64x64_64x6_tn_align8>(Params) |
| 2*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 4*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 8*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 16*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 32*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x64_sliced1x2_ldg8_f2f_stages_64x6_tn |
| 64*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x64_sliced1x2_ldg8_f2f_stages_64x6_tn |
| 65*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_64x3_tn |
| 128*5120*6912 GEMM | sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize96x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel |
| 256*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_64x3_tn |

There is also another data point, where the M dimension is 65.

| Program | Theoretical FLOPs | HMMA Inst. Num. (warp-level) | Tensor OPs (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) |
| --- | --- | --- | --- |
| 65*5120*6912 GEMM | 4,600,627,200 = 2*65*5120*6912 | 2,211,840 (2x) | 9,059,696,640 (2x) |

I want to know how I can confirm the match between the matrix size (M, N, K) and the instruction shape (16x8x16). For example, with HMMA.16816, must I keep M/16 = K/16 = N/8, or just keep M mod 16 = K mod 16 = N mod 8 = 0?
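
To make the question concrete, this is the tile-counting arithmetic I am assuming (a sketch that treats each dimension as zero-extended to a multiple of the corresponding HMMA.16816 tile edge; which GEMM dimension maps to which instruction axis may of course differ inside the real kernel):

```python
import math

def hmma_16816_count(m, n, k):
    # Warp-level m16n8k16 instructions needed if every dimension is
    # zero-extended to a multiple of the corresponding tile edge.
    return math.ceil(m / 16) * math.ceil(n / 8) * math.ceil(k / 16)

# With M already padded to 64 this reproduces the HMMA count in my table:
print(hmma_16816_count(64, 6912, 5120))  # 1,105,920
```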

I am even more curious about why this set of data is tied to 64. It looks as if, when the M dimension is less than 64, it is computed as 64, and when it exceeds 64, it is computed as an integer multiple of 64. I don't know what this 64 corresponds to.