How to measure the FLOPs of a CUDA kernel with Nsight Compute on an A100 GPU?

Hello, All.

I’m trying to get the real number of floating-point operations of some kernels on an A100 80GB GPU. I know that some new metrics have been added to Nsight Compute, such as sm__ops_path_tensor_src_fp16_dst_fp32.

For a matrix multiplication <M,K> * <K,N> = <M,N>, the theoretical FLOP count is 2*M*K*N.

So for the matrix multiplication <1,5120> * <5120,6912> = <1,6912>, I should get 70,778,880 = 2*1*5120*6912 FLOPs.

But in ncu the metric value (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) is 4,529,848,320 (64x the theoretical FLOPs), and the HMMA instruction count is 1,105,920 (the theoretical FLOPs / 64).

Even more confusing, when I calculate <2,5120> * <5120,6912> = <2,6912>, I still get the same metric and HMMA values as before in ncu, which is different from the theoretical doubling.
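
For reference, this is the arithmetic I use to compare the numbers (just a small sketch of the 2*M*K*N formula above against the value ncu reports):

```python
def theoretical_flops(m, k, n):
    # 2*M*K*N FLOPs for an <M,K> x <K,N> GEMM (one multiply + one add per MAC)
    return 2 * m * k * n

metric = 4_529_848_320  # smsp__ops_path_tensor_src_fp16_dst_fp32.sum reported by ncu

for m in (1, 2):
    t = theoretical_flops(m, 5120, 6912)
    print(m, t, metric / t)  # m=1 -> 64.0x, m=2 -> 32.0x
```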

I tested more cases, shown in the following table:

| Program | Theoretical FLOPs | HMMA Inst. Num. (warp-level) | Tensor OPs (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) |
| --- | --- | --- | --- |
| 1*5120*6912 GEMM | 70,778,880 = 2*1*5120*6912 | 1,105,920 | 4,529,848,320 (64x) |
| 2*5120*6912 GEMM | 141,557,760 = 2*2*5120*6912 | 1,105,920 | 4,529,848,320 (32x) |
| 4*5120*6912 GEMM | 283,115,520 = 2*4*5120*6912 | 1,105,920 | 4,529,848,320 (16x) |
| 8*5120*6912 GEMM | 566,231,040 = 2*8*5120*6912 | 1,105,920 | 4,529,848,320 (8x) |
| 16*5120*6912 GEMM | 1,132,462,080 = 2*16*5120*6912 | 1,105,920 | 4,529,848,320 (4x) |
| 32*5120*6912 GEMM | 2,264,924,160 = 2*32*5120*6912 | 1,105,920 | 4,529,848,320 (2x) |
| 64*5120*6912 GEMM | 4,529,848,320 = 2*64*5120*6912 | 1,105,920 | 4,529,848,320 (1x) |
| 128*5120*6912 GEMM | 9,059,696,640 = 2*128*5120*6912 | 3,317,760 | 13,588,544,960 (1.5x, other kernel function) |
| 256*5120*6912 GEMM | 18,119,393,280 = 2*256*5120*6912 | 4,423,680 | 18,119,393,280 (1x) |
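
For what it's worth, the metric and the HMMA count track each other exactly if each warp-level HMMA.16816 is counted as 16*8*16*2 = 4096 FLOPs (a quick sanity check of the table, not an official formula):

```python
# m16n8k16: 16*8*16 = 2048 multiply-adds = 4096 FLOPs per warp-level instruction
flops_per_hmma_16816 = 16 * 8 * 16 * 2

print(1_105_920 * flops_per_hmma_16816)  # 4,529,848,320 -> matches the 1..64 rows
print(4_423_680 * flops_per_hmma_16816)  # 18,119,393,280 -> matches the 256 row
```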

So I have the following questions:

  1. How does Nsight Compute obtain the metric value (sm__ops_path_tensor_…)? Is it calculated from other metrics or read directly from hardware counters?
  2. How should I understand the above data? It seems to be related to Nsight Compute's measurement method or to some feature of the Tensor Cores.
  3. If this difference is related to the Tensor Cores, which specifications might it be related to?

The following is my test code:

```python
import torch
import torch.nn.functional as F

def env_init():
    # Use the second GPU of my machine and default to fp16 tensors.
    torch.set_default_device('cuda:1')
    torch.cuda.set_device('cuda:1')  # so the events/synchronize below target the same GPU
    torch.set_default_dtype(torch.float16)

def main():
    # <1,5120> x <5120,6912> GEMM via F.linear (weight is stored as <N,K>).
    input_parallel = torch.randn(1, 1, 5120, dtype=torch.float16)
    weight = torch.randn(6912, 5120, dtype=torch.float16)

    # Warm up so the timed loop does not include cuBLAS heuristic/initialization cost.
    for _ in range(5):
        F.linear(input_parallel, weight)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(5):
        output_parallel = F.linear(input_parallel, weight)
    end_event.record()
    torch.cuda.synchronize()
    estimate_ms = start_event.elapsed_time(end_event) / 5
    print('The estimated time is:', estimate_ms, 'ms')

if __name__ == "__main__":
    env_init()
    main()
```
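
A simple way to scope the profiling to just these GEMM launches is to wrap them in an NVTX range (a minimal sketch, not part of my measurements above; the exact ncu NVTX-filter flags may differ between Nsight Compute versions):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5120, dtype=torch.float16, device='cuda')
w = torch.randn(6912, 5120, dtype=torch.float16, device='cuda')

# Mark the GEMM so the profiler can be restricted to it, e.g. something like:
#   ncu --nvtx --nvtx-include "gemm/" \
#       --metrics smsp__ops_path_tensor_src_fp16_dst_fp32.sum python test.py
# (check the flag spelling against your Nsight Compute version)
torch.cuda.nvtx.range_push("gemm")
y = F.linear(x, w)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```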

Any reply would be helpful to me.
Thanks.

Model FLOPs and hardware FLOPs can differ significantly because the software library or hardware supports only a small set of matrix shapes. If the input A, B, and C matrices do not match one of the supported shapes, the software must zero-extend the matrix dimensions. See Matrix Shape in PTX ISA 8.5 (nvidia.com) for the supported shapes.

The kernel name for the operation may provide additional information on the expanded size. Reviewing the SASS HMMA instruction details is also likely to provide useful information.
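
As a rough illustration of the zero-extension idea (a sketch only; tile_m here is a hypothetical padding granularity that you would have to infer from the kernel name or the SASS, not something the library documents):

```python
import math

def padded_gemm_flops(m, n, k, tile_m):
    # Hardware FLOPs if the M dimension is zero-extended up to a multiple of tile_m.
    m_padded = math.ceil(m / tile_m) * tile_m
    return 2 * m_padded * n * k

# Example: with a hypothetical tile_m = 64, every M from 1 to 64 gives
# 2*64*5120*6912 = 4,529,848,320, the constant value in the table in the question.
print(padded_gemm_flops(1, 6912, 5120, tile_m=64))
print(padded_gemm_flops(64, 6912, 5120, tile_m=64))
```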

Sorry, I forgot to include the kernel names:

| Program | CUDA Kernel Name |
| --- | --- |
| 1*5120*6912 GEMM | void cutlass::Kernel<cutlass_80_tensorop_f16_s16816gemm_relu_f16_64x64_64x6_tn_align8>(Params) |
| 2*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 4*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 8*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 16*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn |
| 32*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x64_sliced1x2_ldg8_f2f_stages_64x6_tn |
| 64*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x64_sliced1x2_ldg8_f2f_stages_64x6_tn |
| 65*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_64x3_tn |
| 128*5120*6912 GEMM | sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize96x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel |
| 256*5120*6912 GEMM | ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_64x3_tn |

There is also another data point, where the M dimension is 65.

| Program | Theoretical FLOPs | HMMA Inst. Num. (warp-level) | Tensor OPs (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) |
| --- | --- | --- | --- |
| 65*5120*6912 GEMM | 4,600,627,200 = 2*65*5120*6912 | 2,211,840 (2x) | 9,059,696,640 (2x) |

I want to know how I can confirm the match between the matrix size (M, N, K) and the instruction shape (16x8x16). For example, with HMMA.16816, must I keep M/16 = K/16 = N/8, or just keep M mod 16 = K mod 16 = N mod 8 = 0?
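
To make the question concrete, this is the tile-counting arithmetic I am assuming (a sketch that treats each dimension as zero-extended to a multiple of the corresponding HMMA.16816 tile edge; which GEMM dimension maps to which instruction axis may of course differ inside the real kernel):

```python
import math

def hmma_16816_count(m, n, k):
    # Warp-level m16n8k16 instructions needed if every dimension is
    # zero-extended to a multiple of the corresponding tile edge.
    return math.ceil(m / 16) * math.ceil(n / 8) * math.ceil(k / 16)

# With M already padded to 64 this reproduces the HMMA count in my table:
print(hmma_16816_count(64, 6912, 5120))  # 1,105,920
```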

I am even more curious about why this set of data is tied to 64. It looks as if, when the M dimension is less than 64, it is computed as 64, and when it exceeds 64, it is computed as an integer multiple of 64. I don't know what this 64 corresponds to.