Hello, All.
I’m trying to measure the real number of floating-point operations of some kernels on an A100 80GB GPU. I know that some new metrics have been added to Nsight Compute, such as sm__ops_path_tensor_src_fp16_dst_fp32.
For a matrix multiplication <M,K>*<K,N>=<M,N>, the theoretical FLOP count is 2*M*N*K.
So for the matrix multiplication <1,5120>*<5120,6912> = <1,6912>, I should get 70,778,880 = 2*1*5120*6912 FLOPs.
But the metric value (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) reported by ncu is 4,529,848,320 (64x the theoretical FLOPs), and the HMMA instruction count is 1,105,920 (1/64 of the theoretical FLOPs).
Even more confusing: when I compute <2,5120>*<5120,6912> = <2,6912>, I still get the same metric and HMMA values as before in ncu, instead of the doubling that theory predicts.
I tested more sizes; the results are in the following table:
| Program | Theoretical FLOPs | HMMA Ins. Num. (warp-level) | Tensor OPs (smsp__ops_path_tensor_src_fp16_dst_fp32.sum) |
|---|---|---|---|
| 1*5120*6912 GEMM | 70,778,880 = 2*1*5120*6912 | 1,105,920 | 4,529,848,320 (64x) |
| 2*5120*6912 GEMM | 141,557,760 = 2*2*5120*6912 | 1,105,920 | 4,529,848,320 (32x) |
| 4*5120*6912 GEMM | 283,115,520 = 2*4*5120*6912 | 1,105,920 | 4,529,848,320 (16x) |
| 8*5120*6912 GEMM | 566,231,040 = 2*8*5120*6912 | 1,105,920 | 4,529,848,320 (8x) |
| 16*5120*6912 GEMM | 1,132,462,080 = 2*16*5120*6912 | 1,105,920 | 4,529,848,320 (4x) |
| 32*5120*6912 GEMM | 2,264,924,160 = 2*32*5120*6912 | 1,105,920 | 4,529,848,320 (2x) |
| 64*5120*6912 GEMM | 4,529,848,320 = 2*64*5120*6912 | 1,105,920 | 4,529,848,320 (1x) |
| 128*5120*6912 GEMM | 9,059,696,640 = 2*128*5120*6912 | 3,317,760 | 13,588,544,960 (1.5x, different kernel function) |
| 256*5120*6912 GEMM | 18,119,393,280 = 2*256*5120*6912 | 4,423,680 | 18,119,393,280 (1x) |
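As a sanity check on the table (this is my own arithmetic, and the 64-row tile is only my guess at what the GEMM kernel uses), each warp-level HMMA seems to account for 4,096 ops (= 2 × 16 × 8 × 16, i.e. an m16n8k16 MMA shape), and the measured ops for every M ≤ 64 equal 2 × 64 × 5120 × 6912, as if M were padded up to 64:

```python
import math

K, N = 5120, 6912
measured_ops = 4_529_848_320  # smsp__ops_path_tensor_src_fp16_dst_fp32.sum
hmma_count = 1_105_920        # warp-level HMMA instructions from ncu

# Ops per HMMA: 4096 = 2*16*8*16, consistent with an m16n8k16 MMA (my guess)
print(measured_ops // hmma_count)  # 4096

# Hypothesis: the kernel pads M up to a 64-row tile, so M = 1..64 all cost the same
tile_m = 64  # assumed tile height, not confirmed
for M in (1, 2, 4, 8, 16, 32, 64):
    padded_m = math.ceil(M / tile_m) * tile_m
    assert 2 * padded_m * K * N == measured_ops

# The M = 256 row also fits this pattern exactly (1x in the table)
print(2 * 256 * K * N)  # 18119393280
```

Only the 128-row case breaks the pattern, and ncu reports that a different kernel is launched there, so I suspect tiling/padding inside the kernel, but I cannot confirm it.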
So I have the following questions:
- How does Nsight Compute obtain the metric value (sm__ops_path_tensor_…)? Is it derived from other metrics or read directly from hardware counters?
- How should I interpret the data above? It seems it may be related to Nsight Compute's measurement method or to a feature of the tensor cores.
- If this difference is related to the tensor cores, which of their specifications might it depend on?
The following is my test code:
```python
import torch
import torch.nn.functional as F

def env_init():
    torch.set_default_device('cuda:1')
    torch.set_default_dtype(torch.float16)

def main():
    input_parallel = torch.randn(1, 1, 5120, dtype=torch.float16)
    weight = torch.randn(6912, 5120, dtype=torch.float16)

    # Time 5 launches of F.linear with CUDA events and report the average
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(5):
        output_parallel = F.linear(input_parallel, weight)
    end_event.record()
    torch.cuda.synchronize()
    estimate_ms = start_event.elapsed_time(end_event) / 5
    print('The estimated time is:', estimate_ms)

if __name__ == "__main__":
    env_init()
    main()
```
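For reference, this is roughly how I collect the metric with the ncu CLI (the script name `test_gemm.py` is just a placeholder for the code above):

```shell
# Collect only the tensor-op metric for the profiled kernels
ncu --metrics smsp__ops_path_tensor_src_fp16_dst_fp32.sum python test_gemm.py
```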
Any reply would be helpful.
Thanks.