Roofline Tensor Core: should it be half, not float?


a40-lamma3-8B-FFN-b1-s8192-h4096-in14336-roofline.zip (205.4 KB)

import torch

# Parameters
b = 1                      # batch size
s = 8192                   # sequence length
h = 4096                   # hidden size
intermediate_size = 14336  # intermediate size of the FFN layers
dtype = torch.half         # use half precision

# Pick a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

# Report the total memory of the selected device
if torch.cuda.is_available():
    total_memory = torch.cuda.get_device_properties(device).total_memory / 1024 / 1024
    print(f"Total GPU memory: {total_memory:.2f} MB")
else:
    print("No GPU available")

# Reset the peak-memory statistics (guarded: this is a CUDA-only API)
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats(device)

# Create the matrices in half precision directly on the device
A = torch.randn(b * s, h, dtype=dtype, device=device)
B = torch.randn(h, intermediate_size, dtype=dtype, device=device)
C = torch.randn(intermediate_size, h, dtype=dtype, device=device)

# First matmul: (b*s, h) @ (h, intermediate_size)
M1 = A @ B

# Second matmul: (b*s, intermediate_size) @ (intermediate_size, h)
M2 = M1 @ C

# Wait for the kernels to finish and report the peak memory they used
if torch.cuda.is_available():
    torch.cuda.synchronize(device)
    peak = torch.cuda.max_memory_allocated(device) / 1024 / 1024
    print(f"Peak GPU memory allocated: {peak:.2f} MB")
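
For reference, here is a back-of-the-envelope estimate of the work and arithmetic intensity these two GEMMs imply (a minimal sketch of my own: it assumes every operand crosses DRAM exactly once, including a write and read-back of M1, which real caching will change):

# Rough roofline inputs for the two GEMMs above (pure arithmetic, no GPU needed)
b, s, h, i = 1, 8192, 4096, 14336
bytes_per_elem = 2  # fp16

flops = 2 * (b * s) * h * i + 2 * (b * s) * i * h  # one multiply-add = 2 FLOPs

bytes_moved = bytes_per_elem * (
    b * s * h          # read A
    + h * i            # read B
    + 2 * b * s * i    # write M1, then read it back
    + i * h            # read C
    + b * s * h        # write M2
)

print(f"{flops / 1e12:.2f} TFLOP, {bytes_moved / 1e9:.2f} GB moved")
print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} FLOP/byte")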

I am wondering: I am using half here, but I can only find one Tensor Core related graph, and it shows float. Does this actually mean “half”?

Hi @202476410arsmart,

Did you notice there is a dropdown here where you can select the roofline chart?


The current Tensor Core roofline is designed to support only GV100, which supports a single format: FP16. A future version of Nsight Compute will support all data types. In the meantime, operation counts can be collected using the metrics listed by:

ncu --query-metrics | grep sm__ops_

For example, the baseline metric names are of the form:

sm__ops_path_tensor_src_bf16_dst_fp32
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on

The following metrics can be useful:

sm__ops_path_tensor{…}.sum - count of operations
sm__ops_path_tensor{…}.sum.per_second - operations/sec
sm__ops_path_tensor{…}.avg.pct_of_peak_sustained_elapsed - % of maximum throughput
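
For example (a command sketch, assuming the fp16 → fp16 path and that the workload above is launched as python script.py):

ncu --metrics sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second python script.py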


Thank you!!!

I measured and confirmed that, currently, the roofline Tensor Core arithmetic intensity is equal to:

sm__inst_executed_pipe_tensor.sum * 512 / dram__bytes.sum

But not to sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off

(when I am using fp16 → fp16)
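
As a quick sanity check of the instruction-based formula (a sketch of my own, assuming the chart really charges 512 FLOPs per tensor pipe instruction, as the expression above implies):

# Both GEMMs above: 2 FLOPs per multiply-add, two (8192x4096)x(4096x14336)-shaped products
flops = 2 * 2 * 8192 * 4096 * 14336  # ~1.92e12 FLOPs
print(flops / 512)                   # ~3.8e9 expected tensor pipe instructions, if the assumption holds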

Do you know the difference here?