Nsight Compute on Hopper: Is TMA Traffic Reflected in Device Memory (DRAM) Metrics?

Hello everyone,

I’m observing some memory traffic patterns on a Hopper GPU that I find confusing and would appreciate some clarification on.

My Environment:

  • GPU: Hopper Architecture
  • Framework: PyTorch
  • Software: CUDA 12.4, Driver 550, Nsight Compute 2025.1

Observation: When profiling a GEMM kernel, I’m seeing a significant amount of byte traffic reported in the L1/TEX Cache section for metrics like Global Load and Global Load To Shared Store (Bypass).

However, when I look at the Device Memory section at the bottom of the report, which I understand to represent DRAM traffic, the corresponding Load and Store byte counts are dramatically lower, and sometimes even zero.

(Screenshot of the relevant report sections: image_8d5ff4.png)

My Hypothesis & Question:

My main hypothesis is that these memory operations are being handled by Hopper’s Tensor Memory Accelerator (TMA). The traffic initiated by TMA appears to be accounted for in the L1-level metrics, but seems to be missing from the final Device Memory (DRAM) traffic summary.

Is this the expected behavior for ncu on Hopper? Does the Device Memory section intentionally exclude TMA-initiated traffic (perhaps assuming it’s serviced by the L2 cache without hitting DRAM)? Or could this be a potential reporting issue where TMA traffic is not being fully aggregated into the top-level DRAM statistics?
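
As a cross-check, I plan to re-run the same workload (the test script and sizes are at the end of this post) collecting only the raw DRAM byte counters. My understanding, which may well be wrong, is that dram__bytes_read.sum and dram__bytes_write.sum are the counters behind the Device Memory table:

ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum python gemm.py 128 512 32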

Side Note: Interestingly, for the small GEMM sizes I’m testing, PyTorch is launching a kernel compiled for sm80 (Ampere) onto the Hopper hardware. I’m not sure if this could be a contributing factor to how these performance counters are reported.
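
For what it’s worth, here is a minimal sketch of one way to check which kernel gets dispatched from Python, using torch.profiler (the shapes match the command at the end of the post). The kernel naming is cuBLAS-internal, so treat the sm80/"ampere" prefix check as an assumption on my part:

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")
# Same shapes as the command below: M=128, N=512, K=32
A = torch.randn(128, 32, device=device, dtype=torch.half)
B = torch.randn(32, 512, device=device, dtype=torch.half)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# Kernel names in this table typically encode the target architecture
# (e.g. an "ampere_..."/sm80 cuBLAS kernel vs. an sm90 one).
print(prof.key_averages().table(sort_by="cuda_time_total"))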

Thanks in advance for any insights.

My test Python code:

import torch
import sys

# Check for an available CUDA device
if not torch.cuda.is_available():
    print("Error: no CUDA device found.")
    sys.exit(1)
device = torch.device("cuda")

# Check the command-line arguments
if len(sys.argv) != 4:
    print(f"Usage: python3 {sys.argv[0]} <M> <N> <K>", file=sys.stderr)
    sys.exit(1)

try:
    # Read M, N, K from the command line
    M, N, K = map(int, sys.argv[1:])

    # Create two random half-precision matrices on the GPU
    A = torch.randn(M, K, device=device, dtype=torch.half)
    B = torch.randn(K, N, device=device, dtype=torch.half)

    # Run the matmul, with synchronization so ncu can capture it
    torch.cuda.synchronize()
    C = torch.matmul(A, B)
    torch.cuda.synchronize()

except Exception as e:
    # Print errors to stderr so they don't pollute ncu's output
    print(f"Error during execution: {e}", file=sys.stderr)
    sys.exit(1)

Command:

python gemm.py 128 512 32

DLRM-0_GEMM2_MLN.nsight-cuprof-report.zip (3.5 MB)