Nsight Compute on Hopper: Is TMA Traffic Reflected in Device Memory (DRAM) Metrics?

Hello everyone,

I’m observing some memory traffic patterns on a Hopper GPU that I find confusing and would appreciate some clarification on.

My Environment:

  • GPU: Hopper Architecture
  • Framework: PyTorch
  • Software: CUDA 12.4, Driver 550, Nsight Compute 2025.1

Observation: When profiling a GEMM kernel, I’m seeing a significant amount of byte traffic reported in the L1/TEX Cache section for metrics like Global Load and Global Load To Shared Store (Bypass).

However, when I look at the Device Memory section at the bottom of the report, which I understand to represent DRAM traffic, the corresponding Load and Store byte counts are dramatically lower, and sometimes even zero.

(Screenshot of the relevant report sections: image_8d5ff4.png)

My Hypothesis & Question:

My main hypothesis is that these memory operations are being handled by Hopper’s Tensor Memory Accelerator (TMA). The traffic initiated by TMA appears to be accounted for in the L1-level metrics, but seems to be missing from the final Device Memory (DRAM) traffic summary.

Is this the expected behavior for ncu on Hopper? Does the Device Memory section intentionally exclude TMA-initiated traffic (perhaps assuming it’s serviced by the L2 cache without hitting DRAM)? Or could this be a potential reporting issue where TMA traffic is not being fully aggregated into the top-level DRAM statistics?
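
As a cross-check, I plan to re-run the same workload (the test script and sizes are at the end of this post) collecting only the raw DRAM byte counters. My understanding, which may well be wrong, is that dram__bytes_read.sum and dram__bytes_write.sum are the counters behind the Device Memory table:

ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum python gemm.py 128 512 32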

Side Note: Interestingly, for the small GEMM sizes I’m testing, PyTorch is launching a kernel compiled for sm80 (Ampere) onto the Hopper hardware. I’m not sure if this could be a contributing factor to how these performance counters are reported.
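
For what it’s worth, here is a minimal sketch of one way to check which kernel gets dispatched from Python, using torch.profiler (the shapes match the command at the end of the post). The kernel naming is cuBLAS-internal, so treat the sm80/"ampere" prefix check as an assumption on my part:

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")
# Same shapes as the command below: M=128, N=512, K=32
A = torch.randn(128, 32, device=device, dtype=torch.half)
B = torch.randn(32, 512, device=device, dtype=torch.half)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# Kernel names in this table typically encode the target architecture
# (e.g. an "ampere_..."/sm80 cuBLAS kernel vs. an sm90 one).
print(prof.key_averages().table(sort_by="cuda_time_total"))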

Thanks in advance for any insights.

My test Python code:

import torch
import sys

# Check for an available CUDA device
if not torch.cuda.is_available():
    print("Error: no CUDA device found.")
    sys.exit(1)
device = torch.device("cuda")

# Check the command-line arguments
if len(sys.argv) != 4:
    print(f"Usage: python3 {sys.argv[0]} <M> <N> <K>", file=sys.stderr)
    sys.exit(1)

try:
    # Read M, N, K from the command line
    M, N, K = map(int, sys.argv[1:])

    # Create two random half-precision matrices on the GPU
    A = torch.randn(M, K, device=device, dtype=torch.half)
    B = torch.randn(K, N, device=device, dtype=torch.half)

    # Run the matmul, with synchronization so ncu can capture it
    torch.cuda.synchronize()
    C = torch.matmul(A, B)
    torch.cuda.synchronize()

except Exception as e:
    # Print errors to stderr so they don't pollute ncu's output
    print(f"Error during execution: {e}", file=sys.stderr)
    sys.exit(1)

Command:

python gemm.py 128 512 32

DLRM-0_GEMM2_MLN.nsight-cuprof-report.zip (3.5 MB)