Tensor core metrics not showing up in NSight?

I am trying to find out which layers in my model are using tensor cores, and which are not. I followed the instructions in this post to use NVIDIA NSight Profiler (nsys) on a simple PyTorch model.

main.py:

import torch
import torch.nn as nn
import torchvision.models as models
 
# setup
device = 'cuda:0'
model = models.resnet18().half().to(device)
data = torch.randn(64, 3, 224, 224, device=device).half()
target = torch.randint(0, 1000, (64,), device=device)  # class indices stay int64 for CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
 
nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
   optimizer.zero_grad()
 
   # start profiling after 10 warmup iterations
   if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()
 
   # push range for current iteration
   if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))
 
   # push range for forward
   if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
   output = model(data)
   if i >= warmup_iters: torch.cuda.nvtx.range_pop()
 
   # pop iteration range
   if i >= warmup_iters: torch.cuda.nvtx.range_pop()
 
torch.cuda.cudart().cudaProfilerStop()
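
The repeated `if i >= warmup_iters:` guards around each push/pop can be folded into a small context manager. This is only a sketch: `nvtx_range` is a hypothetical helper name, and its `push`/`pop` parameters exist only so it can be exercised without a GPU. By default it falls back to `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`, as used in the script above.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, enabled=True, push=None, pop=None):
    """Open an NVTX range around the enclosed block when `enabled` is true.

    `push` and `pop` default to torch.cuda.nvtx.range_push / range_pop.
    They are injectable (an assumption of this sketch) for GPU-free testing.
    """
    if not enabled:
        # Warmup iterations: run the block without emitting any range.
        yield
        return
    if push is None or pop is None:
        import torch  # imported lazily; only needed for the default callables
        push = push or torch.cuda.nvtx.range_push
        pop = pop or torch.cuda.nvtx.range_pop
    push(name)
    try:
        yield
    finally:
        pop()  # always close the range, even if the block raises


# Usage inside the training loop (mirrors the script above):
#
# for i in range(nb_iters):
#     profiling = i >= warmup_iters
#     with nvtx_range("iteration{}".format(i), profiling):
#         with nvtx_range("forward", profiling):
#             output = model(data)
```

Nesting the `with` blocks keeps every push paired with its pop, which is easy to get wrong when the calls are spread over separate `if` statements.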

Here is the command I used to run NSight:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python main.py

It produces a profile file, which I opened in the NSight viewer. Below is what I see in the viewer.

I am trying to figure out how to see which layers are using tensor cores. I clicked on every menu I could find, but I haven’t yet figured out how to do this. Any advice on how to see which layers are using tensor cores?

One other thing: in this YouTube video on NSight, there is a “GPU Metrics” section. This is missing from my viewer.

System details:

  • Driver version: 515
  • NSight version (nsys –version): NVIDIA Nsight Systems version 2021.3.2.12-9700a21
  • NSight viewer version: Version: 2022.1.3.3-1c7b5f7 Linux.
  • GPU: NVIDIA Titan RTX (Turing, TU102)
It occurred to me that maybe the problem is that I didn’t use the --gpu-metrics-set flag.

To figure out the right value of the flag, I looked at…

$ nsys profile --gpu-metrics-set=help

Possible --gpu-metrics-set values are:
        [0] [tu10x]        General Metrics for NVIDIA TU10x (any frequency)
        [1] [tu11x]        General Metrics for NVIDIA TU11x (any frequency)
        [2] [ga100]        General Metrics for NVIDIA GA100 (any frequency)
        [3] [ga10x]        General Metrics for NVIDIA GA10x (any frequency)
        [4] [tu10x-gfxt]   Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
        [5] [ga10x-gfxt]   Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
        [6] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)

My Titan RTX is a TU102 (aka tu10x) GPU, so I think 0 is the right value.

So, I tried adding --gpu-metrics-set 0 to my command. Unfortunately, this didn’t add any new information to the NSight viewer window.

I’m still stuck on the problem that I described in the original post.

I think I should be using NSight Compute (ncu) instead of NSight Systems (nsys) to collect these metrics. I’m trying that.

Nsight Compute will give you tensor core (or rather tensor pipeline) utilization metrics on a per-kernel or per-range level, but not with time-correlated granularity, i.e. how values change over the runtime of your CUDA kernel. Which tool you want to use depends on your use case and needs.
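
As a rough first pass before collecting any metrics, the kernel names already recorded in the timeline can hint at tensor core usage. The sketch below is a naming heuristic and an assumption on my part, not something either tool exposes as an API: cuDNN, cuBLAS, and CUTLASS kernels that use tensor cores commonly carry markers such as `h884`, `s1688`, `s16816`, `tensorop`, or `xmma` in their names. The function name and the example kernel names are illustrative only, and a match or miss should be confirmed with the actual tensor pipeline metrics.

```python
import re

# Substrings that commonly appear in names of library kernels that use tensor
# cores. This list is a heuristic assumption, not an official or complete one.
_TENSOR_CORE_PATTERN = re.compile(
    r"[hsi](?:884|1688|16816|8816|16832)|tensorop|[hix]mma|wmma",
    re.IGNORECASE,
)


def looks_like_tensor_core_kernel(kernel_name):
    """Return True if the kernel name contains a typical tensor core marker."""
    return _TENSOR_CORE_PATTERN.search(kernel_name) is not None


if __name__ == "__main__":
    # Illustrative kernel names; real names come from your own profile.
    for name in [
        "turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn",
        "volta_sgemm_128x64_nn",
    ]:
        print(name, looks_like_tensor_core_kernel(name))
```

Custom kernels can use tensor cores without any of these markers in their names, so treat a negative result as inconclusive.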

Your version of Nsight Systems is very old; I would start by updating it. I think you’ll need a newer version to get GPU metrics working correctly.

Here’s a helpful hint from the documentation that will ship with our next version.

Note: Tensor Core: If you run nsys profile --gpu-metrics-device all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.

Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization since there are other overheads. In general, the more computation-intensive an operation is, the higher Tensor Core utilization rate the CUDA kernel can achieve.
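
To make the combination of flags concrete, here is a small sketch that assembles an nsys command line with GPU metrics sampling enabled. It only uses flags that already appear in this thread (`-t`, `--capture-range`, `--gpu-metrics-device`, `--gpu-metrics-set`, `-o`). The function name `build_nsys_command` and its defaults are hypothetical, and which flags your nsys version accepts should be checked with `nsys profile --help`.

```python
import shlex


def build_nsys_command(script, output="my_profile",
                       metrics_device="all", metrics_set=None):
    """Assemble an `nsys profile` command with GPU metrics sampling enabled.

    Only flags quoted in this thread are used. `metrics_set` is optional;
    when omitted, nsys is left to choose a metric set for the device.
    """
    cmd = [
        "nsys", "profile",
        "-t", "cuda,nvtx,cudnn,cublas",
        "--capture-range=cudaProfilerApi",
        "--gpu-metrics-device={}".format(metrics_device),
    ]
    if metrics_set is not None:
        cmd.append("--gpu-metrics-set={}".format(metrics_set))
    cmd += ["-o", output, "python", script]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_nsys_command("main.py", metrics_set="tu10x")))
```

Passing the resulting list to `subprocess.run` would launch the profile. Printing it first makes it easy to confirm that the metric set matches the GPU.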

Excellent - thank you!!!

I updated to the latest nsys and nsys-ui (version 2022.5), and now these things show up in the plot!

Hi, I am profiling on an A100 with nsys, but it reports the following error:

$ nsys profile --gpu-metrics-set=ga10x-gfxact ./test
Illegal --gpu-metrics-set argument: ga10x-gfxact.
Metric set is not supported by GPU 0.
Use the '--gpu-metrics-set=help' switch to see the full list of values.

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

The available GPU metric sets are:

~$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
[0] [tu10x]        General Metrics for NVIDIA TU10x (any frequency)
[1] [tu11x]        General Metrics for NVIDIA TU11x (any frequency)
[2] [ga100]        General Metrics for NVIDIA GA100 (any frequency)
[3] [ga10x]        General Metrics for NVIDIA GA10x (any frequency)
[4] [gh100]        General Metrics for NVIDIA GH100 (any frequency)
[5] [ad10x]        General Metrics for NVIDIA AD10x (any frequency)
[6] [tu10x-gfxt]   Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
[7] [ga10x-gfxt]   Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
[8] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)
[9] [ga10b]        General Metrics for NVIDIA GA10B (any frequency)

My environment is

GPU:
65:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)

Driver Version:
Fri May 17 15:22:25 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   63C    P0   188W / 250W |  25774MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     60047      C   ...envs/python310/bin/python    25772MiB |
+-----------------------------------------------------------------------------+

CUDA Version:
nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2022 NVIDIA Corporation
      Built on Mon_Oct_24_19:12:58_PDT_2022
      Cuda compilation tools, release 12.0, V12.0.76
      Build cuda_12.0.r12.0/compiler.31968024_0

Nsys Version: NVIDIA Nsight Systems version 2022.4.2.18-32044700v0

Why can’t I use the ga10x-gfxact metric set?

@pkovalenko, can you help with this?

GA100 is not GA10x. GA10x denotes consumer desktop Ampere chips: GA102, GA104, etc. Which metrics do you need that are not available in the ga100 metric set?
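
The naming trap here is that "GA100" looks like it fits the "GA10x" pattern. A minimal sketch of the matching rule follows, assuming that an exact alias always wins and that a trailing `x` stands for one remaining character of the chip name. That wildcard reading is my interpretation of the aliases, not documented nsys behavior. `pick_metric_set` and `GENERAL_SETS` are hypothetical names; the alias list is copied from the `--gpu-metrics-set=help` output above.

```python
# General metric-set aliases as listed by `nsys profile --gpu-metrics-set=help`
# in this thread (graphics-specific sets omitted).
GENERAL_SETS = ["tu10x", "tu11x", "ga100", "ga10x", "gh100", "ad10x", "ga10b"]


def pick_metric_set(chip, aliases=GENERAL_SETS):
    """Return the metric-set alias for a chip name such as 'GA100' or 'TU102'.

    Assumption: an exact alias (ga100, ga10b) takes priority over a family
    alias (ga10x), whose trailing 'x' matches any single character.
    """
    chip = chip.lower()
    if chip in aliases:
        return chip
    for alias in aliases:
        is_family = alias.endswith("x")
        if is_family and len(alias) == len(chip) and chip.startswith(alias[:-1]):
            return alias
    return None  # no known metric set for this chip


if __name__ == "__main__":
    for chip in ["GA100", "GA102", "TU102"]:
        print(chip, "->", pick_metric_set(chip))
```

Checking the exact alias first is what keeps the A100 (GA100) on `ga100` rather than on the consumer `ga10x` sets.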

Hi, I tried to use the ga100 metric set to profile my program:

nsys profile  --gpu-metrics-device=0 --gpu-metrics-set=ga100 ./test

After profiling, I downloaded the .nsys-rep file and opened it in the NVIDIA Nsight Systems GUI on Windows. It looks like this:

The metrics are not detailed; for example, some metrics such as FMA throughput are not displayed.

I ran the same profiling on my RTX 3070 with

nsys profile  --gpu-metrics-device=0 --gpu-metrics-set=ga10x-gfxact ./test

and there the metrics include SM instruction throughputs and many other details.

I want to see these metrics on the A100 GPU. What should I do?