Tensor core metrics not showing up in Nsight?

I am trying to find out which layers in my model are using tensor cores and which are not. I followed the instructions in this post to run the NVIDIA Nsight Systems profiler (nsys) on a simple PyTorch model.

main.py:

import torch
import torch.nn as nn
import torchvision.models as models

# setup
device = 'cuda:0'
model = models.resnet18().half().to(device)
data = torch.randn(64, 3, 224, 224, device=device).half()
# CrossEntropyLoss expects integer class indices, so the target stays int64
target = torch.randint(0, 1000, (64,), device=device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
    optimizer.zero_grad()

    # start profiling after 10 warmup iterations
    if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()

    # push range for current iteration
    if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))

    # push range for forward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
    output = model(data)
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

    # pop iteration range
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

torch.cuda.cudart().cudaProfilerStop()
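
(Side note: criterion and optimizer are defined above but not used. If the backward pass should be profiled too, the loop body could be extended along these lines, placed before the final range_pop of the iteration. This is only a sketch; it is not part of the run described below.)

    # sketch: an additional range for the loss and backward pass
    if i >= warmup_iters: torch.cuda.nvtx.range_push("backward")
    loss = criterion(output, target)
    loss.backward()
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()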

Here is the command I used to run Nsight Systems:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python main.py

It produces a report file, which I opened in the Nsight Systems viewer. Below is what I see in the viewer.

[screenshot of the Nsight Systems timeline]

I am trying to figure out how to see which layers are using tensor cores. I clicked through every menu I could find in the viewer, but I still haven't found a way to do this. Any advice?

One other thing: in this YouTube video on Nsight, there is a “GPU Metrics” section. It is missing from my viewer.

System details:

  • Driver version: 515
  • Nsight Systems version (nsys --version): NVIDIA Nsight Systems version 2021.3.2.12-9700a21
  • Nsight Systems viewer version: 2022.1.3.3-1c7b5f7 Linux
  • GPU: NVIDIA Titan RTX (similar to V100)

It occurred to me that maybe the problem is that I didn’t use the --gpu-metrics-set flag.

To figure out the right value for the flag, I looked at the help output:

$ nsys profile --gpu-metrics-set=help

Possible --gpu-metrics-set values are:
        [0] [tu10x]        General Metrics for NVIDIA TU10x (any frequency)
        [1] [tu11x]        General Metrics for NVIDIA TU11x (any frequency)
        [2] [ga100]        General Metrics for NVIDIA GA100 (any frequency)
        [3] [ga10x]        General Metrics for NVIDIA GA10x (any frequency)
        [4] [tu10x-gfxt]   Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
        [5] [ga10x-gfxt]   Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
        [6] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)

My Titan RTX is a TU102 (aka tu10x) GPU, so I think 0 is the right value.

So, I tried adding --gpu-metrics-set 0 to my command. Unfortunately, this didn’t add any new information to the Nsight Systems viewer window.

I’m still stuck on the problem that I described in the original post.

I think I should be using Nsight Compute (ncu) instead of Nsight Systems (nsys) to collect these metrics. I’m trying that.

Nsight Compute will give you tensor core (or rather tensor pipeline) utilization metrics on a per-kernel or per-range level, but not with time-correlated granularity, i.e. how values change over the runtime of your CUDA kernel. Which tool you want to use depends on your use case and needs.
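
For a per-kernel check, something along these lines can work (a sketch only: --target-processes all makes ncu profile the launched process tree, and the tensor-pipe metric name used here, sm__inst_executed_pipe_tensor.sum, is just one example and varies by GPU architecture and ncu version):

ncu --target-processes all --metrics sm__inst_executed_pipe_tensor.sum -o tensor_report python main.py

Kernels that report a non-zero value for that counter executed tensor-pipeline instructions.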

Your version of Nsight Systems is quite old; I would start by updating that. I think you'll need a newer version to collect GPU metrics correctly.

Here’s a hint from the documentation that will ship with our next version; it should help:

Note: Tensor Core: If you run nsys profile --gpu-metrics-device all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.
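
Combined with the command from the original post, that might look like this (a sketch; exact flag spellings can differ between nsys versions):

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --gpu-metrics-device=all --gpu-metrics-set=0 -x true -o my_profile python main.py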

Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization, since there are other overheads. In general, the more computation-intensive an operation is, the higher the Tensor Core utilization a CUDA kernel can achieve.
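
To see that effect in the timeline, one could profile two fp16 matmuls of different sizes wrapped in NVTX ranges and compare their portions of the Tensor Active row (a sketch, assuming the same nsys command as above; the sizes are arbitrary):

import torch

device = 'cuda:0'
small_a = torch.randn(256, 256, device=device).half()
small_b = torch.randn(256, 256, device=device).half()
large_a = torch.randn(8192, 8192, device=device).half()
large_b = torch.randn(8192, 8192, device=device).half()

torch.cuda.cudart().cudaProfilerStart()

torch.cuda.nvtx.range_push("small_matmul")
torch.matmul(small_a, small_b)  # little work per kernel, lower Tensor Active expected
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("large_matmul")
torch.matmul(large_a, large_b)  # compute-bound, higher Tensor Active expected
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()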

Excellent - thank you!!!

I updated to the latest nsys and nsys-ui (version 2022.5), and now the GPU Metrics rows (including Tensor Active) show up in the timeline!
