I am trying to find out which layers in my model are using Tensor Cores and which are not. I followed the instructions in this post to use the NVIDIA Nsight Systems profiler (nsys) on a simple PyTorch model.
main.py:
import torch
import torch.nn as nn
import torchvision.models as models

# setup
device = 'cuda:0'
model = models.resnet18().half().to(device)
data = torch.randn(64, 3, 224, 224, device=device).half()
# CrossEntropyLoss expects integer class indices, so the target stays int64
target = torch.randint(0, 1000, (64,), device=device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
    optimizer.zero_grad()

    # start profiling after 10 warmup iterations
    if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()

    # push range for current iteration
    if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))

    # push range for forward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
    output = model(data)
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

    # push range for backward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("backward")
    loss = criterion(output, target)
    loss.backward()
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

    # push range for optimizer step
    if i >= warmup_iters: torch.cuda.nvtx.range_push("opt.step()")
    optimizer.step()
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

    # pop iteration range
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

torch.cuda.cudart().cudaProfilerStop()
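As an aside, to make it easier to match kernels in the timeline to individual layers, the forward pass can also be annotated per layer. The helper below is my own sketch (not from the tutorial I followed); it pushes an NVTX range around every leaf module via forward hooks, so each layer's kernels show up nested under its name in the Nsight Systems timeline. torch.autograd.profiler.emit_nvtx() is an alternative that tags every autograd op instead.

import torch

def add_layer_nvtx_ranges(model):
    # hypothetical helper: wrap every leaf module's forward pass in an NVTX range
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # only leaf modules, so ranges nest cleanly
        def pre_hook(mod, inp, layer_name=name):
            torch.cuda.nvtx.range_push(layer_name)   # return None so the input is untouched
        def post_hook(mod, inp, out, layer_name=name):
            torch.cuda.nvtx.range_pop()
        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)

# usage: call add_layer_nvtx_ranges(model) once, before the profiled iterations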
Here is the command I used to run Nsight Systems:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python main.py
It produces a profile file, which I opened in the Nsight Systems GUI. Below is what I see in the viewer:
[screenshot of the Nsight Systems timeline view]
I am trying to figure out how to see which layers are using Tensor Cores. I clicked on every menu I could find, but I haven't found this information anywhere. Any advice?
One other thing: in this YouTube video on Nsight, there is a "GPU Metrics" section. This section is missing from my viewer.
System details:
- Driver version: 515
- Nsight Systems version (nsys --version): NVIDIA Nsight Systems version 2021.3.2.12-9700a21
- Nsight Systems GUI version: 2022.1.3.3-1c7b5f7 Linux
- GPU: NVIDIA Titan RTX (Turing TU102; has Tensor Cores like the V100)
It occurred to me that maybe the problem is that I didn't use the --gpu-metrics-set flag. To figure out the right value for the flag, I looked at the built-in help:
$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
[0] [tu10x] General Metrics for NVIDIA TU10x (any frequency)
[1] [tu11x] General Metrics for NVIDIA TU11x (any frequency)
[2] [ga100] General Metrics for NVIDIA GA100 (any frequency)
[3] [ga10x] General Metrics for NVIDIA GA10x (any frequency)
[4] [tu10x-gfxt] Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
[5] [ga10x-gfxt] Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
[6] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)
My Titan RTX is a TU102 (aka tu10x) GPU, so I think 0 is the right value.
So, I tried adding --gpu-metrics-set 0 to my command. Unfortunately, this didn't add any new information to the Nsight viewer window.
I’m still stuck on the problem that I described in the original post.
I think I should be using Nsight Compute (ncu) instead of Nsight Systems (nsys) to collect these metrics. I'm trying that.
Nsight Compute will give you tensor core (or rather tensor pipeline) utilization metrics on a per-kernel or per-range level, but not with time-correlated granularity, i.e. how values change over the runtime of your CUDA kernel. Which tool you want to use depends on your use case and needs.
Your version of Nsight Systems is very old; I would start by updating that. I think you'll need a newer version for the GPU metrics to be collected correctly.
Here's a hint from the documentation that will ship with our next version, which should help:
Note: Tensor Core: If you run nsys profile --gpu-metrics-device all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.
Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization since there are other overheads. In general, the more computation-intensive an operation is, the higher Tensor Core utilization rate the CUDA kernel can achieve.
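For example, profiling a toy comparison like the one below (my own illustration, not from the docs) should show a much higher Tensor Active percentage for the large fp16 matmul than for the tiny one, where launch and memory overhead dominate:

import torch

device = 'cuda:0'
a_big = torch.randn(4096, 4096, device=device, dtype=torch.half)
b_big = torch.randn(4096, 4096, device=device, dtype=torch.half)
a_small = torch.randn(64, 64, device=device, dtype=torch.half)
b_small = torch.randn(64, 64, device=device, dtype=torch.half)

torch.cuda.cudart().cudaProfilerStart()
torch.cuda.nvtx.range_push("big_matmul")    # compute-bound: expect high Tensor Active
torch.matmul(a_big, b_big)
torch.cuda.nvtx.range_pop()
torch.cuda.nvtx.range_push("small_matmul")  # overhead-bound: expect low Tensor Active
torch.matmul(a_small, b_small)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()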
Excellent - thank you!!! I updated to the latest nsys and nsys-ui (version 2022.5), and now these things show up in the plot!
Hi, I am profiling on an A100 with nsys, but it shows:
$ nsys profile --gpu-metrics-set=ga10x-gfxact ./test
Illegal --gpu-metrics-set argument: ga10x-gfxact.
Metric set is not supported by GPU 0.
Use the '--gpu-metrics-set=help' switch to see the full list of values.
usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.
The available GPU metric sets on this machine are:
~$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
[0] [tu10x] General Metrics for NVIDIA TU10x (any frequency)
[1] [tu11x] General Metrics for NVIDIA TU11x (any frequency)
[2] [ga100] General Metrics for NVIDIA GA100 (any frequency)
[3] [ga10x] General Metrics for NVIDIA GA10x (any frequency)
[4] [gh100] General Metrics for NVIDIA GH100 (any frequency)
[5] [ad10x] General Metrics for NVIDIA AD10x (any frequency)
[6] [tu10x-gfxt] Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
[7] [ga10x-gfxt] Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
[8] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)
[9] [ga10b] General Metrics for NVIDIA GA10B (any frequency)
My environment is
GPU:
65:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
Driver Version:
Fri May 17 15:22:25 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:65:00.0 Off | 0 |
| N/A 34C P0 34W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:CA:00.0 Off | 0 |
| N/A 63C P0 188W / 250W | 25774MiB / 40960MiB | 75% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 60047 C ...envs/python310/bin/python 25772MiB |
+-----------------------------------------------------------------------------+
CUDA Version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
Nsys Version: NVIDIA Nsight Systems version 2022.4.2.18-32044700v0
Why can I not use the ga10x-gfxact metric set?
@pkovalenko, can you help with this?
GA100 is not GA10x. GA10x denotes the consumer desktop Ampere chips: GA102, GA104, etc. Which metrics do you need that are not available in the ga100 metric set?
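If it helps to double-check which family a GPU belongs to from Python: GA100 reports compute capability 8.0, while GA10x parts such as the RTX 3070 report 8.6. A quick check, assuming PyTorch is installed:

import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # A100 (GA100) prints 8.0; GA102/GA104/... (GA10x) print 8.6
    print(i, props.name, f"{props.major}.{props.minor}")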
Hi, I tried to use the ga100 metric set to profile my program:
nsys profile --gpu-metrics-device=0 --gpu-metrics-set=ga100 ./test
After profiling, I downloaded the .nsys-rep file and opened it in the Nsight Systems GUI on Windows, where it looks like this:
[screenshot of the Nsight Systems timeline]
The metrics are not very detailed; for example, FMA throughput is not displayed.
I did the same profiling on my RTX 3070 with
nsys profile --gpu-metrics-device=0 --gpu-metrics-set=ga10x-gfxact ./test
and the metrics include SM instruction throughputs and many other details. I want to see these metrics on the A100 GPU; what should I do?