Hello,
I am trying to profile some jobs, but nsys is not able to collect GPU metrics on my V100.
I couldn't find any documentation listing which GPUs are supported.
Can you help with this?
Output on the V100 machine:
$ nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
Some GPUs are not supported:
Tesla V100-SXM2-32GB PCI[0000:62:00.0]
$ nvidia-smi
Wed Aug 7 20:02:19 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-32GB Off | 00000000:62:00.0 Off | 0 |
| N/A 35C P0 53W / 300W | 0MiB / 32768MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
I am working in a controlled environment, so I don't have sudo access.
For comparison, GPU metrics work fine on my Tesla T4 machine:
$ nvidia-smi
Wed Aug 7 20:58:45 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:81:00.0 Off | 0 |
| N/A 50C P0 28W / 70W | 0MiB / 15360MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
0: Tesla T4 PCI[0000:81:00.0]
all: Select all supported GPUs
none: Disable GPU Metrics [Default]
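For anyone comparing several machines, here is a minimal sketch for splitting the `--gpu-metrics-device=help` output into supported and unsupported GPUs. It assumes the output keeps the layout shown in the two listings above, which may differ between nsys versions; `parse_gpu_metrics_help` is a hypothetical helper name, not part of nsys.

```python
import re


def parse_gpu_metrics_help(text):
    """Parse the text printed by `nsys profile --gpu-metrics-device=help`.

    Returns (supported, unsupported):
      supported   - dict mapping device index to its description
      unsupported - list of descriptions listed as not supported
    Assumes the layout shown in the listings above.
    """
    supported = {}
    unsupported = []
    in_unsupported = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("$"):
            continue  # skip blank lines and the echoed shell prompt
        if line.startswith("Possible --gpu-metrics-device values"):
            in_unsupported = False
            continue
        if line.startswith("Some GPUs are not supported"):
            in_unsupported = True
            continue
        if in_unsupported:
            unsupported.append(line)
            continue
        # Supported devices look like "0: Tesla T4 PCI[0000:81:00.0]";
        # the "all:" and "none:" entries do not start with a digit.
        match = re.match(r"(\d+):\s*(.+)", line)
        if match:
            supported[int(match.group(1))] = match.group(2)
    return supported, unsupported


# The two outputs pasted above, used as sample input.
V100_HELP = """Possible --gpu-metrics-device values are:
Some GPUs are not supported:
Tesla V100-SXM2-32GB PCI[0000:62:00.0]
"""

T4_HELP = """Possible --gpu-metrics-device values are:
0: Tesla T4 PCI[0000:81:00.0]
all: Select all supported GPUs
none: Disable GPU Metrics [Default]
"""

if __name__ == "__main__":
    for name, text in (("V100", V100_HELP), ("T4", T4_HELP)):
        ok, bad = parse_gpu_metrics_help(text)
        print(name, "supported:", ok, "unsupported:", bad)
```

The sample strings are copied from the outputs in this post. On a real machine the text would come from running the command and capturing its output, which the sketch leaves out.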