Hello,
I am experiencing an issue when attempting to collect GPU metrics using Nsight Systems on a server equipped with NVIDIA A100-SXM4-80GB GPUs.
I receive the following error message indicating that some GPUs are not supported.
$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
Some GPUs are not supported:
NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]
I have also attached my system information.
$ sudo nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 4
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1055-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
$ sudo nsys --version
NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0
$ sudo nvidia-smi --version
NVIDIA-SMI version : 550.54.15
NVML version : 550.54
DRIVER version : 550.54.15
CUDA Version : 12.4
Thank you for your assistance.
Please share the nsys-ui.log file that nsys creates in the current directory after running sudo nsys profile --gpu-metrics-device=0. To enable logging, navigate to the target directory (target-linux-x64, which can be found by running readlink -f `which nsys` | xargs dirname) and rename nvlog.config.template to nvlog.config.
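For reference, a minimal sketch of that sequence (assuming the default /opt/nvidia install path shown later in this thread; adjust paths to your setup):

# Locate the directory that contains the nsys binary
cd "$(readlink -f "$(which nsys)" | xargs dirname)"
# Enable logging by renaming the template
sudo mv nvlog.config.template nvlog.config
# Reproduce the failure; nsys-ui.log is written to the current directory
sudo nsys profile --gpu-metrics-device=0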
Here is the content of nsys-ui.log:
jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ cat nsys-ui.log
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'nsys-config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64;
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64;
I22:46:37:720|quadd_daemon|2903490|main.cpp:205[InitConfiguration]: NSys version[2024.4.1.61-244134315967v0]
I22:46:37:720|quadd_daemon|2903490|main.cpp:207[InitConfiguration]: Got config file path:
I22:46:37:720|quadd_daemon|2903490|main.cpp:126[InitPython]: Initing Embedded Python
I22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'python/lib' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/python/lib.
I22:46:37:731|quadd_daemon|2903490|main.cpp:133[InitPython]: Embedded Python Version: 3.8.3-final-0
I22:46:37:795|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 0 (status=0) took 8548632 ns
I22:46:37:814|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 1 (status=0) took 19318087 ns
I22:46:37:822|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 2 (status=0) took 7320359 ns
I22:46:37:830|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 3 (status=0) took 8332679 ns
I22:46:37:838|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 4 (status=0) took 8351465 ns
I22:46:37:847|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 5 (status=0) took 8269138 ns
I22:46:37:856|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 6 (status=0) took 8872949 ns
I22:46:37:864|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 7 (status=0) took 8655404 ns
I22:46:37:926|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:927|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:929|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:41:250|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'CudaGpuInfoDumper' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/CudaGpuInfoDumper.
I22:46:41:250|quadd_gpuinfo_dta|2903490|DevToolsApi.cpp:25[CreatePersistenceGuard]: EnableGpuPersistence = 1
I22:46:41:950|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:42:000|drvacc|2908429|:64[]: Driver module override for Cuda
I22:46:42:556|quadd_gpuinfo_cta|2903490|CudaToolsApi.cpp:441[InitializeGpuInfoListOutOfProcess]: Launching CudaGpuInfoDumper to collect gpu details for 1 processes
E22:46:43:786|quadd_daemon_evtsrc_pw_metrics|2903490|GpuMetrics.cpp:163[BeginSession]: Nvpw.GPU_PeriodicSampler_BeginSession_V2(¶ms): 1 (NVPA_STATUS_ERROR)
This could point to another nsys instance that is already running; pgrep nsys will show that.
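If a stale instance does show up, one possible cleanup (a hedged sketch, not an official nsys workflow) is:

# List any running nsys processes with their full command lines
pgrep -a nsys
# Terminate them before profiling again (check the pgrep output first)
sudo pkill -f nsys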
When I run pgrep nsys, nothing comes up…
Are you running DCGM? You can check with sudo service nvidia-dcgm status.
Yes:
jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-06-04 15:22:44 KST; 8h ago
Main PID: 686379 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 42.8M
CPU: 3h 33min 36.165s
CGroup: /system.slice/nvidia-dcgm.service
└─686379 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
DCGM and Nsys both use the same periodic sampling infrastructure and can't run concurrently. Try disabling DCGM before running Nsys.
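A minimal sketch of that workflow, assuming DCGM runs as the systemd nvidia-dcgm service shown above (restart it afterward so monitoring resumes; <app> is a placeholder for the workload you want to profile):

# Stop DCGM so it releases the GPU performance-monitoring hardware
sudo systemctl stop nvidia-dcgm
# Profile with GPU metrics enabled
sudo nsys profile --gpu-metrics-device=0 <app>
# Restart DCGM once profiling is done
sudo systemctl start nvidia-dcgm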
IntStr
June 4, 2024, 2:47pm
But another server that is running DCGM works fine.
jsh@pnode5:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-05-23 18:05:21 KST; 1 week 5 days ago
Main PID: 3807 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 385.3M
CPU: 6d 4h 49min 31.719s
CGroup: /system.slice/nvidia-dcgm.service
└─3807 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Notice: journal has been rotated since unit was started, output may be incomplete.
jsh@pnode5:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]
1: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI11/CI0]
2: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI13/CI0]
3: NVIDIA H100 80GB HBM3 PCI[0000:43:00.0]
4: NVIDIA H100 80GB HBM3 PCI[0000:52:00.0]
5: NVIDIA H100 80GB HBM3 PCI[0000:61:00.0]
6: NVIDIA H100 80GB HBM3 PCI[0000:9d:00.0]
7: NVIDIA H100 80GB HBM3 PCI[0000:c3:00.0]
8: NVIDIA H100 80GB HBM3 PCI[0000:d1:00.0]
9: NVIDIA H100 80GB HBM3 PCI[0000:df:00.0]
all: Select all supported GPUs
none: Disable GPU Metrics [Default]
Even after stopping nvidia-dcgm and running nsys again, the result is the same.
jsh@pnode4:~$ sudo service nvidia-dcgm stop
jsh@pnode4:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2024-06-04 23:53:31 KST; 1s ago
Process: 686379 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
Main PID: 686379 (code=exited, status=0/SUCCESS)
CPU: 3h 41min 21.169s
Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopping NVIDIA DCGM service...
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Deactivated successfully.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopped NVIDIA DCGM service.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Consumed 3h 41min 21.169s CPU time.
jsh@pnode4:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
Some GPUs are not supported:
NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction
At this point I have no suggestions other than to reboot. Could you try that?
IntStr
June 5, 2024, 4:05am
After deleting the DaemonSet with sudo kubectl delete ds dcgm-exporter -n prometheus instead of just running sudo service nvidia-dcgm stop, it worked perfectly!
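In case it helps others, a rough sketch of the steps (the grep check is just one way to locate the exporter; the dcgm-exporter name and prometheus namespace are specific to this cluster, and a Helm- or operator-managed DaemonSet may be recreated automatically):

# Find any DCGM exporter DaemonSets across namespaces
kubectl get ds -A | grep -i dcgm
# Remove the one holding the GPU profiling counters (adjust name/namespace)
sudo kubectl delete ds dcgm-exporter -n prometheus
# Re-check GPU metrics support
sudo nsys profile --gpu-metrics-device=help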
Thank you for your detailed and prompt responses so far.