However, another server (pnode5) that is running DCGM works fine.
jsh@pnode5:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-05-23 18:05:21 KST; 1 week 5 days ago
Main PID: 3807 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 385.3M
CPU: 6d 4h 49min 31.719s
CGroup: /system.slice/nvidia-dcgm.service
└─3807 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Notice: journal has been rotated since unit was started, output may be incomplete.
jsh@pnode5:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]
1: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI11/CI0]
2: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI13/CI0]
3: NVIDIA H100 80GB HBM3 PCI[0000:43:00.0]
4: NVIDIA H100 80GB HBM3 PCI[0000:52:00.0]
5: NVIDIA H100 80GB HBM3 PCI[0000:61:00.0]
6: NVIDIA H100 80GB HBM3 PCI[0000:9d:00.0]
7: NVIDIA H100 80GB HBM3 PCI[0000:c3:00.0]
8: NVIDIA H100 80GB HBM3 PCI[0000:d1:00.0]
9: NVIDIA H100 80GB HBM3 PCI[0000:df:00.0]
all: Select all supported GPUs
none: Disable GPU Metrics [Default]
Even after stopping nvidia-dcgm on pnode4 and running nsys again, the result is the same: the GPUs are still reported as not supported.
jsh@pnode4:~$ sudo service nvidia-dcgm stop
jsh@pnode4:~$ sudo service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2024-06-04 23:53:31 KST; 1s ago
Process: 686379 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
Main PID: 686379 (code=exited, status=0/SUCCESS)
CPU: 3h 41min 21.169s
Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopping NVIDIA DCGM service...
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Deactivated successfully.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopped NVIDIA DCGM service.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Consumed 3h 41min 21.169s CPU time.
jsh@pnode4:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
Some GPUs are not supported:
NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction
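To compare the two nodes without reading the output by eye, the text printed by `nsys profile --gpu-metrics-device=help` could be parsed into supported and unsupported devices. The following is only a rough sketch: the function names `parse_gpu_metrics_help` and `query_nsys` are hypothetical, the line formats are inferred solely from the two outputs pasted above, and I am assuming the listing may arrive on either stdout or stderr, so both are read.

```python
import re
import subprocess

# Indexed line, e.g. "0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]"
SUPPORTED_RE = re.compile(r"^\s*(\d+):\s+(.+?)\s+PCI\[([0-9a-fA-F:.]+)\]")
# Unindexed line, e.g. "NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]"
UNSUPPORTED_RE = re.compile(r"^\s*(.+?)\s+PCI\[([0-9a-fA-F:.]+)\]\s*$")


def parse_gpu_metrics_help(text):
    """Split the help listing into (supported, unsupported) device lists.

    supported:   list of (index, name, pci_address)
    unsupported: list of (name, pci_address)
    """
    supported, unsupported = [], []
    in_unsupported = False
    for line in text.splitlines():
        if line.strip().startswith("Some GPUs are not supported"):
            # Everything unindexed after this header is an unsupported GPU.
            in_unsupported = True
            continue
        m = SUPPORTED_RE.match(line)
        if m:
            supported.append((int(m.group(1)), m.group(2), m.group(3)))
            continue
        if in_unsupported:
            m = UNSUPPORTED_RE.match(line)
            if m:
                unsupported.append((m.group(1), m.group(2)))
    return supported, unsupported


def query_nsys():
    """Run the same command as in the post and parse its output.

    Assumption: nsys is on PATH and the caller has sufficient privileges
    (the post runs it with sudo).
    """
    result = subprocess.run(
        ["nsys", "profile", "--gpu-metrics-device=help"],
        capture_output=True,
        text=True,
    )
    return parse_gpu_metrics_help(result.stdout + "\n" + result.stderr)
```

Calling `query_nsys()` on pnode5 should return ten supported entries and no unsupported ones, while on pnode4 it should return no supported entries and the eight A100 devices as unsupported, matching the outputs above.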