Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems

However, another server (pnode5, with H100 GPUs) that is also running DCGM works fine.

jsh@pnode5:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-05-23 18:05:21 KST; 1 week 5 days ago
   Main PID: 3807 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 385.3M
        CPU: 6d 4h 49min 31.719s
     CGroup: /system.slice/nvidia-dcgm.service
             └─3807 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Notice: journal has been rotated since unit was started, output may be incomplete.
jsh@pnode5:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
        0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]
        1: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI11/CI0]
        2: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI13/CI0]
        3: NVIDIA H100 80GB HBM3 PCI[0000:43:00.0]
        4: NVIDIA H100 80GB HBM3 PCI[0000:52:00.0]
        5: NVIDIA H100 80GB HBM3 PCI[0000:61:00.0]
        6: NVIDIA H100 80GB HBM3 PCI[0000:9d:00.0]
        7: NVIDIA H100 80GB HBM3 PCI[0000:c3:00.0]
        8: NVIDIA H100 80GB HBM3 PCI[0000:d1:00.0]
        9: NVIDIA H100 80GB HBM3 PCI[0000:df:00.0]
        all: Select all supported GPUs
        none: Disable GPU Metrics [Default]

On the A100 server (pnode4), the result is the same even after stopping nvidia-dcgm and running nsys again: all A100 GPUs are still reported as not supported.

jsh@pnode4:~$ sudo service nvidia-dcgm stop
jsh@pnode4:~$ sudo service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-06-04 23:53:31 KST; 1s ago
    Process: 686379 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
   Main PID: 686379 (code=exited, status=0/SUCCESS)
        CPU: 3h 41min 21.169s

Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopping NVIDIA DCGM service...
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Deactivated successfully.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopped NVIDIA DCGM service.
Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Consumed 3h 41min 21.169s CPU time.
jsh@pnode4:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:

Some GPUs are not supported:
        NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]

See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction
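
To compare the two hosts more easily, a small filter along these lines could pull out only the GPUs that nsys lists as unsupported. This is a rough sketch, not an official tool: the function name list_unsupported_gpus is made up for illustration, and it assumes the help output keeps the layout shown above (a "Some GPUs are not supported:" line followed by indented GPU entries and then a blank line).

```shell
# Hypothetical helper: reads the --gpu-metrics-device=help text on stdin
# and prints each GPU listed under "Some GPUs are not supported:".
list_unsupported_gpus() {
    awk '
        # Start collecting after the section header.
        /^Some GPUs are not supported:/ { in_section = 1; next }
        # A blank line ends the section.
        in_section && /^[ \t]*$/        { in_section = 0 }
        # Inside the section, strip the indentation and print the entry.
        in_section                      { sub(/^[ \t]+/, ""); print }
    '
}
```

It could be used as `sudo nsys profile --gpu-metrics-device=help 2>&1 | list_unsupported_gpus`. With the layout assumed above, it would print nothing for output like pnode5's and one line per A100 for output like pnode4's.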