Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems

Hello,

I am experiencing an issue when attempting to collect GPU metrics using Nsight Systems on a server equipped with NVIDIA A100-SXM4-80GB GPUs.
I receive the following error message indicating that some GPUs are not supported.

$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:

Some GPUs are not supported:
    NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]

I have also attached my system settings.

$ sudo nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 4
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1055-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
$ sudo nsys --version
NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0
$ sudo nvidia-smi --version
NVIDIA-SMI version  : 550.54.15
NVML version        : 550.54
DRIVER version      : 550.54.15
CUDA Version        : 12.4

Thank you for your assistance.

@pkovalenko

Please share the nsys-ui.log file that nsys creates in the current directory after you run sudo nsys profile --gpu-metrics-device=0. To enable logging, navigate to the target directory (target-linux-x64; you can find it by running readlink -f `which nsys` | xargs dirname) and rename nvlog.config.template to nvlog.config.
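For anyone following along, the rename step above can be sketched as a small shell helper. This is only an illustration: enable_nvlog is a made-up function name, not something shipped with Nsight Systems, and it assumes you have write access to the install directory (run it as root if the directory is under /opt).

```shell
#!/usr/bin/env bash
# Sketch only: enable_nvlog is a hypothetical helper, not part of Nsight Systems.
# It renames nvlog.config.template to nvlog.config in the given directory.
enable_nvlog() {
    local dir="$1"
    # Nothing to do if logging was already enabled earlier.
    if [ -f "$dir/nvlog.config" ]; then
        return 0
    fi
    if [ ! -f "$dir/nvlog.config.template" ]; then
        echo "nvlog.config.template not found in $dir" >&2
        return 1
    fi
    mv "$dir/nvlog.config.template" "$dir/nvlog.config"
}

# Typical use (needs nsys on PATH and write access to its directory):
#   enable_nvlog "$(dirname "$(readlink -f "$(command -v nsys)")")"
```

After that, run the nsys command from the directory where you want nsys-ui.log to appear.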

Here is the content of the nsys-ui.log

jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ cat nsys-ui.log
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'nsys-config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64; 
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64; 
I22:46:37:720|quadd_daemon|2903490|main.cpp:205[InitConfiguration]: NSys version[2024.4.1.61-244134315967v0] 
I22:46:37:720|quadd_daemon|2903490|main.cpp:207[InitConfiguration]: Got config file path: 
I22:46:37:720|quadd_daemon|2903490|main.cpp:126[InitPython]: Initing Embedded Python
I22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'python/lib' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/python/lib.
I22:46:37:731|quadd_daemon|2903490|main.cpp:133[InitPython]: Embedded Python Version: 3.8.3-final-0
I22:46:37:795|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 0 (status=0) took 8548632 ns
I22:46:37:814|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 1 (status=0) took 19318087 ns
I22:46:37:822|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 2 (status=0) took 7320359 ns
I22:46:37:830|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 3 (status=0) took 8332679 ns
I22:46:37:838|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 4 (status=0) took 8351465 ns
I22:46:37:847|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 5 (status=0) took 8269138 ns
I22:46:37:856|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 6 (status=0) took 8872949 ns
I22:46:37:864|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 7 (status=0) took 8655404 ns
I22:46:37:926|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:927|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:929|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:41:250|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'CudaGpuInfoDumper' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/CudaGpuInfoDumper.
I22:46:41:250|quadd_gpuinfo_dta|2903490|DevToolsApi.cpp:25[CreatePersistenceGuard]: EnableGpuPersistence = 1
I22:46:41:950|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:42:000|drvacc|2908429|:64[]: Driver module override for Cuda
I22:46:42:556|quadd_gpuinfo_cta|2903490|CudaToolsApi.cpp:441[InitializeGpuInfoListOutOfProcess]: Launching CudaGpuInfoDumper to collect gpu details for 1 processes
E22:46:43:786|quadd_daemon_evtsrc_pw_metrics|2903490|GpuMetrics.cpp:163[BeginSession]: Nvpw.GPU_PeriodicSampler_BeginSession_V2(&params): 1 (NVPA_STATUS_ERROR)

This could point to another nsys instance that is already running. pgrep nsys will show that.

When I run pgrep nsys, nothing comes up.

Are you running DCGM?
$ sudo service nvidia-dcgm status

Yes

jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-06-04 15:22:44 KST; 8h ago
   Main PID: 686379 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 42.8M
        CPU: 3h 33min 36.165s
     CGroup: /system.slice/nvidia-dcgm.service
             └─686379 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

 Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555

DCGM and Nsys both use the same periodic sampling infrastructure and can’t run concurrently. Try disabling DCGM before running Nsys.

But another server that is running DCGM works fine.

jsh@pnode5:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-05-23 18:05:21 KST; 1 week 5 days ago
   Main PID: 3807 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 385.3M
        CPU: 6d 4h 49min 31.719s
     CGroup: /system.slice/nvidia-dcgm.service
             └─3807 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Notice: journal has been rotated since unit was started, output may be incomplete.
jsh@pnode5:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
        0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]
        1: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI11/CI0]
        2: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI13/CI0]
        3: NVIDIA H100 80GB HBM3 PCI[0000:43:00.0]
        4: NVIDIA H100 80GB HBM3 PCI[0000:52:00.0]
        5: NVIDIA H100 80GB HBM3 PCI[0000:61:00.0]
        6: NVIDIA H100 80GB HBM3 PCI[0000:9d:00.0]
        7: NVIDIA H100 80GB HBM3 PCI[0000:c3:00.0]
        8: NVIDIA H100 80GB HBM3 PCI[0000:d1:00.0]
        9: NVIDIA H100 80GB HBM3 PCI[0000:df:00.0]
        all: Select all supported GPUs
        none: Disable GPU Metrics [Default]

Even after stopping nvidia-dcgm and running nsys again, it is the same.

jsh@pnode4:~$ sudo service nvidia-dcgm stop
jsh@pnode4:~$ sudo service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-06-04 23:53:31 KST; 1s ago
    Process: 686379 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
   Main PID: 686379 (code=exited, status=0/SUCCESS)
        CPU: 3h 41min 21.169s

 Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopping NVIDIA DCGM service...
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Deactivated successfully.
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopped NVIDIA DCGM service.
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Consumed 3h 41min 21.169s CPU time.
jsh@pnode4:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:

Some GPUs are not supported:
        NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]

See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction

At this point I have no suggestions other than to reboot. Could you try that?

Stopping the service with sudo service nvidia-dcgm stop was not enough. After I deleted the daemonset with sudo kubectl delete ds dcgm-exporter -n prometheus, it worked perfectly!
Thank you for your detailed and prompt responses so far.
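
For future readers: in this thread the host nvidia-dcgm service was not the only DCGM consumer, since a dcgm-exporter daemonset was also running on the node. A rough way to spot both before running nsys is to filter the process list for DCGM processes. The sketch below is an assumption-laden illustration: list_dcgm_holders is a made-up helper name, and it assumes the exporter shows up in the host process list under the command name dcgm-exporter, which may not hold for every deployment.

```shell
#!/usr/bin/env bash
# Sketch only: list_dcgm_holders is a hypothetical helper name.
# Reads "PID COMMAND" lines on stdin and prints the ones whose command is
# nv-hostengine (host DCGM service) or dcgm-exporter (assumed process name
# of the Kubernetes exporter).
list_dcgm_holders() {
    awk '$2 == "nv-hostengine" || $2 == "dcgm-exporter" { print $1, $2 }'
}

# Typical use on the profiling host:
#   ps -eo pid=,comm= | list_dcgm_holders
```

If this prints anything, stop the corresponding service or daemonset first and then re-run sudo nsys profile --gpu-metrics-device=help.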