Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems

Hello,

I am experiencing an issue when attempting to collect GPU metrics using Nsight Systems on a server equipped with NVIDIA A100-SXM4-80GB GPUs.
I receive the following error message indicating that some GPUs are not supported.

$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:

Some GPUs are not supported:
    NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
    NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]

I have also included my system information below.

$ sudo nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 4
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1055-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
$ sudo nsys --version
NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0
$ sudo nvidia-smi --version
NVIDIA-SMI version  : 550.54.15
NVML version        : 550.54
DRIVER version      : 550.54.15
CUDA Version        : 12.4

Thank you for your assistance.

@pkovalenko

Please share the nsys-ui.log file created by nsys in the current directory after running sudo nsys profile --gpu-metrics-device=0. To enable logging, navigate to the target directory (target-linux-x64, which can be found by running readlink -f `which nsys` | xargs dirname) and rename nvlog.config.template to nvlog.config.
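For reference, a rough sequence along these lines should enable the log and reproduce the error (sleep 1 is only a placeholder workload; substitute your own application):

$ cd "$(readlink -f "$(which nsys)" | xargs dirname)"
$ sudo mv nvlog.config.template nvlog.config
$ cd ~
$ sudo nsys profile --gpu-metrics-device=0 sleep 1
$ cat nsys-ui.log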

Here is the content of nsys-ui.log:

jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ cat nsys-ui.log
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'nsys-config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64; 
W22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:172[FindInstalledFile]: File 'config.ini' is not found: /opt/nvidia/nsight-systems-cli/2024.4.1/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/host-linux-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/linux-desktop-glibc_2_11_3-x64; /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/target-linux-x64; 
I22:46:37:720|quadd_daemon|2903490|main.cpp:205[InitConfiguration]: NSys version[2024.4.1.61-244134315967v0] 
I22:46:37:720|quadd_daemon|2903490|main.cpp:207[InitConfiguration]: Got config file path: 
I22:46:37:720|quadd_daemon|2903490|main.cpp:126[InitPython]: Initing Embedded Python
I22:46:37:720|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'python/lib' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/python/lib.
I22:46:37:731|quadd_daemon|2903490|main.cpp:133[InitPython]: Embedded Python Version: 3.8.3-final-0
I22:46:37:795|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 0 (status=0) took 8548632 ns
I22:46:37:814|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 1 (status=0) took 19318087 ns
I22:46:37:822|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 2 (status=0) took 7320359 ns
I22:46:37:830|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 3 (status=0) took 8332679 ns
I22:46:37:838|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 4 (status=0) took 8351465 ns
I22:46:37:847|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 5 (status=0) took 8269138 ns
I22:46:37:856|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 6 (status=0) took 8872949 ns
I22:46:37:864|quadd_gpu_health|2903490|GpuHealth.cpp:70[OpenAllDevicesImpl]: Opening device 7 (status=0) took 8655404 ns
I22:46:37:926|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:927|quadd_linux_perf|2903490|environment.cpp:182[CheckOSAndKernel]: Detected distribution: Ubuntu. Kernel minimal requirement: 4.3.0-0. Detected kernel: 5.15.0-1055-nvidia
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:878[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): CPU Cycles(1) event is default hardware sampling trigger event
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:966[GetConfiguredTriggerEventID]: GetConfiguredTriggerEventID(): returning event id 1.
I22:46:37:927|quadd_linux_perf|2903490|event_selection_set.cpp:47[IsLBRBranchSamplingSupported]: LBR backtraces not supported.
I22:46:37:929|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:41:250|quadd_common_core|2903490|FileSystem.cpp:156[FindInstalledFile]: File 'CudaGpuInfoDumper' is found: /opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64/CudaGpuInfoDumper.
I22:46:41:250|quadd_gpuinfo_dta|2903490|DevToolsApi.cpp:25[CreatePersistenceGuard]: EnableGpuPersistence = 1
I22:46:41:950|drvacc|2903490|:64[]: Driver module override for Cuda
I22:46:42:000|drvacc|2908429|:64[]: Driver module override for Cuda
I22:46:42:556|quadd_gpuinfo_cta|2903490|CudaToolsApi.cpp:441[InitializeGpuInfoListOutOfProcess]: Launching CudaGpuInfoDumper to collect gpu details for 1 processes
E22:46:43:786|quadd_daemon_evtsrc_pw_metrics|2903490|GpuMetrics.cpp:163[BeginSession]: Nvpw.GPU_PeriodicSampler_BeginSession_V2(&params): 1 (NVPA_STATUS_ERROR)

This could point to another nsys instance that is already running. pgrep nsys will show that.
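For example (the -a flag simply adds the full command line of each match to the output):

$ pgrep -a nsys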

When I run pgrep nsys, nothing comes up.

Are you running DCGM?
$ sudo service nvidia-dcgm status

Yes

jsh@pnode4:/opt/nvidia/nsight-systems-cli/2024.4.1/target-linux-x64$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-06-04 15:22:44 KST; 8h ago
   Main PID: 686379 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 42.8M
        CPU: 3h 33min 36.165s
     CGroup: /system.slice/nvidia-dcgm.service
             └─686379 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

 Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555

DCGM and Nsys both use the same periodic sampling infrastructure and can’t run concurrently. Try disabling DCGM before running Nsys.
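A minimal sketch of that workflow, assuming DCGM runs as the nvidia-dcgm systemd service (./my_app is a placeholder for the application you want to profile):

$ sudo service nvidia-dcgm stop
$ sudo nsys profile --gpu-metrics-device=all ./my_app
$ sudo service nvidia-dcgm start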


But another server that is also running DCGM works fine:

jsh@pnode5:~$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-05-23 18:05:21 KST; 1 week 5 days ago
   Main PID: 3807 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 385.3M
        CPU: 6d 4h 49min 31.719s
     CGroup: /system.slice/nvidia-dcgm.service
             └─3807 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Notice: journal has been rotated since unit was started, output may be incomplete.
jsh@pnode5:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
        0: NVIDIA H100 80GB HBM3 PCI[0000:1b:00.0] MIG[physical GPU]
        1: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI11/CI0]
        2: NVIDIA H100 80GB HBM3 MIG 1g.10gb PCI[0000:1b:00.0] MIG[GI13/CI0]
        3: NVIDIA H100 80GB HBM3 PCI[0000:43:00.0]
        4: NVIDIA H100 80GB HBM3 PCI[0000:52:00.0]
        5: NVIDIA H100 80GB HBM3 PCI[0000:61:00.0]
        6: NVIDIA H100 80GB HBM3 PCI[0000:9d:00.0]
        7: NVIDIA H100 80GB HBM3 PCI[0000:c3:00.0]
        8: NVIDIA H100 80GB HBM3 PCI[0000:d1:00.0]
        9: NVIDIA H100 80GB HBM3 PCI[0000:df:00.0]
        all: Select all supported GPUs
        none: Disable GPU Metrics [Default]

Even after stopping nvidia-dcgm and running nsys again, the result is the same.

jsh@pnode4:~$ sudo service nvidia-dcgm stop
jsh@pnode4:~$ sudo service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-06-04 23:53:31 KST; 1s ago
    Process: 686379 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
   Main PID: 686379 (code=exited, status=0/SUCCESS)
        CPU: 3h 41min 21.169s

 Jun 04 15:22:44 pnode4.idc1.ten1010.io systemd[1]: Started NVIDIA DCGM service.
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: DCGM initialized
 Jun 04 15:22:48 pnode4.idc1.ten1010.io nv-hostengine[686379]: Started host engine version 3.1.8 using port number: 5555
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopping NVIDIA DCGM service...
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Deactivated successfully.
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: Stopped NVIDIA DCGM service.
 Jun 04 23:53:31 pnode4.idc1.ten1010.io systemd[1]: nvidia-dcgm.service: Consumed 3h 41min 21.169s CPU time.
jsh@pnode4:~$ sudo nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:

Some GPUs are not supported:
        NVIDIA A100-SXM4-80GB PCI[0000:07:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:0f:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:47:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:4e:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:87:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:90:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:b7:00.0]
        NVIDIA A100-SXM4-80GB PCI[0000:bd:00.0]

See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction

At this point I have no suggestions other than to reboot. Could you try that?

After deleting the daemonset with sudo kubectl delete ds dcgm-exporter -n prometheus instead of running sudo service nvidia-dcgm stop, it worked perfectly!
Thank you for your detailed and prompt responses.
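For anyone hitting the same issue on a Kubernetes node, it may also be worth checking whether a containerized DCGM exporter is still running after the host service is stopped, since it appears to hold the same sampling infrastructure. The namespace and daemonset name below are the ones from my cluster; yours may differ:

$ sudo kubectl get ds -n prometheus
$ sudo kubectl get pods -A -o wide | grep dcgm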

