Failure when starting NVSM services

Hi there,

I’m trying to use NVSM (NVIDIA System Management) on my server, but the following services fail to start:

  • nvsm-api-gateway.service
  • nvsm-exporter.service
  • nvsm-notifier.service

I’ve tried stopping, disabling, re-enabling, and then starting the services, which worked for nvsm-exporter.service and nvsm-notifier.service. But I haven’t managed to start nvsm-api-gateway.service yet…

Usually, the service gets stuck during systemctl start …, which runs for 10+ minutes and still fails.

The output of systemctl status is:

root@ubuntu:~# systemctl status nvsm-api-gateway.service 
○ nvsm-api-gateway.service - NVSM API Server to provide DGX System Management APIs
     Loaded: loaded (/lib/systemd/system/nvsm-api-gateway.service; enabled; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Tue 2024-02-27 06:03:19 UTC; 9min ago
    Process: 201244 ExecStart=/usr/sbin/nvsm_api_gateway (code=exited, status=1/FAILURE)
   Main PID: 201244 (code=exited, status=1/FAILURE)
        CPU: 722ms

Feb 27 06:03:19 ubuntu systemd[1]: nvsm-api-gateway.service: Scheduled restart job, restart counter is at 149.
Feb 27 06:03:19 ubuntu systemd[1]: Stopped NVSM API Server to provide DGX System Management APIs.
root@ubuntu:~# 
root@ubuntu:~# systemctl status nvsm-core.service --no-pager -l
● nvsm-core.service - NVSM Core Service for System Management
     Loaded: loaded (/lib/systemd/system/nvsm-core.service; enabled; vendor preset: enabled)
     Active: activating (start-post) since Tue 2024-02-27 06:12:49 UTC; 4min 13s ago
    Process: 202718 ExecStart=/usr/sbin/nvsm_core --mode=server SERVE (code=exited, status=0/SUCCESS)
   Main PID: 202718 (code=exited, status=0/SUCCESS); Control PID: 202719 (sh)
      Tasks: 2 (limit: 629145)
     Memory: 2.7M
        CPU: 1.816s
     CGroup: /system.slice/nvsm-core.service
             ├─202719 /bin/sh -c "until /usr/sbin/nvsm_core --mode=client --nocolor GET / | jq .Code | grep \"200\"; do sleep 10; done;"
             └─203353 sleep 10

Feb 27 06:12:49 ubuntu systemd[1]: Starting NVSM Core Service for System Management...
Feb 27 06:12:49 ubuntu nvsm_core[202718]: {
Feb 27 06:12:49 ubuntu nvsm_core[202718]:   "Code": 500,
Feb 27 06:12:49 ubuntu nvsm_core[202718]:   "Message": "Failed to find a matching definition file for this platform.\nPlease toggle autogenerate_pdf flag from nvsm.config for auto generating pdf."
Feb 27 06:12:49 ubuntu nvsm_core[202718]: }
root@ubuntu:~# 
root@ubuntu:~# systemctl status nvsm-exporter.service --no-pager -l
○ nvsm-exporter.service - NVSM Exporter to provide DGX System Management Metrics
     Loaded: loaded (/lib/systemd/system/nvsm-exporter.service; enabled; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Tue 2024-02-27 06:12:49 UTC; 4min 59s ago
    Process: 202931 ExecStart=/usr/sbin/nvsm_exporter (code=exited, status=1/FAILURE)
   Main PID: 202931 (code=exited, status=1/FAILURE)
        CPU: 31ms

Feb 27 06:12:49 ubuntu systemd[1]: nvsm-exporter.service: Scheduled restart job, restart counter is at 38.
Feb 27 06:12:49 ubuntu systemd[1]: Stopped NVSM Exporter to provide DGX System Management Metrics.
root@ubuntu:~# 
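
A note on the hang: the nvsm-core output above shows a start-post loop that calls nvsm_core --mode=client --nocolor GET / every 10 seconds until the reply contains 200, and the loop itself has no attempt limit. While the core answers with Code 500, that loop cannot finish, which would be consistent with systemctl start appearing stuck. The sketch below is a bounded version of the same poll for running by hand. reply_code and wait_for_ok are hypothetical helper names, not part of NVSM, and sed stands in for jq only to avoid a dependency.

```shell
# Print the numeric "Code" field from a JSON reply read on stdin.
# Assumption: the reply has the shape shown in the log above ("Code": <number>).
reply_code() {
    sed -n 's/.*"Code"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Poll a command until its reply carries Code 200, giving up after a
# fixed number of attempts instead of looping forever.
# Usage: wait_for_ok <max_attempts> <delay_seconds> <command> [args...]
wait_for_ok() {
    local max_attempts=$1 delay=$2 attempt=1 code
    shift 2
    while [ "$attempt" -le "$max_attempts" ]; do
        code=$("$@" 2>/dev/null | reply_code)
        if [ "$code" = "200" ]; then
            return 0
        fi
        # Report each failed attempt on stderr so stdout stays clean.
        echo "attempt $attempt/$max_attempts: Code=${code:-none}" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, wait_for_ok 6 10 /usr/sbin/nvsm_core --mode=client --nocolor GET / would try for about a minute and then return a non-zero status, rather than waiting indefinitely.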

Here is my system info as well, for reference:

  • OS: DGX OS 5.15.0-1045-nvidia
  • GPUs: NVIDIA A100-SXM4-80GB (8 GPUs in total)
  • CUDA version: V12.3.107
  • NVIDIA driver version: 535.154.05

I’ve also attached my nvsm.config here:
nvsm.config.log (5.7 KB)

Thank you in advance.

Hi @s950048 ,

Is that system a DGX A100? If so, have you updated the firmware to the latest version?

ScottE

Hi @ScottEllis ,

No, my system is an HGX A100, and yes, the firmware is up to date.

Thanks for your reply

Richard

Ah, that explains it then! NVSM does not support HGX systems.
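
If it is unclear what a given machine identifies as, a rough check along these lines may help before installing NVSM. This is only a sketch: looks_like_dgx and report_platform are hypothetical helper names, and the rule that a DGX product name contains the string "DGX" is an assumption, not something NVSM documents in this thread.

```shell
# Return 0 if the given product-name string looks like a DGX system.
# Assumption: DGX systems report a product name containing "DGX".
looks_like_dgx() {
    case "$1" in
        *DGX*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Print a one-line verdict for a product-name string.
# $1: product name, for example the output of
#     dmidecode -s system-product-name   (needs root)
report_platform() {
    if looks_like_dgx "$1"; then
        echo "product name contains DGX: $1"
    else
        echo "product name does not contain DGX: $1"
    fi
}
```

On a real machine this could be called as report_platform "$(sudo dmidecode -s system-product-name)".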

Where did you install the NVSM package from?

ScottE

A-ha!
I installed DGX OS (for other experiments), and was then able to install NVSM through apt…

Thanks for your answer!
Richard
