Nvidia-fabricmanager Error on H100 SXM: received NVLink inband message arrived on an NVLink xx which is not part of any active partition

Hi, I recently ran into a problem on an H100-SXM platform. First, when I use torch, the error output is as follows:

Python 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/jlxu/anaconda3/envs/vllm-xjl-env/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
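
For context, on NVSwitch-based HGX/DGX systems CUDA error 802 ("system not yet initialized") generally means the driver is still waiting for Fabric Manager to finish NVLink/NVSwitch initialization. The per-GPU fabric state can be read through NVML; below is a minimal sketch, assuming a driver recent enough for nvml.h to expose nvmlDeviceGetGpuFabricInfo (the state names in the comment are from that header):

/* check_fabric_state.c: print the per-GPU NVLink fabric state reported by NVML.
 * Sketch only; assumes an R535-or-newer driver whose nvml.h exposes
 * nvmlDeviceGetGpuFabricInfo. Build: gcc check_fabric_state.c -lnvidia-ml */
#include <stdio.h>
#include <string.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit_v2 failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS)
            continue;

        nvmlGpuFabricInfo_t fabric;
        memset(&fabric, 0, sizeof(fabric));
        rc = nvmlDeviceGetGpuFabricInfo(dev, &fabric);
        if (rc != NVML_SUCCESS) {
            printf("GPU %u: fabric info unavailable (%s)\n", i, nvmlErrorString(rc));
            continue;
        }
        /* A state other than NVML_GPU_FABRIC_STATE_COMPLETED means the driver
         * is still waiting on Fabric Manager, which matches CUDA error 802. */
        printf("GPU %u: fabric state=%u status=%s\n",
               i, (unsigned int)fabric.state, nvmlErrorString(fabric.status));
    }

    nvmlShutdown();
    return 0;
}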

I found that the nvidia-fabricmanager service is dead; its status is shown below:

○ nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

Then I checked the nvidia-fabricmanager log at /var/log/fabricmanager.log; the errors are below:

[Oct 18 2024 09:41:56] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 1 port 47 which is not part of any active partition
[Oct 18 2024 09:41:59] [INFO] [tid 24234] Received an inband message:  Message header details: magic Id:adbc request Id:17ff355ac4df1da8 status:0 type:0 length:4d
Message payload details: Probe request: Pci Info:100004400 Module Id:1 Uuid:GPU-29f1f5f2-f3f0-12de-9d07-c2654256ee67 Discovered LinkMask:3ffff Enabled LinkMask:3ffff Cap Mask:2

[Oct 18 2024 09:41:59] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 2 port 47 which is not part of any active partition
[Oct 18 2024 09:42:02] [INFO] [tid 24234] Received an inband message:  Message header details: magic Id:adbc request Id:17ff355b5da3e678 status:0 type:0 length:4d
Message payload details: Probe request: Pci Info:100004e00 Module Id:3 Uuid:GPU-f7dee7c8-144a-7f53-7959-eed789c11931 Discovered LinkMask:3ffff Enabled LinkMask:3ffff Cap Mask:2

[Oct 18 2024 09:42:02] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 2 port 2 which is not part of any active partition

I tried to restart the nvidia-fabricmanager service, but it hung for a long time and then failed to start. My nvidia-fabricmanager config file is here:

# NVIDIA Fabric Manager configuration file.
# Note: This configuration file is read during Fabric Manager service startup. So, Fabric Manager 
# service restart is required for new settings to take effect.

#       Description: Fabric Manager logging levels
#       Possible Values:
#               0  - All the logging is disabled
#               1  - Set log level to CRITICAL and above
#               2  - Set log level to ERROR and above
#               3  - Set log level to WARNING and above
#               4  - Set log level to INFO and above
LOG_LEVEL=4

#       Description: Filename for Fabric Manager logs
#       Possible Values:
#       Full path/filename string (max length of 256). Logs will be redirected to console(stderr)
#       if the specified log file can't be opened or the path is empty.
LOG_FILE_NAME=/var/log/fabricmanager.log

#       Description: Append to an existing log file or overwrite the logs
#       Possible Values:
#               0  - No  (Log file will be overwritten)
#               1  - Yes (Append to existing log)
LOG_APPEND_TO_LOG=1

#       Description: Max size of log file (in MB)
#       Possible Values:
#               Any Integer values
LOG_FILE_MAX_SIZE=1024

#   Description: Number of times the FM log is rotated once it reaches LOG_FILE_MAX_SIZE
#   Possible Values:
#       0                - Log is not rotated. Logging is stopped once the FM log file reaches
#                          the size specified in LOG_FILE_MAX_SIZE
#       Non-zero Integer - Log is rotated up to the number of times specified in LOG_MAX_ROTATE_COUNT,
#                          after the size of the log file reaches the size specified in LOG_FILE_MAX_SIZE.
#                          Combined FM log size is LOG_FILE_MAX_SIZE multiplied by LOG_MAX_ROTATE_COUNT+1.
#                          Once this threshold is reached, the oldest log file is purged and reused.
LOG_MAX_ROTATE_COUNT=3

#       Description: Redirect all the logs to syslog instead of logging to file
#       Possible Values:
#               0  - No
#               1  - Yes
#LOG_USE_SYSLOG=0
LOG_USE_SYSLOG=0

#       Description: daemonize Fabric Manager on start-up
#       Possible Values:
#       0  - No (Do not daemonize and run fabric manager as a normal process)
#       1  - Yes (Run Fabric Manager process as Unix daemon)
DAEMONIZE=1

#       Description: Network interface to listen for Global and Local Fabric Manager communication
#       Possible Values:
#               A valid IPv4 address. By default, uses loopback (127.0.0.1) interface
BIND_INTERFACE_IP=127.0.0.1

#       Description: Starting TCP port number for Global and Local Fabric Manager communication
#       Possible Values:
#               Any value between 0 and 65535
STARTING_TCP_PORT=16000

#   Description: Use Unix sockets instead of TCP Socket for Global and Local Fabric Manager communication
#       Possible Values:
#               Unix domain socket path (max length of 256)
#       Default Value: 
#               Empty String (TCP socket will be used instead of Unix sockets)
UNIX_SOCKET_PATH=

#       Description: Fabric Manager Operating Mode
#       Possible Values:
#       0  - Start Fabric Manager in Bare metal or Full pass through virtualization mode
#       1  - Start Fabric Manager in Shared NVSwitch multitenancy mode. 
#       2  - Start Fabric Manager in vGPU based multitenancy mode.
#FABRIC_MODE=0
FABRIC_MODE=1

#       Description: Restart Fabric Manager after exit. Applicable only in Shared NVSwitch or vGPU based multitenancy mode
#       Possible Values:
#       0  - Start Fabric Manager and follow full initialization sequence
#       1  - Start Fabric Manager and follow Shared NVSwitch or vGPU based multitenancy mode resiliency/restart sequence.
FABRIC_MODE_RESTART=0

#       Description: Specify the filename to be used to save Fabric Manager states.
#                    Valid only if Shared NVSwitch or vGPU based multitenancy mode is enabled
#       Possible Values:
#           Full path/filename string (max length of 256)
STATE_FILE_NAME=/tmp/fabricmanager.state

#       Description: Network interface to listen for Fabric Manager SDK/API to communicate with running FM instance.
#       Possible Values:
#               A valid IPv4 address. By default, uses loopback (127.0.0.1) interface
FM_CMD_BIND_INTERFACE=127.0.0.1 

#       Description: TCP port number for Fabric Manager SDK/API to communicate with running FM instance.
#       Possible Values:
#               Any value between 0 and 65535
FM_CMD_PORT_NUMBER=6666

#       Description: Use Unix sockets instead of TCP Socket for Fabric Manager SDK/API communication
#       Possible Values:
#               Unix domain socket path (max length of 256)
#       Default Value: 
#               Empty String (TCP socket will be used instead of Unix sockets)
FM_CMD_UNIX_SOCKET_PATH=

#   Description: Fabric Manager does not exit when facing failures
#   Possible Values:
#       0 – Fabric Manager service will terminate on errors such as, NVSwitch and GPU config failure, 
#           typical software errors etc.  
#       1 – Fabric Manager service will stay running on errors such as, NVSwitch and GPU config failure, 
#           typical software errors etc. However, the system will be uninitialized and CUDA application 
#           launch will fail. 
FM_STAY_RESIDENT_ON_FAILURES=0

#   Description: Degraded Mode options when there is an Access Link Failure (GPU to NVSwitch NVLink failure)
#   Possible Values:
#       In bare metal or full passthrough virtualization mode
#       0  - Remove the GPU with the Access NVLink failure from NVLink P2P capability
#       1  - Disable the NVSwitch and its peer NVSwitch, which reduces NVLink P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
#       In Shared NVSwitch or vGPU based multitenancy mode
#       0  - Disable partitions which are using the Access Link failed GPU
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#       1  - Disable the NVSwitch and its peer NVSwitch,
#            all partitions will be available but with reduced NVLink P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
ACCESS_LINK_FAILURE_MODE=0

#   Description: Degraded Mode options when there is a Trunk Link Failure (NVSwitch to NVSwitch NVLink failure)
#   Possible Values:
#       In bare metal or full passthrough virtualization mode
#       0  - Exit Fabric Manager and leave the system/NVLinks uninitialized
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#       1  - Disable the NVSwitch and its peer NVSwitch, which reduces NVLink P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
#       In Shared NVSwitch or vGPU based multitenancy mode
#       0  - Remove partitions that are using the Trunk NVLinks
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#       1  - Disable the NVSwitch and its peer NVSwitch,
#            all partitions will be available but with reduced NVLink P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
TRUNK_LINK_FAILURE_MODE=0

#   Description: Degraded Mode options when there is a NVSwitch failure or an NVSwitch is excluded
#   Possible Values:
#       In bare metal or full passthrough virtualization mode
#       0  - Abort Fabric Manager
#       1  - Disable the NVSwitch and its peer NVSwitch, which reduces P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
#       In Shared NVSwitch or vGPU based multitenancy mode
#       0  - Disable partitions that are using the NVSwitch
#            This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#       1  - Disable the NVSwitch and its peer NVSwitch,
#            all partitions will be available but with reduced NVLink P2P bandwidth
#            Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
NVSWITCH_FAILURE_MODE=0

#       Description: Control running CUDA jobs behavior when Fabric Manager service is stopped or terminated
#       Possible Values:
#       0  - Do not abort running CUDA jobs when Fabric Manager exits. However new CUDA job launch will fail.
#       1  - Abort all running CUDA jobs when Fabric Manager exits.
#            Note: This is not effective on DGX H100 and HGX H100 NVSwitch based systems
ABORT_CUDA_JOBS_ON_FM_EXIT=1

#       Description: Absolute directory path containing Fabric Manager topology files
#       Possible Values:
#               A valid directory path string (max length of 256)
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch

#   Description: Use RPC instead of raw sockets for communication between global and local fabric manager.
#   Possible values:
#       0 - Do not use RPCs
#       1 - Use RPC
USE_RPC=1

#  Description: Name of the network interface used for communication.
#               OPTIONAL - If empty, network interface will be determined by matching bind IP to
#                          node configuration file.  Only necessary to configure if the bind IP
#                          is present on multiple network interfaces.
#  Possible Values:
#      Interface names like eth0, ens32 .. etc
#  Default Value:
NETWORK_INTERFACE=


#  Description:  Enable authentication and encryption for RPC communication.
#                NOTE: If USE_RPC is 0, this is ignored.
#  Possible Values:
#      0:  Disable encryption and authentication
#      1:  Enable encryption and authentication
#  Default value: 0
ENABLE_AUTH_ENCRYPTION=0

#  Description:  This determines how fabric manager will try to retrieve the keys, certificates, and certificate
#                authority for authentication and encryption.
#                If ENABLE_AUTH_ENCRYPTION is enabled (1), then AUTH_SOURCE must be configured
#                as one of the supported values.  An empty or unexpected value will prevent initialization.
#                If USE_RPC is 0, this is ignored.
#  Possible Values:
#      FILE:      The provided values are paths to files on the file system.
#      ENV_PATH:  The provided values are environment variable names to retrieve, and the values in the
#                 environment variables are treated as paths to files on the file system.
#      ENV_VAL:   The provided values are environment variable names to retrieve, and the values in the
#                 environment variables are treated as the actual values for the key/cert/cert auth.
AUTH_SOURCE=

# Description:  These fields are interpreted based on how AUTH_SOURCE is configured
SERVER_KEY=
SERVER_CERT=
SERVER_CERT_AUTH=
CLIENT_KEY=
CLIENT_CERT=
CLIENT_CERT_AUTH=

#  Description:  Override the target hostname for authentication of the certificates and keys.  This allows
#                certificates with common names that do not match the ip addresses provided for the nodes.
#                Example:
#                  If the certificate has the subject:
#                    "/C=US/ST=CA/L=Santa Clara/O=NVIDIA/OU=Test/CN=localhost"
#                  The certificate validation will expect the connection hostname to be "localhost". By
#                  setting SECURITY_TARGET_OVERRIDE=localhost you can override the connection hostname
#                  used for security purposes to be "localhost", allowing the connection to succeed.
#                If USE_RPC is 0, this is ignored.
SECURITY_TARGET_OVERRIDE=

#  Description: This tunable determines the domain behavior in case the trunk links are down or experience
#               fatal errors. When MNNVL_RESILIENCY_MODE is set to 0, in case a single trunk link fails, the
#               inter-node traffic between the impacted L1 switch node and the rest of the domain will be
#               unavailable. When MNNVL_RESILIENCY_MODE is set to 1, the domain can sustain failures of up
#               to (but fewer than) half of the trunk links on an L1 switch before the inter-node traffic
#               between this L1 switch node and the rest of the domain becomes unavailable.
MNNVL_RESILIENCY_MODE=0

#       Description: Determine whether a default partition needs to be created
#       Possible Values:
#       0             - No partitions are created during GFM initialization. GFM disables routing until an API request 
#                     to create a partition is successful.
#       1(default)    - Creates a default partition during GFM initialization. GFM creates the partition to include
#                     all GPUs in the topology and enables routing so that all GPUs can communicate to each other

How can I solve this problem? Thanks very much!

Hi,

The message “Received NVLink inband message on NVLink port NodeId 0 NVSwitch 1 port 47, which is not part of any active partition” indicates that the GPU partition may not be activated.

Have you tried activating the GPU partition?

For more details, please refer to the NVIDIA Fabric Manager User Guide.
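
As a rough illustration only, the partition query and activation flow described in the Fabric Manager SDK chapter of that guide looks like the following minimal sketch. It assumes the nv_fm_agent.h header and libnvfm library from the fabric manager development package are installed, and the struct and field names should be verified against the installed header:

/* activate_partition.c: list the supported NVSwitch fabric partitions and
 * activate one, following the Fabric Manager SDK sample in the user guide.
 * Sketch only; verify struct/field names against your nv_fm_agent.h.
 * Build: gcc activate_partition.c -lnvfm */
#include <stdio.h>
#include <string.h>
#include <nv_fm_agent.h>

int main(void)
{
    fmReturn_t rc;
    fmHandle_t handle = NULL;

    rc = fmLibInit();
    if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmLibInit failed: %d\n", rc);
        return 1;
    }

    /* Connect to the running Fabric Manager instance through the FM_CMD
     * interface (FM_CMD_BIND_INTERFACE / FM_CMD_PORT_NUMBER in the config). */
    fmConnectParams_t connectParams;
    memset(&connectParams, 0, sizeof(connectParams));
    connectParams.version = fmConnectParams_version;
    connectParams.timeoutMs = 1000;
    connectParams.addressIsUnixSocket = 0;
    strncpy(connectParams.addressInfo, "127.0.0.1", sizeof(connectParams.addressInfo) - 1);

    rc = fmConnect(&connectParams, &handle);
    if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmConnect failed: %d (is nvidia-fabricmanager running?)\n", rc);
        fmLibShutdown();
        return 1;
    }

    /* Query which partitions the topology supports and whether they are active. */
    static fmFabricPartitionList_t list;
    memset(&list, 0, sizeof(list));
    list.version = fmFabricPartitionList_version;
    rc = fmGetSupportedFabricPartitions(handle, &list);
    if (rc == FM_ST_SUCCESS) {
        for (unsigned int i = 0; i < list.numPartitions; i++) {
            printf("partition %u: isActive=%u numGpus=%u\n",
                   list.partitionInfo[i].partitionId,
                   list.partitionInfo[i].isActive,
                   list.partitionInfo[i].numGpus);
        }
    }

    /* Activate a partition reported above so that the GPU inband probes land
     * in an active partition; the ID here is a placeholder. */
    fmFabricPartitionId_t partitionId = 0;
    rc = fmActivateFabricPartition(handle, partitionId);
    printf("fmActivateFabricPartition(%u) returned %d\n", partitionId, rc);

    fmDisconnect(handle);
    fmLibShutdown();
    return 0;
}

Note that explicit partition activation only applies when FABRIC_MODE=1 (Shared NVSwitch multitenancy mode), as in the posted configuration; in bare metal mode (FABRIC_MODE=0) Fabric Manager sets up routing itself without any partition activation step.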

Thanks,
Chen