Hi, I recently ran into a problem on an H100-SXM platform. First, when I use torch, the error output is below:
Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/jlxu/anaconda3/envs/vllm-xjl-env/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
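For reference, the same failure shows up outside of torch as well. Below is a minimal sketch (assuming the NVIDIA driver's libcuda.so.1 is on the loader path) that calls the CUDA driver API directly via ctypes:

import ctypes

# Load the CUDA driver library shipped with the NVIDIA driver (assumed soname: libcuda.so.1).
libcuda = ctypes.CDLL("libcuda.so.1")

# cuInit() is expected to fail with error 802 ("system not yet initialized") while fabric manager
# is down on an NVSwitch-based system, matching the torch warning above; 0 would mean CUDA_SUCCESS.
rc = libcuda.cuInit(0)
print("cuInit returned", rc)

count = ctypes.c_int(0)
rc = libcuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount returned", rc, "device count:", count.value)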
I found that the nvidia-fabricmanager service is dead; its status is below:
○ nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: inactive (dead)
Then I checked the fabric manager log at /var/log/fabricmanager.log; the errors are below:
[Oct 18 2024 09:41:56] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 1 port 47 which is not part of any active partition
[Oct 18 2024 09:41:59] [INFO] [tid 24234] Received an inband message: Message header details: magic Id:adbc request Id:17ff355ac4df1da8 status:0 type:0 length:4d
Message payload details: Probe request: Pci Info:100004400 Module Id:1 Uuid:GPU-29f1f5f2-f3f0-12de-9d07-c2654256ee67 Discovered LinkMask:3ffff Enabled LinkMask:3ffff Cap Mask:2
[Oct 18 2024 09:41:59] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 2 port 47 which is not part of any active partition
[Oct 18 2024 09:42:02] [INFO] [tid 24234] Received an inband message: Message header details: magic Id:adbc request Id:17ff355b5da3e678 status:0 type:0 length:4d
Message payload details: Probe request: Pci Info:100004e00 Module Id:3 Uuid:GPU-f7dee7c8-144a-7f53-7959-eed789c11931 Discovered LinkMask:3ffff Enabled LinkMask:3ffff Cap Mask:2
[Oct 18 2024 09:42:02] [ERROR] [tid 24226] received NVLink inband message arrived on an NVLink port NodeId 0 NVSwitch 2 port 2 which is not part of any active partition
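Before retrying a restart, I collected the service, journal, and GPU state with a small helper script (just a convenience wrapper around systemctl, journalctl, and nvidia-smi, which are assumed to be in PATH; journalctl may need root):

import subprocess

def run(cmd):
    # Print the command and its output so everything ends up in one place.
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

run(["systemctl", "status", "nvidia-fabricmanager", "--no-pager"])
run(["journalctl", "-u", "nvidia-fabricmanager", "-n", "50", "--no-pager"])
run(["nvidia-smi", "-q", "-i", "0"])  # per-GPU query; on H100 NVSwitch systems this should include a Fabric State/Status section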
I tried to restart the nvidia-fabricmanager service, but it hung for a long time and failed to start. My fabric manager config file is here:
# NVIDIA Fabric Manager configuration file.
# Note: This configuration file is read during Fabric Manager service startup. So, Fabric Manager
# service restart is required for new settings to take effect.
# Description: Fabric Manager logging levels
# Possible Values:
# 0 - All the logging is disabled
# 1 - Set log level to CRITICAL and above
# 2 - Set log level to ERROR and above
# 3 - Set log level to WARNING and above
# 4 - Set log level to INFO and above
LOG_LEVEL=4
# Description: Filename for Fabric Manager logs
# Possible Values:
# Full path/filename string (max length of 256). Logs will be redirected to console(stderr)
# if the specified log file can't be opened or the path is empty.
LOG_FILE_NAME=/var/log/fabricmanager.log
# Description: Append to an existing log file or overwrite the logs
# Possible Values:
# 0 - No (Log file will be overwritten)
# 1 - Yes (Append to existing log)
LOG_APPEND_TO_LOG=1
# Description: Max size of log file (in MB)
# Possible Values:
# Any Integer values
LOG_FILE_MAX_SIZE=1024
# Description: Number of times the FM log is rotated once it reaches LOG_FILE_MAX_SIZE
# Possible Values:
# 0 - Log is not rotated. Logging is stopped once the FM log file reaches
# the size specified in LOG_FILE_MAX_SIZE
# Non-zero Integer - Log is rotated up to the number of times specified in LOG_MAX_ROTATE_COUNT,
# after the size of the log file reaches the size specified in LOG_FILE_MAX_SIZE.
# Combined FM log size is LOG_FILE_MAX_SIZE multiplied by LOG_MAX_ROTATE_COUNT+1
# Once this threshold is reached, the oldest log file is purged and reused.
LOG_MAX_ROTATE_COUNT=3
# Description: Redirect all the logs to syslog instead of logging to file
# Possible Values:
# 0 - No
# 1 - Yes
#LOG_USE_SYSLOG=0
LOG_USE_SYSLOG=0
# Description: daemonize Fabric Manager on start-up
# Possible Values:
# 0 - No (Do not daemonize and run fabric manager as a normal process)
# 1 - Yes (Run Fabric Manager process as Unix daemon)
DAEMONIZE=1
# Description: Network interface to listen for Global and Local Fabric Manager communication
# Possible Values:
# A valid IPv4 address. By default, uses loopback (127.0.0.1) interface
BIND_INTERFACE_IP=127.0.0.1
# Description: Starting TCP port number for Global and Local Fabric Manager communication
# Possible Values:
# Any value between 0 and 65535
STARTING_TCP_PORT=16000
# Description: Use Unix sockets instead of TCP Socket for Global and Local Fabric Manager communication
# Possible Values:
# Unix domain socket path (max length of 256)
# Default Value:
# Empty String (TCP socket will be used instead of Unix sockets)
UNIX_SOCKET_PATH=
# Description: Fabric Manager Operating Mode
# Possible Values:
# 0 - Start Fabric Manager in Bare metal or Full pass through virtualization mode
# 1 - Start Fabric Manager in Shared NVSwitch multitenancy mode.
# 2 - Start Fabric Manager in vGPU based multitenancy mode.
#FABRIC_MODE=0
FABRIC_MODE=1
# Description: Restart Fabric Manager after exit. Applicable only in Shared NVSwitch or vGPU based multitenancy mode
# Possible Values:
# 0 - Start Fabric Manager and follow full initialization sequence
# 1 - Start Fabric Manager and follow Shared NVSwitch or vGPU based multitenancy mode resiliency/restart sequence.
FABRIC_MODE_RESTART=0
# Description: Specify the filename to be used to save Fabric Manager states.
# Valid only if Shared NVSwitch or vGPU based multitenancy mode is enabled
# Possible Values:
# Full path/filename string (max length of 256)
STATE_FILE_NAME=/tmp/fabricmanager.state
# Description: Network interface to listen for Fabric Manager SDK/API to communicate with running FM instance.
# Possible Values:
# A valid IPv4 address. By default, uses loopback (127.0.0.1) interface
FM_CMD_BIND_INTERFACE=127.0.0.1
# Description: TCP port number for Fabric Manager SDK/API to communicate with running FM instance.
# Possible Values:
# Any value between 0 and 65535
FM_CMD_PORT_NUMBER=6666
# Description: Use Unix sockets instead of TCP Socket for Fabric Manager SDK/API communication
# Possible Values:
# Unix domain socket path (max length of 256)
# Default Value:
# Empty String (TCP socket will be used instead of Unix sockets)
FM_CMD_UNIX_SOCKET_PATH=
# Description: Fabric Manager does not exit when facing failures
# Possible Values:
# 0 – Fabric Manager service will terminate on errors such as, NVSwitch and GPU config failure,
# typical software errors etc.
# 1 – Fabric Manager service will stay running on errors such as, NVSwitch and GPU config failure,
# typical software errors etc. However, the system will be uninitialized and CUDA application
# launch will fail.
FM_STAY_RESIDENT_ON_FAILURES=0
# Description: Degraded Mode options when there is an Access Link Failure (GPU to NVSwitch NVLink failure)
# Possible Values:
# In bare metal or full passthrough virtualization mode
# 0 - Remove the GPU with the Access NVLink failure from NVLink P2P capability
# 1 - Disable the NVSwitch and its peer NVSwitch, which reduces NVLink P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
# In Shared NVSwitch or vGPU based multitenancy mode
# 0 - Disable partitions which are using the Access Link failed GPU
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
# 1 - Disable the NVSwitch and its peer NVSwitch,
# all partitions will be available but with reduced NVLink P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
ACCESS_LINK_FAILURE_MODE=0
# Description: Degraded Mode options when there is a Trunk Link Failure (NVSwitch to NVSwitch NVLink failure)
# Possible Values:
# In bare metal or full passthrough virtualization mode
# 0 - Exit Fabric Manager and leave the system/NVLinks uninitialized
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
# 1 - Disable the NVSwitch and its peer NVSwitch, which reduces NVLink P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
# In Shared NVSwitch or vGPU based multitenancy mode
# 0 - Remove partitions that are using the Trunk NVLinks
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
# 1 - Disable the NVSwitch and its peer NVSwitch,
# all partitions will be available but with reduced NVLink P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
TRUNK_LINK_FAILURE_MODE=0
# Description: Degraded Mode options when there is a NVSwitch failure or an NVSwitch is excluded
# Possible Values:
# In bare metal or full passthrough virtualization mode
# 0 - Abort Fabric Manager
# 1 - Disable the NVSwitch and its peer NVSwitch, which reduces P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
#
# In Shared NVSwitch or vGPU based multitenancy mode
# 0 - Disable partitions that are using the NVSwitch
# This is only effective on DGX A100 and HGX A100 NVSwitch based systems
# 1 - Disable the NVSwitch and its peer NVSwitch,
# all partitions will be available but with reduced NVLink P2P bandwidth
# Note: This is only effective on DGX A100 and HGX A100 NVSwitch based systems
NVSWITCH_FAILURE_MODE=0
# Description: Control running CUDA jobs behavior when Fabric Manager service is stopped or terminated
# Possible Values:
# 0 - Do not abort running CUDA jobs when Fabric Manager exits. However new CUDA job launch will fail.
# 1 - Abort all running CUDA jobs when Fabric Manager exits.
# Note: This is not effective on DGX H100 and HGX H100 NVSwitch based systems
ABORT_CUDA_JOBS_ON_FM_EXIT=1
# Description: Absolute directory path containing Fabric Manager topology files
# Possible Values:
# A valid directory path string (max length of 256)
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch
# Description: Use RPC instead of raw sockets for communication between global and local fabric manager.
# Possible values:
# 0 - Do not use RPCs
# 1 - Use RPC
USE_RPC=1
# Description: Name of the network interface used for communication.
# OPTIONAL - If empty, network interface will be determined by matching bind IP to
# node configuration file. Only necessary to configure if the bind IP
# is present on multiple network interfaces.
# Possible Values:
# Interface names like eth0, ens32 .. etc
# Default Value:
NETWORK_INTERFACE=
# Description: Enable authentication and encryption for RPC communication.
# NOTE: If USE_RPC is 0, this is ignored.
# Possible Values:
# 0: Disable encryption and authentication
# 1: Enable encryption and authentication
# Default value: 0
ENABLE_AUTH_ENCRYPTION=0
# Description: This determines how fabric manager will try to retrieve the keys, certificates, and certificate
# authority for authentication and encryption.
# If ENABLE_AUTH_ENCRYPTION is enabled (1), then AUTH_SOURCE must be configured
# as one of the supported values. An empty or unexpected value will prevent initialization.
# If USE_RPC is 0, this is ignored.
# Possible Values:
# FILE: The provided values are paths to files on the file system.
# ENV_PATH: The provided values are environment variable names to retrieve, and the values in the
# environment variables are treated as paths to files on the file system.
# ENV_VAL: The provided values are environment variable names to retrieve, and the values in the
# environment variables are treated as the actual values for the key/cert/cert auth.
AUTH_SOURCE=
# Description: These fields are interpreted based on how AUTH_SOURCE is configured
SERVER_KEY=
SERVER_CERT=
SERVER_CERT_AUTH=
CLIENT_KEY=
CLIENT_CERT=
CLIENT_CERT_AUTH=
# Description: Override the target hostname for authentication of the certificates and keys. This allows
# certificates with common names that do not match the ip addresses provided for the nodes.
# Example:
# If the certificate has the subject:
# "/C=US/ST=CA/L=Santa Clara/O=NVIDIA/OU=Test/CN=localhost"
# The certificate validation will expect the connection hostname to be "localhost", by
# setting IMEX_SECURITY_TARGET_OVERRIDE=localhost you can override the connection
# hostname for security purposes to be "localhost", allowing the connection to succeed.
# If USE_RPC is 0, this is ignored.
SECURITY_TARGET_OVERRIDE=
# Description: This tunable determines the domain behavior in case the trunk links are down or experience
# fatal errors. When MNNVL_RESILIENCY_MODE is set to 0, in case a single trunk link fails, the
# inter-node traffic between the impacted L1 switch node and the rest of the domain will be
# unavailable. When MNNVL_RESILIENCY_MODE is set to 1, the domain can sustain failures of up to
# (but not including) half of the trunk links on an L1 switch before the inter-node traffic between
# this L1 switch node and the rest of the domain becomes unavailable.
MNNVL_RESILIENCY_MODE=0
# Description: Determine whether a default partition needs to be created
# Possible Values:
# 0 - No partitions are created during GFM initialization. GFM disables routing until an API request
# to create a partition is successful.
# 1 (default) - Creates a default partition during GFM initialization. GFM creates the partition to include
# all GPUs in the topology and enables routing so that all GPUs can communicate with each other
How could I solve this problem? Thanks very much!
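For reference, here is a short script (only a readability aid, not official tooling; the config path is assumed to be the default /usr/share/nvidia/nvswitch/fabricmanager.cfg) that prints just the effective, non-comment key=value settings from the config above:

# Print the active settings from the fabric manager config, skipping comments and blank lines.
CONFIG_PATH = "/usr/share/nvidia/nvswitch/fabricmanager.cfg"  # assumed default location; adjust if different

with open(CONFIG_PATH) as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):
            print(line)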