After a full upgrade (removed everything installed from RPM) the system reports all OK, but CUDA doesn't work (fails at initialisation)
System: latest Rocky 9.5 (5.14.0-503.40.1.el9_5.x86_64) on EPYC 9554
GPU: 8x H100 SXM5
nvidia-smi
Fri May 9 19:52:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:05:00.0 Off |                    0 |
| N/A   29C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:06:00.0 Off |                    0 |
| N/A   27C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:65:00.0 Off |                    0 |
| N/A   27C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:68:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:85:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:86:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:E5:00.0 Off |                    0 |
| N/A   26C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:E8:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:05:00.0                                         |
|        | Device UUID: GPU-7c70f809-bc99-62f0-f868-6af7ceea9808                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:06:00.0                                         |
|        | Device UUID: GPU-984c62a2-ec79-4712-005c-7e0418c4720f                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:65:00.0                                         |
|        | Device UUID: GPU-e8d416e9-12cf-ac2b-fed6-ab7a97d8a40e                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:68:00.0                                         |
|        | Device UUID: GPU-5d6f5b2d-02f0-cc76-ccac-d2a487cbabc3                |
+--------+----------------------------------------------------------------------+
| 4      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:85:00.0                                         |
|        | Device UUID: GPU-73be25f3-4d8b-2b50-973b-068c836f233b                |
+--------+----------------------------------------------------------------------+
| 5      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:86:00.0                                         |
|        | Device UUID: GPU-0c1bfecd-f937-68ee-e1d1-62eeec8cbfab                |
+--------+----------------------------------------------------------------------+
| 6      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:E5:00.0                                         |
|        | Device UUID: GPU-8769b8a9-c4e0-7d37-8925-dd6d1f769542                |
+--------+----------------------------------------------------------------------+
| 7      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:E8:00.0                                         |
|        | Device UUID: GPU-0803a76a-ddeb-7164-5bda-ef6914b76aca                |
+--------+----------------------------------------------------------------------+
4 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 0         |
| 2         |
| 1         |
| 3         |
+-----------+
0 CPUs found.
dcgmi diag -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.9                                          |
| Driver Version Detected   | 575.51.03                                      |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Fail - All                                     |
| Warning                   | GPU 0 Error using CUDA API cudaDeviceGetByPCI  |
|                           | BusId Check DCGM and system logs for errors.   |
|                           | Reset GPU. Restart DCGM. Rerun diagnostics. '  |
|                           | initialization error' for GPU 0, bus ID = 000  |
|                           | 00000:05:00.0                                  |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Fail - All                                     |
| Warning                   | GPU 0 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 1 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 2 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 3 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 4 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 5 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 6 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
| Warning                   | GPU 7 Error using CUDA API cuInit Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics. Unable to initiali  |
|                           | ze CUDA library: 'initialization error'. ; ve  |
|                           | rify that the fabric-manager has been started  |
|                           | if applicable. Please check if a CUDA sample   |
|                           | program can be run successfully on this host   |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
fabric-manager doesn’t report any error:
tail -50 /var/log/fabricmanager.log
[May 09 2025 19:37:23] [INFO] [tid 5829] Append to log file = 1
[May 09 2025 19:37:23] [INFO] [tid 5829] Max Log file size = 1024 (MBs)
[May 09 2025 19:37:23] [INFO] [tid 5829] Use Syslog file = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager communication ports = 16000
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Mode = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Mode Restart = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication bind interface = 127.0.0.1
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication unix domain socket =
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication port number = 6666
[May 09 2025 19:37:23] [INFO] [tid 5829] Continue to run when facing failures = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing GPU to NVSwitch NVLink failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing NVSwitch failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Abort CUDA jobs when FM exits = 1
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager - Subnet Manager IPC socket = unix:/var/run/nvidia-fabricmanager/fm_sm_ipc.socket
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager - Subnet Manager management port GUID =
[May 09 2025 19:37:23] [INFO] [tid 5829] Disabling RPC mode for single node configuration.
[May 09 2025 19:37:23] [INFO] [tid 5829] GFM Wait Timeout = 360 secs
[May 09 2025 19:37:23] [INFO] [tid 5829] NVLink Domain ID : NDI-4AB75A84-010E-4701-9CF1-12C701384014
[May 09 2025 19:37:23] [INFO] [tid 5829] LMDB_LOG: Successfully initialized LMDB
[May 09 2025 19:37:24] [INFO] [tid 5829] Connected to 1 node.
[May 09 2025 19:37:24] [INFO] [tid 5829] Getting fabric node FM version info
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVSwitch device information
[May 09 2025 19:37:24] [INFO] [tid 5829] detected system topology is based on DGX/HGX H100
[May 09 2025 19:37:24] [INFO] [tid 5829] fabric topology file /usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology is parsed.
[May 09 2025 19:37:24] [INFO] [tid 5829] number of devices specified in topology file NVSwitches: 4, GPUs: 8
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVSwitch device information
[May 09 2025 19:37:24] [INFO] [tid 5829] dumping all the detected NVSwitch information for Switch NodeId 0
Index: 00 Physical Id: 0 PCI Bus ID: 00000000:95:00.0 Enabled Link Mask: ffffffff00000000 Arch Type: 3 UUID : SWX-EBAA6F6B-F4AB-B324-1682-74BC9185A012 Num OSFP Cages: 0 TnvlEnabled: 0
Index: 01 Physical Id: 1 PCI Bus ID: 00000000:96:00.0 Enabled Link Mask: ffffffff000000ff Arch Type: 3 UUID : SWX-2166D26A-DF06-410E-AD7E-2B197418A149 Num OSFP Cages: 0 TnvlEnabled: 0
Index: 02 Physical Id: 2 PCI Bus ID: 00000000:97:00.0 Enabled Link Mask: ffffffff000f000f Arch Type: 3 UUID : SWX-9D0B9FFF-97F4-2D22-605B-557486CF2A8D Num OSFP Cages: 0 TnvlEnabled: 0
Index: 03 Physical Id: 3 PCI Bus ID: 00000000:98:00.0 Enabled Link Mask: ffffffff00000000 Arch Type: 3 UUID : SWX-37A41EDB-DAA7-6529-5CC0-86799CBC0A95 Num OSFP Cages: 0 TnvlEnabled: 0
[May 09 2025 19:37:24] [INFO] [tid 5829] number of GPU base board detected: 1
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVLink device information
[May 09 2025 19:37:24] [INFO] [tid 5829] NVLink Inband feature is enabled. Hence Fabric Manager is not opening and operating on GPUs directly.
[May 09 2025 19:37:24] [INFO] [tid 5829] NVLink Autonomous Link Initialization (ALI) feature is enabled.
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/0 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/0 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/1 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/1 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/2 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/2 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/3 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/3 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] Multi node HA error handling disabled
[May 09 2025 19:37:24] [INFO] [tid 5829] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
[May 09 2025 19:37:24] [INFO] [tid 5829] FM starting NvLink Inband messages
[May 09 2025 19:37:24] [INFO] [tid 5829] FM starting NvLink Inband started
but every CUDA application fails at initialisation:
/usr/local/cuda-12.9/extras/demo_suite/bandwidthTest -device all
[CUDA Bandwidth Test] - Starting...
cudaGetDeviceCount returned 3
-> initialization error
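For reference, return code 3 from the runtime API is cudaErrorInitializationError. Since every CUDA program hits it, even a minimal check reproduces the failure; the sketch below (file name and CUDA install path are just examples) exercises both the driver API (cuInit, the call the DCGM diag flags) and the runtime API and prints the error strings:

/* cuda_init_check.c - minimal sketch: report how far CUDA initialisation
 * gets, via both the driver API and the runtime API.
 * Build (CUDA path is an example):
 *   gcc cuda_init_check.c -I/usr/local/cuda-12.9/include \
 *       -L/usr/local/cuda-12.9/lib64 -lcuda -lcudart -o cuda_init_check
 */
#include <stdio.h>
#include <cuda.h>         /* driver API: cuInit, cuGetErrorString */
#include <cuda_runtime.h> /* runtime API: cudaGetDeviceCount */

int main(void) {
    /* Driver API first: this is the call the DCGM diag reports as failing. */
    CUresult dres = cuInit(0);
    const char *dstr = NULL;
    cuGetErrorString(dres, &dstr);
    printf("cuInit(0)          -> %d (%s)\n", (int)dres, dstr ? dstr : "?");

    /* Runtime API: bandwidthTest got 3 (cudaErrorInitializationError) here. */
    int n = 0;
    cudaError_t rres = cudaGetDeviceCount(&n);
    printf("cudaGetDeviceCount -> %d (%s), devices = %d\n",
           (int)rres, cudaGetErrorString(rres), n);
    return 0;
}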
I even tried reinstalling CUDA 12.8.1 (with driver 570.133.20), but the result is exactly the same: CUDA doesn't initialize.
The hardware is fine: CUDA works with the proprietary kernel module, but that one lacks the HMM support I need (HMM requires the open kernel modules).
Any idea where to dig/what to tweak?