CUDA can't initialize after upgrade

After a full upgrade (removed everything, then installed from RPM), the system reports everything OK, but the CUDA interface doesn't work (it fails at initialization).
System: latest Rocky 9.5 (5.14.0-503.40.1.el9_5.x86_64) on EPYC 9554
GPU: 8x H100 SXM5
nvidia-smi
Fri May 9 19:52:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:05:00.0 Off | 0 |
| N/A 29C P0 71W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:65:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:68:00.0 Off | 0 |
| N/A 24C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:85:00.0 Off | 0 |
| N/A 24C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:86:00.0 Off | 0 |
| N/A 26C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:E5:00.0 Off | 0 |
| N/A 26C P0 70W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:E8:00.0 Off | 0 |
| N/A 26C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:05:00.0 |
| | Device UUID: GPU-7c70f809-bc99-62f0-f868-6af7ceea9808 |
+--------+----------------------------------------------------------------------+
| 1 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:06:00.0 |
| | Device UUID: GPU-984c62a2-ec79-4712-005c-7e0418c4720f |
+--------+----------------------------------------------------------------------+
| 2 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:65:00.0 |
| | Device UUID: GPU-e8d416e9-12cf-ac2b-fed6-ab7a97d8a40e |
+--------+----------------------------------------------------------------------+
| 3 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:68:00.0 |
| | Device UUID: GPU-5d6f5b2d-02f0-cc76-ccac-d2a487cbabc3 |
+--------+----------------------------------------------------------------------+
| 4 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:85:00.0 |
| | Device UUID: GPU-73be25f3-4d8b-2b50-973b-068c836f233b |
+--------+----------------------------------------------------------------------+
| 5 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:86:00.0 |
| | Device UUID: GPU-0c1bfecd-f937-68ee-e1d1-62eeec8cbfab |
+--------+----------------------------------------------------------------------+
| 6 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:E5:00.0 |
| | Device UUID: GPU-8769b8a9-c4e0-7d37-8925-dd6d1f769542 |
+--------+----------------------------------------------------------------------+
| 7 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:E8:00.0 |
| | Device UUID: GPU-0803a76a-ddeb-7164-5bda-ef6914b76aca |
+--------+----------------------------------------------------------------------+
4 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 0 |
| 2 |
| 1 |
| 3 |
+-----------+
0 CPUs found.
dcgmi diag -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.3.9 |
| Driver Version Detected | 575.51.03 |
| GPU Device IDs Detected | 2330,2330,2330,2330,2330,2330,2330,2330 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| PCIe | Fail - All |
| Warning | GPU 0 Error using CUDA API cudaDeviceGetByPCI |
| | BusId Check DCGM and system logs for errors. |
| | Reset GPU. Restart DCGM. Rerun diagnostics. ’ |
| | initialization error’ for GPU 0, bus ID = 000 |
| | 00000:05:00.0 |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Fail - All |
| Warning | GPU 0 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 1 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 2 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 3 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 4 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 5 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 6 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| Warning | GPU 7 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
+----- Stress ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+

fabric-manager doesn't report any errors:
tail -50 /var/log/fabricmanager.log
[May 09 2025 19:37:23] [INFO] [tid 5829] Append to log file = 1
[May 09 2025 19:37:23] [INFO] [tid 5829] Max Log file size = 1024 (MBs)
[May 09 2025 19:37:23] [INFO] [tid 5829] Use Syslog file = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager communication ports = 16000
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Mode = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Mode Restart = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication bind interface = 127.0.0.1
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication unix domain socket =
[May 09 2025 19:37:23] [INFO] [tid 5829] FM Library communication port number = 6666
[May 09 2025 19:37:23] [INFO] [tid 5829] Continue to run when facing failures = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing GPU to NVSwitch NVLink failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Option when facing NVSwitch failure = 0
[May 09 2025 19:37:23] [INFO] [tid 5829] Abort CUDA jobs when FM exits = 1
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager - Subnet Manager IPC socket = unix:/var/run/nvidia-fabricmanager/fm_sm_ipc.socket
[May 09 2025 19:37:23] [INFO] [tid 5829] Fabric Manager - Subnet Manager management port GUID =
[May 09 2025 19:37:23] [INFO] [tid 5829] Disabling RPC mode for single node configuration.
[May 09 2025 19:37:23] [INFO] [tid 5829] GFM Wait Timeout = 360 secs
[May 09 2025 19:37:23] [INFO] [tid 5829] NVLink Domain ID : NDI-4AB75A84-010E-4701-9CF1-12C701384014
[May 09 2025 19:37:23] [INFO] [tid 5829] LMDB_LOG: Successfully initialized LMDB
[May 09 2025 19:37:24] [INFO] [tid 5829] Connected to 1 node.

[May 09 2025 19:37:24] [INFO] [tid 5829] Getting fabric node FM version info
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVSwitch device information
[May 09 2025 19:37:24] [INFO] [tid 5829] detected system topology is based on DGX/HGX H100
[May 09 2025 19:37:24] [INFO] [tid 5829] fabric topology file /usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology is parsed.
[May 09 2025 19:37:24] [INFO] [tid 5829] number of devices specified in topology file NVSwitches: 4, GPUs: 8
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVSwitch device information
[May 09 2025 19:37:24] [INFO] [tid 5829] dumping all the detected NVSwitch information for Switch NodeId 0
Index: 00 Physical Id: 0 PCI Bus ID: 00000000:95:00.0 Enabled Link Mask: ffffffff00000000 Arch Type: 3 UUID : SWX-EBAA6F6B-F4AB-B324-1682-74BC9185A012 Num OSFP Cages: 0 TnvlEnabled: 0
Index: 01 Physical Id: 1 PCI Bus ID: 00000000:96:00.0 Enabled Link Mask: ffffffff000000ff Arch Type: 3 UUID : SWX-2166D26A-DF06-410E-AD7E-2B197418A149 Num OSFP Cages: 0 TnvlEnabled: 0
Index: 02 Physical Id: 2 PCI Bus ID: 00000000:97:00.0 Enabled Link Mask: ffffffff000f000f Arch Type: 3 UUID : SWX-9D0B9FFF-97F4-2D22-605B-557486CF2A8D Num OSFP Cages: 0 TnvlEnabled: 0
Index: 03 Physical Id: 3 PCI Bus ID: 00000000:98:00.0 Enabled Link Mask: ffffffff00000000 Arch Type: 3 UUID : SWX-37A41EDB-DAA7-6529-5CC0-86799CBC0A95 Num OSFP Cages: 0 TnvlEnabled: 0

[May 09 2025 19:37:24] [INFO] [tid 5829] number of GPU base board detected: 1
[May 09 2025 19:37:24] [INFO] [tid 5829] getting NVLink device information
[May 09 2025 19:37:24] [INFO] [tid 5829] NVLink Inband feature is enabled. Hence Fabric Manager is not opening and operating on GPUs directly.
[May 09 2025 19:37:24] [INFO] [tid 5829] NVLink Autonomous Link Initialization (ALI) feature is enabled.
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/0 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/0 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/1 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/1 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/2 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/2 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] start NVSwitch 0/3 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] completed NVSwitch 0/3 routing configuration
[May 09 2025 19:37:24] [INFO] [tid 5829] Multi node HA error handling disabled
[May 09 2025 19:37:24] [INFO] [tid 5829] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
[May 09 2025 19:37:24] [INFO] [tid 5829] FM starting NvLink Inband messages
[May 09 2025 19:37:24] [INFO] [tid 5829] FM starting NvLink Inband started

but any CUDA application fails at initialization:
/usr/local/cuda-12.9/extras/demo_suite/bandwidthTest -device all
[CUDA Bandwidth Test] - Starting…
cudaGetDeviceCount returned 3
→ initialization error
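
For reference, the DCGM diag already points at cuInit itself, so the failure can be isolated from the CUDA runtime with a bare driver-API check along these lines (a minimal sketch built against libcuda, not one of the toolkit samples; the file name is arbitrary):

// cuinit_check.c - minimal sketch, build with e.g.: cc cuinit_check.c -lcuda
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        const char *name = NULL, *msg = NULL;
        cuGetErrorName(rc, &name);      // usable even when cuInit fails
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "cuInit failed: %s (%s)\n",
                name ? name : "?", msg ? msg : "?");
        return 1;
    }
    int count = 0;
    rc = cuDeviceGetCount(&count);
    printf("cuInit OK, %d device(s) visible\n", count);
    return (rc == CUDA_SUCCESS) ? 0 : 1;
}

If this also fails at cuInit, the problem sits below the toolkit, in the driver / kernel modules, rather than in the CUDA 12.9 userspace.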

I even tried reinstalling CUDA 12.8.1 (with driver 570.133.20), but the result is exactly the same: CUDA doesn't initialize.

The hardware is OK, and CUDA works with the proprietary driver (but without HMM support, which I need).
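
To show what the HMM part means in practice (this is just the assumed usage pattern, not the actual project code): with HMM a kernel can dereference plain malloc()'d host memory directly, roughly like this sketch:

// hmm_sketch.cu - illustrative only, build with: nvcc hmm_sketch.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2;               // direct access to system-allocated memory
}

int main() {
    const int n = 1 << 20;
    int *data = static_cast<int *>(malloc(n * sizeof(int)));  // no cudaMalloc/cudaMallocManaged
    for (int i = 0; i < n; ++i) data[i] = i;

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("data[42] = %d\n", data[42]);  // expect 84 when HMM is working
    free(data);
    return 0;
}

As far as I know this only works with the open kernel modules, which is why I can't simply stay on the proprietary driver.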

Any idea where to dig/what to tweak?

I’m running into similar issues with Blackwell. Did you end up figuring this out?

No, to me it looks like a bug in the open drivers (with 550.127.08 it works, but unfortunately that version has a serious security bug). In the meantime I switched to the proprietary driver, but the project that requires HMM is on hold.
I still hope that NVIDIA will fix this…