Nvidia fabric manager initializing CUDA H100

jprieto2 · June 9, 2024, 9:28pm

I hit a strange error after every monthly update.

The OS is: Linux gpu4 4.18.0-553.5.1.el8_10.x86_64
Hardware configuration: 8 NVIDIA H100 80GB HBM3

When I start a deep learning training run, PyTorch cannot find the GPUs and reports this error:

python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
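The warning above appears to come from `torch.cuda.is_available()`, which then returns `False`. A minimal sketch for reproducing the check outside a training job, assuming PyTorch may or may not be installed in the current environment; the helper name `cuda_status` is hypothetical and not part of PyTorch:

```python
import importlib.util


def cuda_status():
    """Report whether PyTorch is importable and whether it can initialize CUDA.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    status = {"torch_installed": False, "cuda_available": False}
    if importlib.util.find_spec("torch") is None:
        return status  # PyTorch is not installed in this environment

    import torch

    status["torch_installed"] = True
    # On the affected machine this call is expected to print the
    # "Error 802: system not yet initialized" warning and return False.
    status["cuda_available"] = torch.cuda.is_available()
    return status


if __name__ == "__main__":
    print(cuda_status())
```

Running this right after the update and again after restarting the fabric manager service would show whether the failure is tied to the training script or to the system state.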

This error is often related to the fabric manager (nv-fabricmanager). However, the fabric manager is installed and running.
This is the output of the command "journalctl -u nvidia-fabricmanager":

Jun 09 17:18:42 gpu4 nv-fabricmanager[85131]: Connected to 1 node.
Jun 09 17:18:42 gpu4 nv-fabricmanager[85131]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
Jun 09 17:18:42 gpu4 systemd[1]: Started NVIDIA fabric manager service.

The output of nvidia-smi also looks correct:
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
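NVIDIA's Fabric Manager documentation requires the fabric manager version to match the driver version, so one thing worth comparing after an update is the driver version in this header against the installed fabric manager package version. The sketch below only parses the header line and compares two version strings. It is not a confirmed diagnosis for this thread, the function names are hypothetical, and the fabric manager version would have to be obtained separately (for example from the package manager):

```python
import re

# Matches the nvidia-smi banner line, tolerating variable spacing.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(line):
    """Return the nvidia-smi, driver, and CUDA versions, or None if no match."""
    match = HEADER_RE.search(line)
    return match.groupdict() if match else None


def same_version(driver_version, fabric_manager_version):
    """True when both version strings are identical after trimming whitespace."""
    return driver_version.strip() == fabric_manager_version.strip()


if __name__ == "__main__":
    header = "| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |"
    info = parse_smi_header(header)
    print(info["driver"])  # 555.42.02
    # The second argument is a placeholder for the installed
    # fabric manager version, which is not given in the post.
    print(same_version(info["driver"], "555.42.02"))  # True
```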

lovanto · July 4, 2024, 11:09pm

Hello jprieto2! Did you manage to solve the issue?
