I run into the same strange error after every monthly update.
The OS is: Linux gpu4 4.18.0-553.5.1.el8_10.x86_64
Hardware configuration: 8x NVIDIA H100 80GB HBM3
When I start a deep learning training job, PyTorch cannot find the GPUs and reports this error:
python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
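The warning and the final False above are what a plain CUDA availability check prints. As a minimal sketch for reproducing it outside the training script, assuming only that PyTorch is importable in the same environment (the helper name cuda_status is hypothetical, not part of PyTorch):

```python
def cuda_status():
    """Return (available, device_count) as seen by PyTorch.

    Returns None if PyTorch cannot be imported in this environment.
    """
    try:
        import torch
    except ImportError:
        return None
    # is_available() is the call that emits the cudaGetDeviceCount() warning
    # when CUDA initialization fails.
    available = torch.cuda.is_available()
    # Only ask for the count when CUDA initialized successfully.
    count = torch.cuda.device_count() if available else 0
    return available, count


if __name__ == "__main__":
    print(cuda_status())
```

Running this right after the update, before any other CUDA code, should show whether the failure is independent of the training script.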
This error is often related to the NVIDIA Fabric Manager (nv-fabricmanager). However, the Fabric Manager is installed and running.
This is the output of the command "journalctl -u nvidia-fabricmanager":
Jun 09 17:18:42 gpu4 nv-fabricmanager[85131]: Connected to 1 node.
Jun 09 17:18:42 gpu4 nv-fabricmanager[85131]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
Jun 09 17:18:42 gpu4 systemd[1]: Started NVIDIA fabric manager service.
The output of nvidia-smi also looks correct:
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
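To capture the same state in one place after each update, something like the following could be used. It is only a sketch: it assumes systemctl and nvidia-smi are on the PATH and that the service unit is named nvidia-fabricmanager, as in the journalctl command above. The helper names run and collect_state are hypothetical.

```python
import shutil
import subprocess


def run(cmd):
    """Run a command and return its stripped stdout, or None if it cannot be run."""
    if shutil.which(cmd[0]) is None:
        return None  # tool not installed on this machine
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, check=False, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout.strip()


def collect_state():
    """Collect the fabric manager service state and the per-GPU driver version."""
    return {
        # Prints e.g. "active" or "inactive" for the systemd unit.
        "fabricmanager_active": run(
            ["systemctl", "is-active", "nvidia-fabricmanager"]
        ),
        # One line per GPU with the driver version reported by nvidia-smi.
        "driver_versions": run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
    }


if __name__ == "__main__":
    for key, value in collect_state().items():
        print(f"{key}: {value}")
```

Saving this output before and after an update makes it easier to see what changed between a working and a failing state.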