Hello, I am trying to install Fabric Manager, but the service fails to start. I have a fresh Google Cloud Ubuntu VM instance with 4 A100 GPUs, and I want to take advantage of the NVSwitch features for multi-GPU workloads.
Machine: Ubuntu 18.04.6 LTS
Fabric Manager version: 450.80.02
$ sudo systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.
$ sudo systemctl status nvidia-fabricmanager.service
-- Logs begin at Fri 2021-09-24 12:08:13 UTC, end at Fri 2021-09-24 14:57:24 UTC. --
Sep 24 14:46:43 gpu4 systemd[1]: Starting NVIDIA fabric manager service...
Sep 24 14:46:43 gpu4 nv-fabricmanager[2707]: request to query NVSwitch device information from NVSwitch driver failed with
Sep 24 14:46:43 gpu4 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Sep 24 14:46:43 gpu4 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 24 14:46:43 gpu4 systemd[1]: Failed to start NVIDIA fabric manager service.
Sep 24 14:49:52 gpu4 systemd[1]: Starting NVIDIA fabric manager service...
Sep 24 14:49:52 gpu4 nv-fabricmanager[2746]: request to query NVSwitch device information from NVSwitch driver failed with
Sep 24 14:49:52 gpu4 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Sep 24 14:49:52 gpu4 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 24 14:49:52 gpu4 systemd[1]: Failed to start NVIDIA fabric manager service.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:00:05.0 Off |                    0 |
| N/A   33C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   33C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ cat /var/log/fabricmanager.log
Fabric Manager Log initializing at: 9/24/2021 15:05:09.601
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Fabric Manager version 450.142.00 is running with the following configration options
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Logging level = 4
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Logging file name/path = /var/log/fabricmanager.log
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Append to log file = 1
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Max Log file size = 1024 (MBs)
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Use Syslog file = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Fabric Manager communication ports = 16000
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Shared Fabric Mode Status = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Shared Fabric Mode Restart = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication bind interface = 127.0.0.1
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication unix domain socket =
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication port number = 6666
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Continue to run when facing failures = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing GPU to NVSwitch NVLink failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing NVSwitch failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Abort CUDA jobs when FM exits = 1
[Sep 24 2021 15:05:09] [ERROR] [tid 2870] request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [**NV_WARN_NOTHING_TO_DO**]