Fabric Manager Installation

Hello, I am trying to install Fabric Manager, but the service fails to start. I have a fresh Google Cloud Ubuntu VM instance with 4 A100 GPUs, and I want to take advantage of the NVSwitch features for multi-GPU workloads.

Machine: Ubuntu 18.04.6 LTS
Fabric Manager version: 450.80.02

$ sudo systemctl start nvidia-fabricmanager.service

Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.
$ sudo systemctl status nvidia-fabricmanager.service

-- Logs begin at Fri 2021-09-24 12:08:13 UTC, end at Fri 2021-09-24 14:57:24 UTC. --
Sep 24 14:46:43 gpu4 systemd[1]: Starting NVIDIA fabric manager service...
Sep 24 14:46:43 gpu4 nv-fabricmanager[2707]: request to query NVSwitch device information from NVSwitch driver failed with
Sep 24 14:46:43 gpu4 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Sep 24 14:46:43 gpu4 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 24 14:46:43 gpu4 systemd[1]: Failed to start NVIDIA fabric manager service.
Sep 24 14:49:52 gpu4 systemd[1]: Starting NVIDIA fabric manager service...
Sep 24 14:49:52 gpu4 nv-fabricmanager[2746]: request to query NVSwitch device information from NVSwitch driver failed with
Sep 24 14:49:52 gpu4 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Sep 24 14:49:52 gpu4 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 24 14:49:52 gpu4 systemd[1]: Failed to start NVIDIA fabric manager service.
$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:00:05.0 Off |                    0 |
| N/A   33C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   33C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ cat /var/log/fabricmanager.log

Fabric Manager Log initializing at: 9/24/2021 15:05:09.601
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Fabric Manager version 450.142.00 is running with the following configration options
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Logging level = 4
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Logging file name/path = /var/log/fabricmanager.log
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Append to log file = 1
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Max Log file size = 1024 (MBs)
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Use Syslog file = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Fabric Manager communication ports = 16000
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Shared Fabric Mode Status = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Shared Fabric Mode Restart = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication bind interface = 127.0.0.1
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication unix domain socket = 
[Sep 24 2021 15:05:09] [INFO] [tid 2870] FM Library communication port number = 6666
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Continue to run when facing failures = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing GPU to NVSwitch NVLink failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Option when facing NVSwitch failure = 0
[Sep 24 2021 15:05:09] [INFO] [tid 2870] Abort CUDA jobs when FM exits = 1
[Sep 24 2021 15:05:09] [ERROR] [tid 2870] request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]

Have you resolved this problem?

You should keep the Fabric Manager version the same as the Driver version.
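A rough way to check this, as a sketch only: the helper name versions_match is made up for illustration, and the package name pattern nvidia-fabricmanager-* and the "-1" style suffix on the package version are assumptions about how the Ubuntu package is named, so adjust them for your system. The driver version is read with nvidia-smi and the Fabric Manager version with dpkg-query.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tool):
# prints "match" when both version strings are identical.
versions_match() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: driver=$1 fabricmanager=$2"
    fi
}

# Query the real system only when both tools are present.
if command -v nvidia-smi >/dev/null 2>&1 && command -v dpkg-query >/dev/null 2>&1; then
    # Driver version as reported by the loaded driver (first GPU only).
    driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # Assumption: the package is named like nvidia-fabricmanager-450 and its
    # version looks like 450.142.00-1; cut drops the packaging suffix.
    fm_ver=$(dpkg-query -W -f='${Version}\n' 'nvidia-fabricmanager-*' 2>/dev/null | head -n 1 | cut -d- -f1)
    versions_match "$driver_ver" "$fm_ver"
fi
```

With the numbers from the first post, versions_match 450.142.00 450.80.02 would report a mismatch, while the log itself shows Fabric Manager 450.142.00, so it is worth confirming which version is actually installed.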

I have the same problem as above, and my driver version is the same as the Fabric Manager version.