Hi,
I have an issue that is very similar to some others here on the forum, but I could not find a solution.
When starting Fabric Manager on Ubuntu 20.04, this is the error that I get:
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2023-06-01 12:54:16 CEST; 1h 55min left
Process: 2965 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/F>
jun 01 12:54:09 monster systemd[1]: Starting NVIDIA fabric manager service...
jun 01 12:54:16 monster nv-fabricmanager[2971]: request to query NVSwitch device information from NVSwitch driver failed with >
jun 01 12:54:16 monster systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
jun 01 12:54:16 monster systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
jun 01 12:54:16 monster systemd[1]: Failed to start NVIDIA fabric manager service.
This is the output from nvidia-smi:
Thu Jun 1 11:10:37 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:17:00.0 Off | Off |
| 41% 41C P8 13W / 140W| 420MiB / 16376MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:31:00.0 Off | 0 |
| N/A 50C P0 60W / 300W| 5MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe On | 00000000:4B:00.0 Off | 0 |
| N/A 48C P0 58W / 300W| 5MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe On | 00000000:B1:00.0 Off | 0 |
| N/A 49C P0 60W / 300W| 5MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100 80GB PCIe On | 00000000:CA:00.0 Off | 0 |
| N/A 49C P0 59W / 300W| 5MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2794 G /usr/lib/xorg/Xorg 128MiB |
| 0 N/A N/A 3190 G /usr/bin/gnome-shell 90MiB |
| 0 N/A N/A 4244 G /usr/lib/firefox/firefox 199MiB |
| 1 N/A N/A 2794 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2794 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2794 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 2794 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Can you please advise how to solve this?
I have matching versions of nvidia-driver and cuda-drivers-fabricmanager installed (530.30.02). I also have SR-IOV enabled in the BIOS.
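For anyone who wants to double-check the version match, here is a minimal sketch of how it could be done. It assumes nvidia-smi is on the PATH and that the package is installed through dpkg under the name cuda-drivers-fabricmanager; the helper names versions_match and strip_revision are made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare the driver version with the Fabric Manager package version.
# Assumption: the package name is cuda-drivers-fabricmanager, as in the post.

# Print "match" if both version strings are equal, "mismatch" otherwise.
versions_match() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Strip a Debian epoch and revision, e.g. "1:530.30.02-1" -> "530.30.02".
strip_revision() {
    echo "$1" | sed -e 's/^[0-9]*://' -e 's/-[^-]*$//'
}

# Usage on the affected machine (commented out so the helpers can be sourced alone):
# driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# fm=$(strip_revision "$(dpkg-query -W -f='${Version}' cuda-drivers-fabricmanager)")
# versions_match "$driver" "$fm"
```

This only confirms that the two version strings agree; it says nothing about whether the NVSwitch driver query in the log above should succeed on this hardware.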