Problem starting fabricmanager in Ubuntu 20.04 LTS

Hi,

I have an issue that is very similar to some others on this forum, but I could not find a solution there.
When starting fabricmanager on Ubuntu 20.04, this is the error that I get:

● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-06-01 12:54:16 CEST; 1h 55min left
    Process: 2965 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/F>

jun 01 12:54:09 monster systemd[1]: Starting NVIDIA fabric manager service...
jun 01 12:54:16 monster nv-fabricmanager[2971]: request to query NVSwitch device information from NVSwitch driver failed with >
jun 01 12:54:16 monster systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
jun 01 12:54:16 monster systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
jun 01 12:54:16 monster systemd[1]: Failed to start NVIDIA fabric manager service.

This is the output from nvidia-smi:

Thu Jun  1 11:10:37 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000                On | 00000000:17:00.0 Off |                  Off |
| 41%   41C    P8               13W / 140W|    420MiB / 16376MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           On | 00000000:31:00.0 Off |                    0 |
| N/A   50C    P0               60W / 300W|      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           On | 00000000:4B:00.0 Off |                    0 |
| N/A   48C    P0               58W / 300W|      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe           On | 00000000:B1:00.0 Off |                    0 |
| N/A   49C    P0               60W / 300W|      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80GB PCIe           On | 00000000:CA:00.0 Off |                    0 |
| N/A   49C    P0               59W / 300W|      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2794      G   /usr/lib/xorg/Xorg                          128MiB |
|    0   N/A  N/A      3190      G   /usr/bin/gnome-shell                         90MiB |
|    0   N/A  N/A      4244      G   /usr/lib/firefox/firefox                    199MiB |
|    1   N/A  N/A      2794      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      2794      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      2794      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      2794      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Can you please advise how to solve this?

I have matching versions of nvidia-driver and cuda-drivers-fabricmanager installed (530.30.02). SR-IOV is also enabled in the BIOS.

Fabric Manager isn’t applicable to your system or to any of the GPUs in it. It is intended for HGX platforms that include SXM GPUs with NVSwitch connectivity. It is not applicable to PCIe GPUs.
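
As a rough way to check the SXM-versus-PCIe point above, the sketch below lists the GPU product names reported by nvidia-smi and keeps only those containing "SXM". The helper names gpu_names and sxm_gpus are hypothetical, and the substring match is only a heuristic: an SXM name does not by itself prove that an NVSwitch is present, while an empty result suggests Fabric Manager has nothing to manage.

```python
import subprocess


def gpu_names():
    """Return the GPU product names reported by nvidia-smi (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def sxm_gpus(names):
    """Heuristic (an assumption, not an official check): keep names that contain 'SXM'."""
    return [name for name in names if "SXM" in name.upper()]
```

On the machine from the first post, sxm_gpus(gpu_names()) would be expected to return an empty list, since the nvidia-smi table shows only "NVIDIA RTX A4000" and "NVIDIA A100 80GB PCIe".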


Many thanks, Robert, for your response.

I understand. However, we have cards linked with NVLink bridges like this:

Do you have any suggestions on how to utilise the NVSwitch connectivity in Ubuntu with 4x A100 GPUs?

Furthermore, nvidia-smi topo -m gives the following output:

Fabric Manager isn’t applicable to your system. There is no NVSwitch in your system.
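
To see which GPU pairs are actually joined by NVLink bridges, one option is to scan the matrix printed by nvidia-smi topo -m for NV<n> entries, which the tool's legend describes as a connection over a bonded set of NVLinks. The following is a minimal sketch under that assumption; the function name nvlink_pairs is hypothetical, and it expects the usual layout in which the GPU columns come first in the header row.

```python
import re

GPU_RE = re.compile(r"^GPU\d+$")
NVLINK_RE = re.compile(r"^NV\d+$")


def nvlink_pairs(topo_text):
    """Return sorted (gpu_a, gpu_b) pairs whose matrix cell is an NV<n> entry."""
    columns = []
    pairs = set()
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_RE.match(tokens[0]):
            continue  # skip blank lines, the legend, and non-GPU rows
        if not columns:
            # The first GPU line is the header: take its leading GPU<n> tokens.
            for tok in tokens:
                if not GPU_RE.match(tok):
                    break
                columns.append(tok)
            continue
        row, cells = tokens[0], tokens[1:1 + len(columns)]
        for col, cell in zip(columns, cells):
            if NVLINK_RE.match(cell):
                pairs.add(tuple(sorted((row, col))))
    return sorted(pairs)
```

Feeding it the text captured from nvidia-smi topo -m returns the bridged pairs. An empty list would mean no NVLink connection is reported between any two GPUs.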

Thanks!
So there is no need to install it on Ubuntu 20 with a PCIe GTX 1070?
But how do I solve this problem?

jag@Aigen:~/codes/cuda-samples-master/bin/x86_64/linux/release$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

cudaGetDeviceProperties returned 802
-> system not yet initialized
CUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)"

It’s a broken CUDA install. I don’t know exactly how it is broken. Instructions for installing CUDA are available, along with numerous forum questions about it. The usual fix I can think of is to clean out all old CUDA components everywhere on your machine and reinstall CUDA. If all else fails, reinstall the OS and then reinstall CUDA.