nvidia-fabricmanager fails to start when the ib_umad module is not loaded

Hello,

I’m currently facing an issue with nvidia-fabricmanager and would like to ask for some guidance.

I’m using a system equipped with an HGX-B200 GPU board and ConnectX-7, with the following software versions:

  • OS: Rocky Linux 9.6
  • Kernel: 5.14.0
  • NVIDIA Driver: 580.105.08
  • NVIDIA Fabric Manager: 580.105.08
  • NVIDIA NSCQ: 580.105.08
  • DOCA: DOCA Host (doca-all) 3.2.0 LTS

The problem is that nvidia-fabricmanager fails to start because the ib_umad module is not loaded automatically.
If I load the ib_umad module manually, Fabric Manager starts and works normally.

I would like to know what might be causing this issue, or if there are any known configuration requirements for automatic module loading.

In addition, I’d like to confirm whether the nvlsm package is mandatory when using Fabric Manager on systems with HGX-B200.
On our previous system equipped with HGX-H200, we did not experience this issue.

Any insight or recommendations would be greatly appreciated.
Thank you!


I recently set up a B200 cluster and had to install DOCA before the NVIDIA driver to get ib_umad to load automatically. If you install DOCA after the driver, you'll have to load the module manually. I also had to install the nvlsm package, since it's a requirement for B200/B300.

On DGX-B200/B300, NVIDIA HGX-B200/B300, NVIDIA HGX-B100 systems and later, the FM package needs an additional NVLSM dependency to get the SM package for proper operation. The FM service unit file is also updated to start the NVLSM process if applicable. In this case, the FM systemd service status indicates the process status for FM and NVLSM, and operations such as systemd start, stop, and so on will operate on both processes.
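
For the ib_umad part of the question, a quick way to confirm whether the module is actually loaded before starting Fabric Manager is to look it up in /proc/modules. The following is only a minimal sketch: the helper names `loaded_modules` and `is_module_loaded` are made up for illustration, and it assumes ib_umad is built as a loadable module (a module compiled into the kernel would not show up in /proc/modules).

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return the module names found in /proc/modules-formatted text.

    Each non-empty line starts with the module name, followed by size,
    reference count, dependencies, state, and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def is_module_loaded(name: str, proc_modules: str = "/proc/modules") -> bool:
    """Check whether a kernel module is listed in the given modules file.

    Returns False if the file does not exist (for example, on a
    non-Linux system).
    """
    path = Path(proc_modules)
    if not path.exists():
        return False
    return name in loaded_modules(path.read_text())


if __name__ == "__main__":
    # Hypothetical usage: report status and suggest the manual workaround.
    if is_module_loaded("ib_umad"):
        print("ib_umad is loaded")
    else:
        print("ib_umad is NOT loaded; try: modprobe ib_umad")
```

If the module turns out to be missing after every reboot, the usual systemd mechanism for loading it at boot is a one-line file under /etc/modules-load.d/ (for example, a file named ib_umad.conf containing just `ib_umad`). Whether that is the right fix for this particular DOCA/driver install order is an assumption on my part, so treat it as a workaround rather than the root cause.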