ConnectX-6 Driver Install: "Module not found: ib_umad"

I have a “Mellanox ConnectX-6 Dx” NIC and I’d like some help installing and configuring the drivers, specifically for using RDMA. I’m targeting Rocky 8.10 with kernel 4.18.0-553.72.1.el8_10.x86_64. I’ve read through a lot of the documentation on various software versions, and I believe the doca-networking profile has the software I need, see Which Profile to Install?. I followed the documentation for DOCA-Host Installation but got stuck when attempting to load the drivers:

[root@host_name ~]# /etc/init.d/openibd restart
mlx_compat is used by NVME. Leaving it loaded.             [WARNING]
Detected driver update. To load the new driver version reboot is required. [WARNING]
Unloading HCA driver:                                      [  OK  ]
mlx_compat is used by NVME. Leaving it loaded.             [WARNING]
Detected driver update. To load the new driver version reboot is required. [WARNING]
Unloading HCA driver:                                      [  OK  ]
Avoid loading inbox module: mlx5_ib                        [FAILED]
Loading Mellanox MLX5_IB HCA driver:                       [FAILED]
Module not found: ib_umad                                  [FAILED]
Module not found: ib_uverbs                                [FAILED]
Module not found: ib_ipoib                                 [FAILED]
No HCA kernel modules loaded:                              [FAILED]
Loading HCA driver and Access Layer:                       [FAILED]

Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

What are these error messages telling me? What have I done wrong? Here’s what I’ve attempted to check:

  1. The card is correctly powered and detected on the PCI bus:
[root@host_name ~]# lspci | grep "Mell"
8c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
8c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
  2. The card’s firmware was automatically updated when I installed mlnx-fw-updater the first time. It seemed successful and normal.
  3. doca-all, all required dnf repos, and all dependencies are the newest.
  4. My distro and kernel are supported as primary for doca-all, see Supported Host OS.
  5. I’ve attempted to uninstall and reinstall everything once already. Perhaps the uninstall scripts leave a misconfigured breadcrumb behind?
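
For anyone hitting the same message, one more check that might narrow things down is whether the modules named in the "Module not found" lines exist on disk for the running kernel at all. The following is only a sketch under assumptions: the function names and the openibd.log file name are made up for illustration, and it assumes module files are named <module>.ko, optionally with a compression suffix.

```shell
#!/usr/bin/env bash
# Sketch: list the modules openibd reported as missing and check whether a
# matching module file exists for a given kernel. Function names are illustrative.

# Print the module names from "Module not found: <name>" lines read on stdin.
missing_modules() {
    sed -n 's/^Module not found: \([A-Za-z0-9_]*\).*/\1/p'
}

# Return 0 if a file named <module>.ko* exists under <root>/<kernel>.
# kernel defaults to the running kernel, root defaults to /lib/modules.
module_on_disk() {
    local module="$1" kernel="${2:-$(uname -r)}" root="${3:-/lib/modules}"
    find "$root/$kernel" -name "${module}.ko*" 2>/dev/null | grep -q .
}

# Example usage (openibd.log is a hypothetical capture of the output above):
#   missing_modules < openibd.log | while read -r m; do
#       if module_on_disk "$m"; then
#           echo "$m: present"
#       else
#           echo "$m: no module file for $(uname -r)"
#       fi
#   done
```

If the files are absent for the running kernel, that would suggest the installed driver packages were built for a different kernel or were only partly installed, though this check alone cannot say which.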

I’m running out of ideas of what else to check, so I’ll take any suggestions or questions, pretty please.

Hi @ZeHolyQofPower ,

Hope you are doing well!

Based on the existing logs, and since I’m not aware of your previous operations on this system, I can’t precisely identify the root cause of the issue.

Here is some advice you can try:

(1) Were there any error messages during the installation of DOCA? If the answer is “YES”, please investigate those error messages.

or

(2) Reinstall your Rocky OS and then try to install doca-network/doca-all on a “clean” OS:

or

(3) Uninstall all software related to OFED or DOCA and then reinstall doca-network/doca-all again:

Best regards.


Thank you for your time and idea Mr. @xidongs, but I did find a better solution, so I’m doing great!

To answer “what did I do to this thing previously?”: I confess I have no idea what I did or what I’m doing now. I was surprised when the drivers didn’t work OOTB after flashing the .iso. My very first dnf upgrade uninstalled the original DOCA driver and reinstalled the newest version automatically. Nothing I was monitoring showed any errors or hinted that anything was wrong.

So how did I fix this?
I decided to roll back to the last LTS MLNX_OFED driver, download the source, and manually compile it targeting my system. The greatest challenge was finding the correct documentation and download to do this. Here’s what I did, but as a disclaimer, I’m not redoing this process a second time. I’m documenting it from memory, so if anybody tries this, please let us know if I missed anything.

  1. Find the correct version of MLNX_OFED that’s the longest LTS from this chart. It’s 24.10-3.2.5.0.
  2. The documentation for various old MLNX versions is here now.
  3. Download the tarball for your specific distro here.
  4. I ran these commands:
tar xf MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-x86_64.tgz # Untar
cd MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-x86_64/        # Move into source
sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma        # Run automated (re)installer with desired features. It will compile the binaries but fail to install all of them.
# This intentionally did not update my NIC's firmware because my previous DOCA installs did that correctly already.
# I intentionally skip the script to add kernel support because I'm working on a "primary" supported distro and kernel.
cd /tmp/MLNX_OFED_LINUX-24.10-3.2.5.0-4.18.0-553.72.1.el8_10.x86_64/MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-ext/   # Move to compiled binaries
sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma        # Run this specific installer and it will succeed.
sudo dracut -f                                           # Regenerate the initramfs for automated driver startup on reboot
sudo reboot && exit                                      # Properly leave SSH while the system reboots
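
After the reboot, a quick way to confirm that the modules from the original error are now loaded could look like the sketch below. It is written from general knowledge rather than taken from my terminal history: the unloaded_modules function and the MODULES_FILE variable are made up for this example, and the only assumption about the system is that /proc/modules lists one loaded module per line with the name in the first column.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given kernel modules are NOT currently loaded.

# Print each module name from the arguments that does not appear in the first
# column of a /proc/modules-style file. MODULES_FILE is a variable invented for
# this sketch so the check can be pointed at a saved copy of that file.
unloaded_modules() {
    local modules_file="${MODULES_FILE:-/proc/modules}" m
    for m in "$@"; do
        awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' "$modules_file" \
            || echo "$m"
    done
}

# Example usage:
#   unloaded_modules mlx5_core mlx5_ib ib_umad ib_uverbs ib_ipoib
#   ofed_info -s      # print the installed MLNX_OFED version
#   ibv_devinfo       # list RDMA devices and their port state
```

No output from unloaded_modules means every listed module is loaded. Any name it prints is worth checking with modinfo before restarting openibd again.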

Voila, the NIC’s driver can be stopped and started and the physical ports are in my normal ifconfig now. I sure hope this unusual driver challenge doesn’t haunt me in any of my later networking setup. Let me know if anybody has any further questions about my setup, and I’ll try my best to answer.

Holy
