I have a “Mellanox ConnectX-6 Dx” NIC and I’d like some help installing and configuring the drivers, specifically for RDMA. I’m targeting Rocky 8.10 with kernel 4.18.0-553.72.1.el8_10.x86_64. I’ve read through a lot of the documentation on the various software versions, and I believe the doca-networking profile has the software I need (see Which Profile to Install?). I followed the documentation for DOCA-Host Installation but got stuck when attempting to load the drivers:
[root@host_name ~]# /etc/init.d/openibd restart
mlx_compat is used by NVME. Leaving it loaded. [WARNING]
Detected driver update. To load the new driver version rebo[WARNING]uired.
Unloading HCA driver: [ OK ]
mlx_compat is used by NVME. Leaving it loaded. [WARNING]
Detected driver update. To load the new driver version rebo[WARNING]uired.
Unloading HCA driver: [ OK ]
Avoid loading inbox module: mlx5_ib [FAILED]
Loading Mellanox MLX5_IB HCA driver: [FAILED]
Module not found: ib_umad [FAILED]
Module not found: ib_uverbs [FAILED]
Module not found: ib_ipoib [FAILED]
No HCA kernel modules loaded: [FAILED]
Loading HCA driver and Access Layer: [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
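One plausible reading of the “Module not found” lines above is that no build of those kernel modules could be found for the running kernel. A minimal sketch for checking that, assuming `modinfo` is available on the host; the helper names `missing_modules` and `check_modules` are made up for illustration and are not part of openibd or DOCA:

```shell
# Extract the module names that an openibd log reports as "Module not found".
# Reads the log text on stdin and prints one module name per line.
missing_modules() {
    sed -n 's/^Module not found: \([A-Za-z0-9_]*\).*/\1/p'
}

# Hypothetical helper: for each module name given as an argument, report
# whether modinfo can find a build of it for the currently running kernel.
check_modules() {
    kver="$(uname -r)"
    for mod in "$@"; do
        if modinfo -k "$kver" "$mod" >/dev/null 2>&1; then
            echo "$mod: found for kernel $kver"
        else
            echo "$mod: not found for kernel $kver"
        fi
    done
}

# Example usage (run as root on the affected host):
#   /etc/init.d/openibd restart 2>&1 | tee /tmp/openibd.log
#   check_modules $(missing_modules < /tmp/openibd.log)
```

If a module shows up as not found for the running kernel, that would suggest the installed driver packages were built for a different kernel, or were only partially installed.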
What are these error messages telling me? What have I done wrong? Here’s what I’ve attempted to check:
The card is correctly powered and detected on the PCI bus:
Thank you for your time and ideas, @xidongs, but I did find a better solution, so I’m doing great!
To answer “what did I do to this thing previously?”: I confess I have no idea what I did or what I’m doing now. I was surprised when the drivers didn’t work OOTB after flashing the .iso. My very first dnf upgrade uninstalled the original DOCA driver and reinstalled the newest version automatically. Nothing in my monitoring showed any errors or hinted that anything was wrong.
So how did I fix this?
I decided to roll back to the last LTS MLNX_OFED driver, download the source, and manually compile it for my system. The greatest challenge was finding the correct documentation and download. Here’s what I did. As a disclaimer, I’m not going to repeat this process a second time; I’m documenting it from memory, so if anybody tries this, please let us know if I missed anything.
Find the MLNX_OFED version with the longest LTS support window in this chart. It’s 24.10-3.2.5.0.
The documentation for older MLNX_OFED versions now lives here.
Download the tarball for your specific distro here.
I ran these commands:
tar xf MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-x86_64.tgz # Untar
cd MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-x86_64/ # Move into the extracted directory
sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma # Run automated (re)installer with desired features. It will compile the binaries but fail to install all of them.
# This intentionally did not update my NIC's firmware because my previous DOCA installs did that correctly already.
# I intentionally skip the script to add kernel support because I'm working on a "primary" supported distro and kernel.
cd /tmp/MLNX_OFED_LINUX-24.10-3.2.5.0-4.18.0-553.72.1.el8_10.x86_64/MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-ext/ # Move to compiled binaries
sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma # Run this specific installer and it will succeed.
sudo dracut -f # Regenerate the initramfs for automated driver startup on reboot
sudo reboot && exit # Properly leave SSH while the system reboots
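The directory names in the commands above follow a pattern, so they can be derived rather than typed by hand. A rough sketch, assuming the rebuilt package always lands under /tmp in the same layout as the path shown above (that layout may differ on other systems); `extracted_dir` and `rebuilt_dir` are hypothetical helper names:

```shell
# Directory created by extracting the tarball: the file name minus ".tgz".
extracted_dir() {
    basename "$1" .tgz
}

# Hypothetical helper: guess where the first installer run leaves the rebuilt
# package, following the /tmp path seen in the commands above.
# Arguments: OFED version, distro tag (e.g. rhel8.10), kernel release.
rebuilt_dir() {
    echo "/tmp/MLNX_OFED_LINUX-$1-$3/MLNX_OFED_LINUX-$1-$2-ext"
}

# Example usage:
#   cd "$(extracted_dir MLNX_OFED_LINUX-24.10-3.2.5.0-rhel8.10-x86_64.tgz)"
#   cd "$(rebuilt_dir 24.10-3.2.5.0 rhel8.10 "$(uname -r)")"
```

Deriving the names this way helps avoid a mismatch between the tarball version and the directory being entered.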
Voila, the NIC’s driver can be stopped and started, and the physical ports show up in my normal ifconfig output now. I sure hope this unusual driver challenge doesn’t haunt me in any of my later networking setup. Let me know if anybody has any further questions about my setup, and I’ll try my best to answer.
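For anyone repeating this, a quick post-reboot sanity check might look like the sketch below. It assumes `ofed_info` and `ibv_devinfo` were installed by MLNX_OFED and that `ofed_info -s` prints a string containing `MLNX_OFED_LINUX-<version>`; the function names `ofed_version_is` and `verify_ofed` are hypothetical:

```shell
# Succeeds if the version string read on stdin (e.g. from `ofed_info -s`)
# contains the expected MLNX_OFED version.
ofed_version_is() {
    grep -qF "MLNX_OFED_LINUX-$1"
}

# Hypothetical post-reboot check; run on the host after the reboot.
verify_ofed() {
    if ofed_info -s | ofed_version_is "$1"; then
        echo "MLNX_OFED $1 is installed"
    else
        echo "MLNX_OFED $1 not reported by ofed_info"
    fi
    # Show the core mlx5 and RDMA modules if they are loaded.
    lsmod | grep -E '^(mlx5_core|mlx5_ib|ib_uverbs|ib_umad) ' \
        || echo "no mlx5/ib modules loaded"
    # List RDMA devices and their port state.
    ibv_devinfo
}

# Example usage:
#   verify_ofed 24.10-3.2.5.0
```

This only confirms that the expected version is installed and the modules are loaded; it does not exercise RDMA traffic.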