MOFED 5.2 installer fails after Ubuntu kernel updated from 5.4.0-62 to 5.4.0-65

Anyone else have issues with the kernel patch or have a thought?

Per the subject, and after a wasted day: I removed MOFED 5.1 and tried to install the latest 5.2. It fails with "Failed to build MLNX_OFED_LINUX for 5.4.0-65-generic". I have narrowed it down to the install option --add-kernel-support. If I omit that option but keep all the others, MOFED installs fine; with it, the installer throws the error above.
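Since --add-kernel-support rebuilds the driver packages against the running kernel, one thing worth ruling out after a kernel update is a missing build tree (headers) for the new kernel. This is only a sketch of that check, not a confirmed cause of the failure above. The function name has_kernel_build_tree and its optional root-prefix argument are made up for illustration; the /lib/modules/<release>/build location is the usual place kernel module builds look.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MOFED): succeed if a kernel build tree
# exists for the given release.
#   $1 = kernel release, e.g. the output of `uname -r`
#   $2 = optional root prefix, only useful for testing (default: empty)
has_kernel_build_tree() {
    local release="$1"
    local root="${2:-}"
    # On Ubuntu this path is normally a symlink into /usr/src/linux-headers-*;
    # -e follows the link, so a dangling link counts as missing.
    [ -e "${root}/lib/modules/${release}/build" ]
}

# Example use (left commented out so sourcing this has no side effects):
# has_kernel_build_tree "$(uname -r)" || echo "no build tree for $(uname -r)"
```

If the check fails for 5.4.0-65-generic, installing the matching linux-headers package before re-running the installer would be the obvious next step; if it passes, the cause is elsewhere and the debbuild log is the place to look.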

The server appears to have been updated Friday night from Ubuntu 20.04.1 to 20.04.2, and the kernel advanced from 5.4.0-62 to 5.4.0-65. No other changes. It was working fine before the update: I was using it all week, rebooted at various times, and did not have any problems until it locked up Friday night. Saturday morning it was dead, and I had to wait until Monday to get someone in the data center to physically reboot it.

After a day of trying fixes, it now boots only into the BIOS screen. Doh! The last action was running the MOFED 5.2 installer without the --add-kernel-support option noted earlier. It said it updated the firmware on my ConnectX-6 card and that a reboot was needed. So I rebooted, but the server never came back up. I don't know whether that is due to the kernel issue or the firmware update, but I cannot get past the BIOS, so further troubleshooting is unlikely.

My setup is a new Supermicro with dual AMD EPYC2 and 512GB RAM, a SATA boot drive and four NVMe data drives, a dual-port Mellanox ConnectX-6, and an NVIDIA A100 GPU. The server is dedicated to testing NVIDIA GPUDirect Storage (beta), so my MOFED install follows their instructions.

UPDATE

We rebuilt the server from scratch: we wiped the boot drive and re-loaded Ubuntu. MOFED install fails with the same errors as before. We did not load any other software.

The main log file:

Installing new packages

Building DEB for ofed-scripts-5.2 (ofed-scripts)…

Running /usr/bin/dpkg-buildpackage -us -uc

Building DEB for mlnx-ofed-kernel-utils-5.2 (mlnx-ofed-kernel)…

-W- --with-mlx5-ipsec is enabled

Running /usr/bin/dpkg-buildpackage -us -uc

Failed to build mlnx-ofed-kernel DEB

Collecting debug info…

See /tmp/MLNX_OFED_LINUX-5.2-1.0.4.0-5.4.0-65-generic/mlnx_iso.4078_logs/OFED.4311.logs/mlnx-ofed-kernel.debbuild.log

The debbuild.log refers to a debug log. That’s a bit too long to post here.
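When a build log is too long to post, trimming it to the first few error lines usually gives readers enough to go on. A minimal sketch, assuming the log is plain text with optional terminal color codes and that the interesting lines contain the word "error"; the function names strip_ansi and first_errors are illustrative, not part of the installer.

```shell
#!/usr/bin/env bash
# Remove ANSI color sequences (ESC [ ... m) from stdin.
strip_ansi() {
    local esc
    esc="$(printf '\033')"
    sed "s/${esc}\[[0-9;]*m//g"
}

# Print the first N lines of a log that mention "error" (case-insensitive),
# each prefixed with its line number in the file.
#   $1 = path to the log file
#   $2 = maximum number of lines to print (default: 20)
first_errors() {
    local logfile="$1"
    local limit="${2:-20}"
    strip_ansi < "$logfile" | grep -n -i 'error' | head -n "$limit"
}

# Example use (commented out; substitute the log path the installer printed):
# first_errors /path/to/mlnx-ofed-kernel.debbuild.log 20
```

The first compiler or dpkg error is normally the one that matters; later ones tend to be fallout from it.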

Hello Mark,

Thank you for posting your inquiry on the NVIDIA Networking Community.

We released an update on MLNX_OFED 5.2 yesterday → https://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.0.0/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64.tgz

This version installs without any issues. See below output:

Install syntax: # ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds -vvv

# lsb_release -ra ; uname -r ; ofed_info -s

No LSB modules are available.

Distributor ID: Ubuntu

Description: Ubuntu 20.04.2 LTS

Release: 20.04

Codename: focal

5.4.0-65-generic

MLNX_OFED_LINUX-5.2-2.2.0.0:
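To confirm from a script that the updated release is the one actually installed, the version can be pulled out of the ofed_info -s line shown above. This is a small sketch that assumes the output keeps the MLNX_OFED_LINUX-<version>: shape seen here; ofed_version is a made-up helper name.

```shell
#!/usr/bin/env bash
# Extract the bare version from a line shaped like "MLNX_OFED_LINUX-<version>:".
#   $1 = the line printed by `ofed_info -s`
ofed_version() {
    local s="$1"
    s="${s#MLNX_OFED_LINUX-}"   # drop the product prefix
    s="${s%:}"                  # drop the trailing colon
    printf '%s\n' "$s"
}

# Example use (commented out; requires MOFED to be installed):
# [ "$(ofed_version "$(ofed_info -s)")" = "5.2-2.2.0.0" ] && echo "updated build installed"
```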

Snippet from install log (general.log)

Installing libdapl-dev-2.1.10.1.mlnx…

Running /usr/bin/dpkg -i --force-confmiss /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/libdapl-dev_2.1.10.1.mlnx-OFED.4.9.0.1.4.52220_amd64.deb

Installing dpcp-1.1.0…

Running /usr/bin/dpkg -i --force-confmiss /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/dpcp_1.1.0-1.52220_amd64.deb

Installing srptools-52mlnx1…

Running /usr/bin/dpkg -i --force-confmiss /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/srptools_52mlnx1-1.52220_amd64.deb

Installing mlnx-ethtool-5.8…

Running /usr/bin/dpkg -i --force-confmiss /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/mlnx-ethtool_5.8-1.52220_amd64.deb

Installing mlnx-iproute2-5.8.0…

Running /usr/bin/dpkg -i --force-confmiss /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/mlnx-iproute2_5.8.0-1.52220_amd64.deb

Running: FW_UPDATE_FLAGS='--log /tmp/MLNX_OFED_LINUX.10306.logs/fw_update.log -v --tmpdir /tmp' RUN_FW_UPDATER='yes' /usr/bin/dpkg -i /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/mlnx-fw-updater_5.2-2.2.0.0_amd64.deb

Running: /usr/bin/dpkg-deb -x /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/mlnx-ofed-kernel-dkms_5.2-OFED.5.2.2.2.0.1_all.deb /var/tmp/mlnx-ofed-kernel_module-check 2>/dev/null

is_module_in_deb: ipoib is in /var/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/DEBS/mlnx-ofed-kernel-dkms_5.2-OFED.5.2.2.2.0.1_all.deb

Installation passed successfully

To load the new driver, run:

/etc/init.d/openibd restart

Note: In order to load the new nvme-rdma and nvmet-rdma modules, the nvme module must be reloaded.

If you are still experiencing installation issues with this driver release, please open an NVEX Technical Support ticket by sending an email to networking-support@nvidia.com

Thank you and regards,

~NVIDIA Networking Technical Support