Has anyone else had issues with the kernel patch, or have any thoughts?
Per the subject, and after a wasted day, I removed MOFED 5.1 and tried to install the latest 5.2. It fails with “Failed to build MLNX_OFED_LINUX for 5.4.0-65-generic”. I have narrowed it down to the install option “--add-kernel-support”. If I omit that option but keep all the others, MOFED installs fine; with it, the installer throws the error above.
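One possible cause worth ruling out, offered as an assumption rather than a diagnosis: building modules for a specific kernel normally needs the headers for that exact kernel release, and an automatic kernel update can leave them missing. The sketch below only checks the conventional Ubuntu header location (/usr/src/linux-headers-<release>); the function names are hypothetical helpers, and nothing here is taken from the MOFED installer itself.

```shell
#!/bin/sh
# Sketch: check whether headers matching a kernel release are present.
# Assumption: Debian/Ubuntu layout, headers under /usr/src/linux-headers-<release>.

headers_dir_for() {
    # Print the conventional header directory for the given kernel release.
    printf '/usr/src/linux-headers-%s\n' "$1"
}

check_headers() {
    # $1: kernel release string, e.g. the output of `uname -r`
    dir=$(headers_dir_for "$1")
    if [ -d "$dir" ]; then
        echo "headers found: $dir"
    else
        echo "headers missing for $1 (expected $dir)"
        return 1
    fi
}

# Check the currently running kernel; do not abort the script if they are missing.
check_headers "$(uname -r)" || true
```

If the headers turn out to be missing for 5.4.0-65-generic, installing the matching linux-headers package before rerunning the installer would be the obvious next thing to try; if they are present, this is not the cause.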
The server appears to have been updated Friday night from Ubuntu 20.04.1 to 20.04.2, and the kernel advanced from 5.4.0-62 to 5.4.0-65. No other changes. It was working fine before the update: I was using it all week, rebooted at various times, and had no problems until it locked up Friday night. Saturday morning it was dead, and I had to wait until Monday to get someone in the data center to physically reboot it.
After a day of trying fixes, it now only boots into the BIOS screen. Doh! The last action was running the MOFED 5.2 installer without the “--add-kernel-support” option noted earlier. It said it updated the firmware on my ConnectX-6 card and that a reboot was needed. So I rebooted, but the server never came back up. I don’t know whether it is due to the kernel issue or the firmware issue, but I cannot get past the BIOS, so further troubleshooting is unlikely.
My setup is a new Supermicro with dual AMD EPYC2 and 512GB RAM, a SATA boot drive and four NVMe data drives, a dual-port Mellanox ConnectX-6, and an NVIDIA A100 GPU. The server is dedicated to testing NVIDIA GPUDirect Storage (beta), so my MOFED install follows their instructions.