Can't build drivers for OFED 4.9 in RHEL/CentOS 8.6 with 4.18.0-372.19.1.el8_6.x86_64 kernel

Hi,

I’m trying to install Mellanox OFED 4.9-5.1.0.0 LTS on the latest kernel of RHEL/CentOS 8.6 (strictly speaking, Rocky Linux). The release notes list 4.18.0-372.9.1.el8.x86_64 as the last supported kernel, but my latest update (8.5 → 8.6) seems to have skipped 4.18.0-372.9.1.el8_6.x86_64 entirely in favor of 4.18.0-372.19.1.el8_6.x86_64. Mellanox OFED was not previously installed.

From what I can gather, the problem is that the newer kernel headers now define some functions that the driver sources also define, so the two sets of definitions conflict.

./mlnxofedinstall --distro RHEL8.6 --upstream-libs --add-kernel-support

ERROR: Failed executing "MLNX_OFED_SRC-4.9-5.1.0.0/install.pl --tmpdir /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001_logs --kernel-only --kernel 4.18.0-372.19.1.el8_6.x86_64 --kernel-sources /lib/modules/4.18.0-372.19.1.el8_6.x86_64/build --builddir /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001 --disable-kmp --build-only --distro rhel8.6"
ERROR: See /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001_logs/mlnx_ofed_iso.66001.log
Failed to build MLNX_OFED_LINUX for 4.18.0-372.19.1.el8_6.x86_64

Then in /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001_logs/mlnx_ofed_iso.66001.log

Build ofed-scripts 4.9 RPM
Running  rpmbuild --rebuild  --define '_topdir /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir' --define '_sourcedir %{_topdir}/SOURCES' --define '_specdir %{_topdir}/SPECS' --define '_srcrpmdir %{_topdir}/SRPMS' --define '_rpmdir %{_topdir}/RPMS'  --define 'dist %{nil}' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/MLNX_OFED_SRC-4.9-5.1.0.0/SRPMS/ofed-scripts-4.9-OFED.4.9.5.1.0.src.rpm
Build mlnx-ofa_kernel 4.9 RPM

-W- --with-mlx5-ipsec is enabled
Running  rpmbuild --rebuild  --define '_topdir /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir' --define '_sourcedir %{_topdir}/SOURCES' --define '_specdir %{_topdir}/SPECS' --define '_srcrpmdir %{_topdir}/SRPMS' --define '_rpmdir %{_topdir}/RPMS'  --nodeps --define '_dist .rhel8u6' --define 'configure_options   --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mlxfw-mod --with-mlx4-mod --with-mlx4_en-mod --with-mlx5-mod --with-mlx5-ipsec --with-ipoib-mod --with-innova-flex --with-innova-ipsec --with-mdev-mod --with-srp-mod --with-iser-mod --with-isert-mod' --define 'KVERSION 4.18.0-372.19.1.el8_6.x86_64' --define 'K_SRC /lib/modules/4.18.0-372.19.1.el8_6.x86_64/build' --define '_prefix /usr' /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/MLNX_OFED_SRC-4.9-5.1.0.0/SRPMS/mlnx-ofa_kernel-4.9-OFED.4.9.5.1.0.1.src.rpm
Failed to build mlnx-ofa_kernel 4.9 RPM
Collecting debug info...
See /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001_logs/OFED.66369.logs/mlnx-ofa_kernel-4.9.rpmbuild.log

Some relevant errors from the rpmbuild.log:

/tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/include/linux/mm.h:15:21: error: conflicting types for 'kvzalloc'
   15 | static inline void *kvzalloc(unsigned long size,...) {
      |                     ^~~~~~~~
In file included from /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/include/linux/slab.h:6,
                 from include/linux/crypto.h:24,
                 from include/crypto/hash.h:16,
                 from include/linux/uio.h:16,
                 from include/linux/socket.h:8,
                 from /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/include/linux/socket.h:4,
                 from ./include/uapi/linux/if.h:25,
                 from /tmp/MLNX_OFED_LINUX-4.9-5.1.0.0-4.18.0-372.19.1.el8_6.x86_64/mlnx_iso.66001/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/include/linux/compat-2.6.h:12,
                 from <command-line>:

mlnx-ofa_kernel-4.9.rpmbuild.log (880.1 KB)

(I’ve uploaded the full log since it’s too long to properly abbreviate)
In the kernel-devel file /usr/src/kernels/4.18.0-372.19.1.el8_6.x86_64/include/linux/slab.h the following definitions seem to be new with respect to other kernels (I took a look at the same file in a machine with an older kernel):

static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
{
    return kvmalloc_node(size, flags | __GFP_ZERO, node);
}
static inline void *kvzalloc(size_t size, gfp_t flags)
{
    return kvmalloc(size, flags | __GFP_ZERO);
}

Finally, $build_dir/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/source/include/linux/mm.h contains a conflicting definition:

 10 
 11 #ifndef HAVE_KVZALLOC
 12 #include <linux/vmalloc.h>
 13 #include <linux/slab.h>
 14 
 15 static inline void *kvzalloc(unsigned long size,...) {
 16     void *rtn;
 17 
 18     rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 19     if (!rtn)
 20         rtn = vzalloc(size);
 21     return rtn;
 22 }
 23 #endif

Looking at the full rpmbuild.log, the same thing seems to be happening with kvcalloc, kvmalloc_array, kvmalloc_node, etc., always in default/include/linux/slab.h and default/include/linux/mm.h.
I’m guessing it might be possible to adjust the compilation options to ignore local definitions and use the kernel’s, but:

  • I don’t see in the documentation if this would be possible or how to do it (Do I have to unpack the source, modify the makefile and pack again?)
  • I’m not familiar enough with the code to know if that would actually work or instead break more things.
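To see which of these helpers the installed kernel headers already provide, a quick grep over the header is enough. This is only a rough sketch: the function name kv_helpers_defined is made up for illustration, the list of helper names is taken from the errors mentioned above, and the header path is the one from my kernel-devel package.

```shell
# List which kv* allocation helpers a given kernel header defines or declares.
# Any name printed here would clash with a compat definition of the same name.
# kv_helpers_defined is a hypothetical helper, not part of MLNX_OFED.
kv_helpers_defined() {
    # Match definitions such as "static inline void *kvzalloc(" but not
    # plain calls like "return kvmalloc_node(...)" (no '*' in front).
    grep -oE '\*(kvzalloc|kvzalloc_node|kvcalloc|kvmalloc_array|kvmalloc_node)\(' "$1" \
        | tr -d '*(' | sort -u
}

# Run against the running kernel's headers if they are installed.
hdr="/usr/src/kernels/$(uname -r)/include/linux/slab.h"
if [ -r "$hdr" ]; then
    kv_helpers_defined "$hdr"
fi
```

If a name shows up here while the build still compiles the fallback in the OFED mm.h, the HAVE_* check for that helper presumably did not detect it, which would match the "conflicting types" errors.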

I would really appreciate it if someone could point me in the right direction. I’m guessing this will be addressed in the next release, but it would be cleaner to be able to compile now than to downgrade everything to 8.5 while waiting for the release.

Regards,

Joaquín Torres.

HPC System Administrator.
Centro Atómico Constituyentes.
Comisión Nacional de Energía Atómica.
Villa Maipú. Buenos Aires, Argentina.

Dear Mellanox support,

This also holds for OFED 5.4-3.4.0.0 from the LTS branch of MOFED.

Cheers, Peter

Hello @torres2,

I’ve reproduced the issue and tested different build parameters, but had no luck finding a solution.
It looks like the best option for now is to wait for a new MLNX_OFED release, which should include support for the new kernels.

If it works for you, you may temporarily go back to 4.18.0-372.9.1.el8.x86_64, which is installed with Rocky 8.6 by default.
There are no issues compiling the driver for this kernel version.

Btw, have you tried installing OFED via YUM, using the mounted ISO image as a local repository?
MLNX_OFED supports KMP within the same OS release, so it should work without recompiling.
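
A minimal sketch of the local-repository setup, under some assumptions: the mount point /mnt/mlnx_ofed and the function name write_ofed_repo are placeholders, the ISO is assumed to carry its packages in an RPMS directory at its root, and gpgcheck is turned off only to keep the example short (importing the vendor GPG key is the cleaner option).

```shell
# Write a yum/dnf repo definition for a mounted MLNX_OFED ISO.
# $1: mount point of the ISO (placeholder, e.g. /mnt/mlnx_ofed)
# $2: path of the .repo file to create
# write_ofed_repo is a hypothetical helper, not part of MLNX_OFED.
write_ofed_repo() {
    cat > "$2" <<EOF
[mlnx_ofed]
name=MLNX_OFED local repository
baseurl=file://$1/RPMS
enabled=1
gpgcheck=0
EOF
}

# Typical use, as root (ISO file name left generic on purpose):
#   mount -o ro,loop MLNX_OFED_LINUX-<version>.iso /mnt/mlnx_ofed
#   write_ofed_repo /mnt/mlnx_ofed /etc/yum.repos.d/mlnx_ofed.repo
#   yum install mlnx-ofed-all
```

The mlnx-ofed-all meta-package name is the one I recall from the MLNX_OFED documentation; `yum search mlnx-ofed` should show what the repository actually offers.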

Regards,
Vladislav

Using the yum local repo appears to have worked for now. Thanks! Since the documentation specified that “unsupported kernels require rebuilding the drivers”, I had assumed that both install methods were interchangeable; I only just noticed the comment about KMP support. Is the script method preferred for some reason, or is it just given more visibility for portability?

I did, however, notice some erratic behavior trying this out: The installation process seemingly added kernel-core 4.18.0-372.9.1.el8_6 as a dependency (It wasn’t previously installed, my latest upgrade went from rev 8.5 with a 4.18.0-348.20.1.el8_5 kernel straight to 4.18.0-372.19.1.el8_6). This is listed under dnf/yum list kernel-core but not under dnf/yum list kernel, so I’m guessing not all kernel components were installed. This seemed OK until I rebooted the server and the booting process just crashed with a lot of “warning dracut-initqueue timeout”, multiple problems recognizing partitions and finally a dracut root shell. Rebooting was working fine previously.

Apparently, the yum installation of kernel-core 4.18.0-372.9.1.el8_6 added a new boot option to grub and set it as the default. When I manually selected 4.18.0-372.19.1.el8_6 instead, the system booted correctly and InfiniBand worked properly. I changed the default to 4.18.0-372.19.1.el8_6 with

$ grub2-set-default 0 # In my case this is the correct kernel
$ grub2-mkconfig -o /boot/efi/EFI/rocky/grub.cfg # I assume for RHEL this should be under 'redhat' instead of 'rocky'
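
An alternative that avoids guessing the menu index is grubby, which ships with RHEL 8-family systems and selects the default by kernel image path. The sketch below is only that: newest_kernel_image is a made-up helper name, and it assumes GNU sort (for -V) and the usual /boot/vmlinuz-<release> naming.

```shell
# Print the newest kernel image from a list of /boot/vmlinuz-* paths on stdin.
# Rescue images are skipped; version ordering relies on GNU "sort -V".
# newest_kernel_image is a hypothetical helper name.
newest_kernel_image() {
    grep -v -- '-rescue-' | sort -V | tail -n 1
}

# Typical use, as root:
#   grubby --default-kernel                       # show the current default
#   ls /boot/vmlinuz-* | newest_kernel_image      # show what would be picked
#   grubby --set-default "$(ls /boot/vmlinuz-* | newest_kernel_image)"
```

Version sort matters here because a plain lexical sort would put 372.9.1 after 372.19.1 and pick the wrong kernel.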

My guess is, since only the kernel-core package was installed, the older kernel didn’t have all necessary components to properly boot the system and shouldn’t have been set to default. I don’t know much about KMP but I’m guessing the OFED packages still work using the old kernel-core as a dependency even when the currently loaded kernel is 4.18.0-372.19.1.el8_6.

It worries me a bit that the installation broke the booting process initially, but for now it seems to be working. I’ll keep testing to see if everything is in order.

Regards,
Joaquín.

Hi @vkhomyakov

Do you have an estimate for a new release of the 4.9.x LTS branch containing this fix? We recently needed to upgrade our Rocky 8 kernel to 4.18.0-372.26.1.el8_6.x86_64 in order to fix a kernel bug that prevented Intel MPI from working.

I appreciate the workaround above, but given the clash in kernel functions and data structures, I’m hesitant to install a source-incompatible version of MLNX OFED on a production system.

Hello @m.pacey

If you need to be sure that a particular fix is added to a release, I suggest opening a support case.
Without one, we can’t guarantee any related improvements in an LTS or non-LTS release.

Regards,
Vladislav

Hi @vkhomyakov

Thanks for the update. Can you advise on the best route to opening a support case for this? I’ve tried the Support link on the driver page, but https://support.mellanox.com is currently unreachable as it’s serving an untrusted certificate.

@m.pacey

Please check whether you have access to https://nvid.nvidia.com
You can also reach support by email: enterprisesupport@nvidia.com

Regards,
Vladislav
