RHEL 8.10 Kernel 4.18.0-553.62.1.el8_10.x86_64 breaks DOCA 3.0.0 and 2.9.3

Since the release of kernel 4.18.0-553.62.1.el8_10.x86_64, third-party applications that rely on DOCA, such as the BeeGFS client, appear to be broken. The issue can be reproduced with both the latest DOCA release (3.0.0) and the latest LTS release (2.9.3). Compiling the BeeGFS client module with DOCA (2.9.3 or 3.0.0) on kernel 4.18.0-553.62.1.el8_10.x86_64 or later produces the following:

/etc/init.d/beegfs-client rebuild
- BeeGFS module autobuild
$OFED_INCLUDE_PATH = 
$OFED_INCLUDE_PATH = [/usr/src/ofa_kernel/default/include]
$OFED_INCLUDE_PATH = [/usr/src/ofa_kernel/default/include]
Building beegfs client module

feature detection gives: -DKERNEL_HAS_INODE_ATIME -DKERNEL_HAS_DENTRY_SUBDIRS -DKERNEL_HAS_SCHED_SIG_H -DKERNEL_HAS_STATX -DKERNEL_HAS_KREF_READ -DKERNEL_HAS_FILE_DENTRY -DKERNEL_HAS_SUPER_SETUP_BDI_NAME -DKERNEL_HAS_KERNEL_READ -DKERNEL_HAS_SKWQ_HAS_SLEEPER -DKERNEL_HAS_CURRENT_TIME_SPEC64 -DKERNEL_WAKE_UP_SYNC_KEY_HAS_3_ARGUMENTS -DKERNEL_HAS_GET_FS -DKERNEL_HAS_IOV_ITER_KVEC_NO_TYPE_FLAG_IN_DIRECTION -DKERNEL_HAS_PRINT_STACK_TRACE -DKERNEL_HAS_SOCKPTR_T -DKERNEL_HAS_TIME64 -DKERNEL_HAS_KTIME_GET_TS64 -DKERNEL_HAS_KTIME_GET_REAL_TS64 -DKERNEL_HAS_KTIME_GET_COARSE_REAL_TS64 -DKERNEL_HAS_GENERIC_FILE_SPLICE_READ -DKERNEL_HAS_GENERIC_PERMISSION_2 -DKERNEL_HAS_SETATTR_PREPARE -DKERNEL_HAS_GET_ACL -DKERNEL_HAS_SET_ACL -DKERNEL_HAS_FOPS_ITERATE -DKERNEL_HAS_XATTR_HANDLERS_INODE_ARG
In file included from ./include/linux/mmzone.h:10,
from ./include/linux/gfp.h:6,
from /usr/src/ofa_kernel/default/include/linux/gfp.h:6,
from ./include/linux/slab.h:15,
from /usr/src/ofa_kernel/default/include/linux/slab.h:6,
from ./include/linux/crypto.h:24,
from ./include/crypto/hash.h:16,
from ./include/linux/uio.h:16,
from ./include/linux/socket.h:8,
from ./include/uapi/linux/if.h:25,
from /usr/src/ofa_kernel/default/include/linux/compat-2.6.h:11,
from <command-line>:
./include/linux/spinlock.h:524:45: error: expected ‘)’ before ‘(’ token
DEFINE_LOCK_GUARD_1_COND(raw_spinlock, _try, raw_spin_trylock(_T->lock))
^
                                         )
In file included from ./include/linux/irqflags.h:16,
from ./arch/x86/include/asm/processor.h:35,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:39,
from ./include/linux/uio.h:15,
from ./include/linux/socket.h:8,
from ./include/uapi/linux/if.h:25,
from /usr/src/ofa_kernel/default/include/linux/compat-2.6.h:11,
from <command-line>:
/usr/src/ofa_kernel/default/include/linux/cleanup.h:148:3: warning: data definition has no type or storage class
} class_##_name##_t;       \
^~~~~~
/usr/src/ofa_kernel/default/include/linux/cleanup.h:174:1: note: in expansion of macro ‘__DEFINE_UNLOCK_GUARD’
__DEFINE_UNLOCK_GUARD(_name, _type, _unlock, __VA_ARGS__)  \
^~~~~~~~~~~~~~~~~~~~~
./include/linux/spinlock.h:526:1: note: in expansion of macro ‘DEFINE_LOCK_GUARD_1’
DEFINE_LOCK_GUARD_1(raw_spinlock_nested, raw_spinlock_t,
^~~~~~~~~~~~~~~~~~~
/usr/src/ofa_kernel/default/include/linux/cleanup.h:148:3: error: type defaults to ‘int’ in declaration of ‘class_raw_spinlock_nested_t’ [-Werror=implicit-int]
} class_##_name##_t;       \

.
.
.

/opt/beegfs/src/client/client_module_7/build/../source/common/nodes/NodeTree.h:56:11: note: in expansion of macro ‘rb_entry’
return rb_entry(this->value, Node, _nodeTree.rbTreeElement);
^~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:318: /opt/beegfs/src/client/client_module_7/build/../source/net/filesystem/FhgfsOpsCommKit.o] Error 1
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:318: /opt/beegfs/src/client/client_module_7/build/../source/net/filesystem/FhgfsOpsRemoting.o] Error 1
make[2]: *** [Makefile:1619: _module_/opt/beegfs/src/client/client_module_7/build/../source] Error 2
make[1]: *** [Makefile:200: module] Error 2
make: *** [AutoRebuild.mk:34: auto_rebuild] Error 2

This looks like a complex issue involving both Red Hat and NVIDIA. However, with DOCA removed from the system everything builds, which narrows the problem down to DOCA.

The /opt/mellanox/doca/tools/doca-kernel-support script appears to have a bug: the packages it generates carry versions that dnf treats as older than the ones supplied by Mellanox/NVIDIA. As a result, running dnf update does not update everything it should, and the OFED installation ends up broken.

If we force the installation of the generated packages, dnf treats it as a downgrade and has to remove the doca-ofed metapackage:

[root@rhel810 4.18.0-553.70.1.el8_10.x86_64]# dnf install *.rpm --allowerasing
Updating Subscription Management repositories.
Last metadata expiration check: 0:55:09 ago on Fri 22 Aug 2025 11:54:32 AM -03.
Package kmod-iser-24.10-OFED.24.10.3.2.5.1.1.x86_64 is already installed.
Package kmod-isert-24.10-OFED.24.10.3.2.5.1.1.x86_64 is already installed.
Package kmod-kernel-mft-mlnx-4.30.1-1.1.x86_64 is already installed.
Package kmod-knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.1.x86_64 is already installed.
Package kmod-mlnx-ofa_kernel-24.10-OFED.24.10.3.2.5.1.1.x86_64 is already installed.
Package kmod-srp-24.10-OFED.24.10.3.2.5.1.1.x86_64 is already installed.
Package kmod-xpmem-2.7.4-1.2410068.1.x86_64 is already installed.
Dependencies resolved.
==========================================================================================
 Package                          Arch   Version                      Repository     Size
==========================================================================================
Installing:
 doca-kernel-4.18.0.553.70.1.el8.10.x86.64
                                  noarch 24.10.3.2.5.0-1.kver.4.18.0.553.70.1.el8.10.x86.64
                                                                      @commandline  7.0 k
 fwctl-debugsource                x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   16 k
 iser-debugsource                 x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   35 k
 isert-debugsource                x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   27 k
 kmod-fwctl                       x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline   20 k
 kmod-fwctl-debuginfo             x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  310 k
 kmod-iser-debuginfo              x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  534 k
 kmod-isert-debuginfo             x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  326 k
 kmod-mlnx-nfsrdma                x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline   60 k
 kmod-mlnx-nfsrdma-debuginfo      x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  1.4 M
 kmod-mlnx-nvme                   x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  306 k
 kmod-mlnx-nvme-debuginfo         x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  3.3 M
 kmod-mlnx-ofa_kernel-debuginfo   x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline   35 M
 kmod-srp-debuginfo               x86_64 24.10-OFED.24.10.3.2.5.1.1   @commandline  436 k
 knem                             x86_64 1.1.4.90mlnx3-OFED.23.10.0.2.1.1
                                                                      @commandline   56 k
 mlnx-nfsrdma-debugsource         x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   81 k
 mlnx-nvme-debugsource            x86_64 24.10-OFED.24.10.3.2.5.1     @commandline  273 k
 mlnx-ofa_kernel-debugsource      x86_64 24.10-OFED.24.10.3.2.5.1     @commandline  1.7 M
 mlnx-ofa_kernel-devel-debuginfo  x86_64 24.10-OFED.24.10.3.2.5.1     @commandline  395 k
 srp-debugsource                  x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   49 k
Removing dependent packages:
 doca-ofed                        x86_64 2.9.3-0.2.2                  @doca-2.9-lts   0  
Downgrading:
 mlnx-ofa_kernel                  x86_64 24.10-OFED.24.10.3.2.5.1     @commandline   39 k
 mlnx-ofa_kernel-devel            x86_64 24.10-OFED.24.10.3.2.5.1     @commandline  1.2 M
 mlnx-ofa_kernel-source           x86_64 24.10-OFED.24.10.3.2.5.1     @commandline  3.3 M
 xpmem                            x86_64 2.7.4-1.2410068              @commandline   20 k

Transaction Summary
==========================================================================================
Install    20 Packages
Remove      1 Package
Downgrade   4 Packages

Total size: 49 M

Forcing the installation of these supposedly “older” packages fixes the problem.
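
To check which packages dnf would treat as a downgrade before committing to the transaction, the transaction table can be filtered. This is only a minimal sketch that assumes the dnf output layout shown above; list_downgrades is a hypothetical helper name and is not part of DOCA or dnf.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DOCA or dnf): read a dnf transaction
# table on stdin and print the package names listed under "Downgrading:".
list_downgrades() {
    awk '
        /^Downgrading:/ { in_section = 1; next }  # section header opens the block
        /^[^ ]/         { in_section = 0 }        # any other unindented line closes it
        in_section && /^ [^ ]/ { print $1 }       # package rows start with one space
    '
}
```

One possible use is to pipe a dry run into it, for example `dnf install ./*.rpm --allowerasing --assumeno | list_downgrades`, where --assumeno makes dnf answer "no" to the confirmation prompt so nothing is changed. Wrapped continuation lines in the table are indented by more than one space, so they are skipped.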

Hi,

Thanks for the detailed explanation of the issue.

To investigate, we will need to take a deeper look at the scenario and reproduce the issue in the lab.

This requires opening a new support case, either in the NVIDIA portal or by sending an email to enterprisesupport@nvidia.com.

The case will then be handled according to the entitlement.

Best Regards,

Anatoly