Modprobe: ERROR: could not insert 'nvme_rdma': Invalid argument

Description

I am trying to use NVMe on an OpenShift cluster. I built the nvme kernel modules with the DTK, copied nvme and nvme-rdma to the RHCOS image, and rebooted the node.

From the NVIDIA GPU Operator pod I get:

[08-Mar-25_05:03:48] NVIDIA driver container exec start
[08-Mar-25_05:03:48] Container full version: 24.10-0.7.0.0-0
[08-Mar-25_05:03:48] Verifying loaded modules will not prevent future driver restart
[08-Mar-25_05:03:48] Executing driver sources container
[08-Mar-25_05:03:48] Drivers inventory path is set: /mnt/drivers-inventory
[08-Mar-25_05:03:48] Unsetting driver ready state

[08-Mar-25_05:03:48] Deleting udev rules
[08-Mar-25_05:03:48] Query VFs info from [1] devices
[08-Mar-25_05:03:48] Query representors info from [1] devices
[08-Mar-25_05:03:48] Skipping driver build, reusing previously built packages for kernel 5.14.0-284.99.1.el9_2.x86_64

Verifying… ########################################
Preparing… ########################################
Updating / installing…
mlnx-ofa_kernel-debugsource-24.10-OFED########################################
kmod-mlnx-ofa_kernel-24.10-OFED.24.10.########################################
mlnx-tools-24.10-0.2410068 ########################################
mlnx-nvme-debugsource-24.10-OFED.24.10########################################
mlnx-nfsrdma-debugsource-24.10-OFED.24########################################
kmod-mlnx-nfsrdma-debuginfo-24.10-OFED########################################
kmod-mlnx-nvme-debuginfo-24.10-OFED.24########################################
mlnx-ofa_kernel-24.10-OFED.24.10.0.7.0########################################
Configured /etc/security/limits.conf
kmod-mlnx-nfsrdma-24.10-OFED.24.10.0.6########################################
kmod-mlnx-nvme-24.10-OFED.24.10.0.6.7.########################################
kmod-mlnx-ofa_kernel-debuginfo-24.10-O########################################
mlnx-ofa_kernel-devel-debuginfo-24.10-########################################
xpmem-2.7.4-1.2410068.rhel9u2 ########################################
ofed-scripts-24.10-OFED.24.10.0.7.0 ########################################
mlnx-ofa_kernel-source-24.10-OFED.24.1########################################
mlnx-ofa_kernel-devel-24.10-OFED.24.10########################################
kmod-xpmem-2.7.4-1.2410068.rhel9u2.rhe########################################
kmod-kernel-mft-mlnx-4.30.0-1.rhel9u2 ########################################

cat: /sys/module/nvme_rdma/srcversion: No such file or directory

ID="rhcos"
VERSION_ID="4.14"
RHEL_VERSION="9.2"

[08-Mar-25_05:03:49] Apply blacklisted mofed modules file to host (/etc/modprobe.d/blacklist-ofed-modules.conf)
Function: generate_ofed_modules_blacklist
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]
[08-Mar-25_05:03:56] Remove blacklisted mofed modules file from host
[08-Mar-25_05:03:56] warning - nvme kernel module currently loaded does not match module from container

modprobe: ERROR: could not insert 'nvme_rdma': Invalid argument

[08-Mar-25_05:03:56] Command "modprobe nvme-rdma" failed with exit code: 1
[08-Mar-25_05:03:56] Remove blacklisted mofed modules file from host

I have nvme.ko and nvme-rdma.ko on the host, but when I try to load the module there I get the same error:

modprobe: ERROR: could not insert 'nvme_rdma': Invalid argument

What am I doing wrong?
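
The log above warns that the loaded nvme module does not match the one from the container, and it fails to read /sys/module/nvme_rdma/srcversion. A minimal sketch of how that comparison could be reproduced by hand is below. It assumes Python 3 and modinfo are available in a shell that can see the host's /sys and module tree (for example a debug or toolbox container). The helper names are hypothetical, and this is not the driver container's actual check.

```python
import subprocess
from pathlib import Path
from typing import Optional


def loaded_srcversion(module: str, sys_root: str = "/sys/module") -> Optional[str]:
    """Return the srcversion of a currently loaded module, or None if absent."""
    # sysfs uses underscores in module names (nvme_rdma, not nvme-rdma).
    path = Path(sys_root) / module.replace("-", "_") / "srcversion"
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        # Module is not loaded (or exposes no srcversion file).
        return None


def on_disk_srcversion(module: str) -> Optional[str]:
    """Return the srcversion of the module file that modinfo resolves."""
    result = subprocess.run(
        ["modinfo", "-F", "srcversion", module],
        capture_output=True,
        text=True,
        check=False,
    )
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def describe(module: str, loaded: Optional[str], on_disk: Optional[str]) -> str:
    """Summarise how the loaded module relates to the on-disk module file."""
    if on_disk is None:
        return f"{module}: no module file found by modinfo"
    if loaded is None:
        return f"{module}: not loaded; on-disk srcversion {on_disk}"
    if loaded == on_disk:
        return f"{module}: loaded module matches on-disk file"
    return f"{module}: MISMATCH loaded={loaded} on-disk={on_disk}"


def report(modules=("nvme", "nvme_rdma")) -> None:
    """Print one status line per module; call this manually on the node."""
    for name in modules:
        print(describe(name, loaded_srcversion(name), on_disk_srcversion(name)))
```

Calling report() prints one line per module. A MISMATCH line for nvme would be consistent with the warning in the log. The kernel log (dmesg) read right after the failed modprobe usually carries the specific reason behind "Invalid argument".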
