Installing OFED with NVMe-oF on a System with OS Installed on NVMe

Hi OFED Driver Team,

I am installing OFED on a server with a ConnectX-7 adapter and want to enable NVMe-oF. The complication is that the server’s operating system is itself installed on an NVMe SSD. After OFED installed successfully, the installer printed the following message:

Note: In order to load the new nvme-rdma and nvmet-rdma modules, the nvme module must be reloaded.

$ sudo modprobe -r nvme
modprobe: FATAL: Module nvme is in use.

The problem appears to be that modprobe -r nvme fails due to the NVMe SSD being used as the system drive.
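
A rough sketch of how this could be confirmed. The helper name nvme_use_count is made up for illustration; it only reads the "Used by" count that lsmod prints in its third column.

```shell
# Hypothetical helper: print the use count of the nvme module,
# reading lsmod-style output from stdin.
# lsmod columns: Module, Size, Used by (count, then dependent modules).
nvme_use_count() {
    awk '$1 == "nvme" { print $3 }'
}
```

Typical use would be `lsmod | nvme_use_count`, together with `findmnt -no SOURCE /` to see which device holds the root filesystem. If the count is non-zero and the root device is an NVMe namespace, the module presumably cannot be removed while the system is running from that disk.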

My question is: How can I safely reload the nvme module and load nvme-rdma and nvmet-rdma when the OS is also installed on an NVMe SSD?

I would appreciate any guidance on how to properly reload these modules without disrupting the OS on the NVMe SSD. Thank you for your assistance!


Environment Details:

$ uname -r
6.8.0-41-generic

$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 24.04 LTS
Release:    24.04
Codename:    noble

$ ibstat
CA 'mlx5_0'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.41.1000
    Hardware version: 0
    Node GUID: ----
    System image GUID: ----
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 400
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: ----
        Link layer: Ethernet

OFED Installation Command Used:

wget "https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu24.04-x86_64.iso"
sudo mlnxofedinstall -vvv --with-nvmf --add-kernel-support --basic --force --force-fw-update

End of mlnxofedinstall Log:

Device (c1:00.0):
    c1:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
    Link Width: x16
    PCI Link Speed: Unknown

Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
Note: In order to load the new nvme-rdma and nvmet-rdma modules, the nvme module must be reloaded.

dmesg Errors Encountered when Loading Modules (nvme-fabrics, nvme-rdma, nvmet, nvmet-rdma):

[ 1711.360216] nvme_fabrics: disagrees about version of symbol __nvme_submit_sync_cmd
[ 1711.360222] nvme_fabrics: Unknown symbol __nvme_submit_sync_cmd (err -22)

[ 1724.339810] nvmet: disagrees about version of symbol nvme_command_effects
[ 1724.339816] nvmet: Unknown symbol nvme_command_effects (err -22)
[ 1724.339857] nvmet: disagrees about version of symbol nvme_passthru_end
[ 1724.339859] nvmet: Unknown symbol nvme_passthru_end (err -22)
[ 1724.340038] nvmet: disagrees about version of symbol nvme_find_get_ns
[ 1724.340040] nvmet: Unknown symbol nvme_find_get_ns (err -22)
[ 1724.340063] nvmet: Unknown symbol nvme_find_noiob_from_bdev (err -2)
[ 1724.340141] nvmet: disagrees about version of symbol nvme_passthru_start
[ 1724.340144] nvmet: Unknown symbol nvme_passthru_start (err -22)
[ 1724.340219] nvmet: Unknown symbol nvme_find_pdev_from_bdev (err -2)
[ 1724.340254] nvmet: disagrees about version of symbol nvme_ctrl_from_file
[ 1724.340257] nvmet: Unknown symbol nvme_ctrl_from_file (err -22)
[ 1724.340299] nvmet: disagrees about version of symbol nvme_put_ns
[ 1724.340301] nvmet: Unknown symbol nvme_put_ns (err -22)
[ 1724.340345] nvmet: disagrees about version of symbol nvme_get_features
[ 1724.340347] nvmet: Unknown symbol nvme_get_features (err -22)
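
To make logs like this easier to read, here is a small sketch that counts the "disagrees about version of symbol" lines per module. The function name summarize_symbol_errors is hypothetical, and it assumes the dmesg line format shown above.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<module> <count>" for every module that reported symbol version mismatches.
summarize_symbol_errors() {
    awk '/disagrees about version of symbol/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "disagrees") {
                mod = $(i - 1)        # field before "disagrees" is "<module>:"
                sub(/:$/, "", mod)    # drop the trailing colon
                count[mod]++
            }
        }
    }
    END { for (m in count) print m, count[m] }' | LC_ALL=C sort
}
```

It could be fed with `sudo dmesg | summarize_symbol_errors`. Every module it lists was likely built against a different copy of the module that exports those symbols than the one currently loaded.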

I also checked similar threads, but I think in those cases sudo modprobe -r nvme could be run successfully.

Hi

  1. You can reboot the system, and all the drivers will load.
  2. Don’t try to remove the nvme module (modprobe -r nvme); only run modprobe nvme-rdma.

Hi Gilh,

I’m following up on this issue, and thanks again for your tips.

Unfortunately, simply rebooting doesn’t resolve the problem. During boot, I can see in dmesg that systemd tries to load nvme-fabrics, but it fails with the same symbol errors I get when running modprobe manually:

[   11.108976] systemd[1]: Starting modprobe@nvme_fabrics.service - Load Kernel Module nvme_fabrics...
[   11.115970] nvme_fabrics: disagrees about version of symbol __nvme_submit_sync_cmd
[   11.116412] nvme_fabrics: Unknown symbol __nvme_submit_sync_cmd (err -22)
[   11.136658] systemd[1]: modprobe@nvme_fabrics.service: Deactivated successfully.
[   11.137115] systemd[1]: Finished modprobe@nvme_fabrics.service - Load Kernel Module nvme_fabrics.

Additionally, the bnxt_re kernel module also fails to load during boot, though I’m not sure if it’s related:

[   12.762281] bnxt_re: disagrees about version of symbol ib_umem_release
[   12.762287] bnxt_re: Unknown symbol ib_umem_release (err -22)
[   12.762311] bnxt_re: disagrees about version of symbol ibdev_warn
[   12.762312] bnxt_re: Unknown symbol ibdev_warn (err -22)
[   12.762986] bnxt_re: disagrees about version of symbol uverbs_idr_class
[   12.762987] bnxt_re: Unknown symbol uverbs_idr_class (err -22)
[   12.763001] bnxt_re: disagrees about version of symbol rdma_read_gid_l2_fields
[   12.763002] bnxt_re: Unknown symbol rdma_read_gid_l2_fields (err -22)
[   12.763030] bnxt_re: disagrees about version of symbol rdma_read_gid_hw_context
[   12.763032] bnxt_re: Unknown symbol rdma_read_gid_hw_context (err -22)
[   12.763037] bnxt_re: disagrees about version of symbol ib_modify_qp_is_ok
[   12.763038] bnxt_re: Unknown symbol ib_modify_qp_is_ok (err -22)
[   12.763046] bnxt_re: disagrees about version of symbol ib_umem_find_best_pgsz
[   12.763047] bnxt_re: Unknown symbol ib_umem_find_best_pgsz (err -22)
[   12.763055] bnxt_re: disagrees about version of symbol ib_sg_to_pages
[   12.763056] bnxt_re: Unknown symbol ib_sg_to_pages (err -22)
[   12.763074] bnxt_re: disagrees about version of symbol uverbs_finalize_uobj_create
[   12.763075] bnxt_re: Unknown symbol uverbs_finalize_uobj_create (err -22)
[   12.763086] bnxt_re: disagrees about version of symbol _ib_alloc_device
[   12.763087] bnxt_re: Unknown symbol _ib_alloc_device (err -22)
[   12.763774] bnxt_re: disagrees about version of symbol ib_unregister_device
[   12.763776] bnxt_re: Unknown symbol ib_unregister_device (err -22)
[   12.763824] bnxt_re: disagrees about version of symbol rdma_user_mmap_entry_insert
[   12.763826] bnxt_re: Unknown symbol rdma_user_mmap_entry_insert (err -22)
[   12.763836] bnxt_re: disagrees about version of symbol ib_register_device
[   12.763837] bnxt_re: Unknown symbol ib_register_device (err -22)
[   12.763851] bnxt_re: disagrees about version of symbol rdma_user_mmap_entry_get
[   12.763852] bnxt_re: Unknown symbol rdma_user_mmap_entry_get (err -22)
[   12.763859] bnxt_re: disagrees about version of symbol ib_device_get_by_netdev
[   12.763860] bnxt_re: Unknown symbol ib_device_get_by_netdev (err -22)
[   12.763868] bnxt_re: disagrees about version of symbol rdma_user_mmap_entry_remove
[   12.763869] bnxt_re: Unknown symbol rdma_user_mmap_entry_remove (err -22)
[   12.763874] bnxt_re: disagrees about version of symbol _uverbs_get_const_unsigned
[   12.763875] bnxt_re: Unknown symbol _uverbs_get_const_unsigned (err -22)
[   12.763882] bnxt_re: disagrees about version of symbol ib_dispatch_event
[   12.763884] bnxt_re: Unknown symbol ib_dispatch_event (err -22)
[   12.763926] bnxt_re: disagrees about version of symbol ib_device_set_netdev
[   12.763927] bnxt_re: Unknown symbol ib_device_set_netdev (err -22)
[   12.763932] bnxt_re: disagrees about version of symbol ib_umem_get
[   12.763934] bnxt_re: Unknown symbol ib_umem_get (err -22)
[   12.763948] bnxt_re: disagrees about version of symbol ibdev_info
[   12.763949] bnxt_re: Unknown symbol ibdev_info (err -22)
[   12.763963] bnxt_re: disagrees about version of symbol ib_uverbs_get_ucontext_file
[   12.763964] bnxt_re: Unknown symbol ib_uverbs_get_ucontext_file (err -22)
[   12.763987] bnxt_re: disagrees about version of symbol ib_dealloc_device
[   12.763988] bnxt_re: Unknown symbol ib_dealloc_device (err -22)
[   12.763994] bnxt_re: disagrees about version of symbol rdma_user_mmap_io
[   12.763995] bnxt_re: Unknown symbol rdma_user_mmap_io (err -22)
[   12.764005] bnxt_re: disagrees about version of symbol rdma_user_mmap_entry_insert_range
[   12.764006] bnxt_re: Unknown symbol rdma_user_mmap_entry_insert_range (err -22)
[   12.764011] bnxt_re: disagrees about version of symbol ib_umem_dmabuf_get_pinned
[   12.764013] bnxt_re: Unknown symbol ib_umem_dmabuf_get_pinned (err -22)
[   12.764032] bnxt_re: disagrees about version of symbol ibdev_err
[   12.764033] bnxt_re: Unknown symbol ibdev_err (err -22)
[   12.764049] bnxt_re: disagrees about version of symbol uverbs_copy_to
[   12.764051] bnxt_re: Unknown symbol uverbs_copy_to (err -22)
[   12.764059] bnxt_re: disagrees about version of symbol uverbs_destroy_def_handler
[   12.764060] bnxt_re: Unknown symbol uverbs_destroy_def_handler (err -22)
[   12.764067] bnxt_re: disagrees about version of symbol ib_get_eth_speed
[   12.764068] bnxt_re: Unknown symbol ib_get_eth_speed (err -22)
[   12.764080] bnxt_re: disagrees about version of symbol ib_device_put
[   12.764081] bnxt_re: Unknown symbol ib_device_put (err -22)
[   12.764101] bnxt_re: disagrees about version of symbol ib_set_device_ops
[   12.764102] bnxt_re: Unknown symbol ib_set_device_ops (err -22)
[   12.764115] bnxt_re: disagrees about version of symbol rdma_user_mmap_entry_put
[   12.764116] bnxt_re: Unknown symbol rdma_user_mmap_entry_put (err -22)
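
One guess is that bnxt_re is still the stock kernel module while the ib_* symbols it needs now come from modules that OFED replaced. A small sketch to check where a module file lives; module_origin is a hypothetical helper, and treating updates/ and extra/ as out-of-tree locations is a common convention that I am assuming here, not something confirmed for this install.

```shell
# Hypothetical helper: classify a module file path as in-tree or out-of-tree.
# Assumption: distro modules live under .../kernel/, while DKMS or
# vendor-installed modules live under .../updates/ or .../extra/.
module_origin() {
    case "$1" in
        */updates/*|*/extra/*) echo "out-of-tree" ;;
        */kernel/*)            echo "in-tree" ;;
        *)                     echo "unknown" ;;
    esac
}
```

For example, `module_origin "$(modinfo -n bnxt_re)"` and `module_origin "$(modinfo -n ib_core)"`. If the two answers differ, the modules were probably built from different source trees, which would match the version errors above.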

My tentative conclusion is that the mlnxofed installation didn’t complete correctly, which seems to cause symbol version conflicts when the nvme-fabrics kernel module is loaded after a reboot. Could you assist me in resolving this?

As a side note: using the same mlnxofed installation commands, I was able to successfully install and load the NVMe-oF drivers once. I’m not entirely sure why it worked that time, but I suspect it may have been because the firmware was updated first, followed by the mlnxofed installation. Since then, I’ve tried reinstalling (the firmware is not updated again because it is already current), and I keep encountering the same driver issues.

Looking forward to your guidance!

I was able to resolve the issue by referring to this thread: OFED NVMET and NVMET-RDMA on Ubuntu symbol errors

Additionally, I used the following commands to install the driver and rebuild the initramfs image:

sudo "$MOUNT_POINT"/mlnxofedinstall -vvv \
	--with-nvmf --add-kernel-support --force --force-fw-update # --with-nvmf also installs the NVMe-oF modules
sudo /etc/init.d/openibd restart # load the new driver; some modules may need to be removed with rmmod first
sudo update-initramfs -u # rebuild the initramfs after the Mellanox modules are installed
sudo reboot
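
As a sanity check before the reboot, something like the following sketch can list the nvme modules that ended up in the rebuilt initramfs. The helper name initramfs_nvme_modules is made up; lsinitramfs comes with initramfs-tools on Ubuntu, and the initrd path in the usage note is the usual Ubuntu location, which I am assuming here.

```shell
# Hypothetical helper: filter an initramfs file listing (stdin) down to
# nvme-related kernel modules, e.g. nvme.ko.zst, nvme-core.ko.zst, nvmet.ko.
initramfs_nvme_modules() {
    grep -E '/nvme[^/]*\.ko(\.[a-z0-9]+)?$'
}
```

Usage would be `lsinitramfs "/boot/initrd.img-$(uname -r)" | initramfs_nvme_modules`. My understanding is that the nvme module for the root disk is loaded from the initramfs at boot, so the paths listed there should point at the newly installed modules rather than the stock ones.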

I hope this helps others facing a similar issue! :)
