How to Compile NVMe Modules with DOCA Installation

Hi,

I’m running into an issue related to NVMe-over-RDMA after installing DOCA. Previously, I installed MLNX_OFED manually using:

sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support --dkms

This method allowed me to include extra flags like --with-nvmf, which compiled the NVMe-related modules from MLNX_OFED sources. These were placed in:

/lib/modules/5.15.134/updates/dkms/

When I ran modprobe nvme-rdma, modprobe picked up the modules from this directory first, and they were compatible with the rest of MLNX_OFED (e.g., ib_core).

However, I learned from this doc that

MLNX_OFED has transitioned into DOCA-Host, and now available as DOCA-OFED (learn about DOCA-Host profiles here).

MLNX_OFED last standalone release is October 2024 Long Term Support (3 years). Starting January 2025 all new features will be included in DOCA-OFED only.

So I decided to install DOCA instead. I installed it via apt and checked the resulting OFED version:

sudo apt install doca-all
sudo apt install doca
~$ ofed_info -s
OFED-internal-24.04-0.7.0:

I am not sure, but it seems the NVMe-related modules (i.e., nvme.ko, nvme-rdma.ko) are not built during this process. So when I run modprobe nvme-rdma, it loads the default kernel modules from:
/lib/modules/5.15.134/kernel/drivers/nvme/host/
These are built as part of the Linux kernel and are not compatible with the rest of MLNX_OFED, which leads to symbol version mismatches:

~$ sudo modprobe nvme-rdma
modprobe: ERROR: could not insert 'nvme_rdma': Invalid argument
[ 6999.636085] nvme_rdma: disagrees about version of symbol ib_mr_pool_destroy
[ 6999.636088] nvme_rdma: Unknown symbol ib_mr_pool_destroy (err -22)
[ 6999.636095] nvme_rdma: disagrees about version of symbol ib_unregister_client
[ 6999.636096] nvme_rdma: Unknown symbol ib_unregister_client (err -22)
[ 6999.636114] nvme_rdma: disagrees about version of symbol rdma_reject_msg
[ 6999.636115] nvme_rdma: Unknown symbol rdma_reject_msg (err -22)
[ 6999.636142] nvme_rdma: disagrees about version of symbol __ib_alloc_pd
[ 6999.636144] nvme_rdma: Unknown symbol __ib_alloc_pd (err -22)
[ 6999.636175] nvme_rdma: disagrees about version of symbol rdma_resolve_addr
[ 6999.636176] nvme_rdma: Unknown symbol rdma_resolve_addr (err -22)
[ 6999.636182] nvme_rdma: disagrees about version of symbol rdma_set_service_type
[ 6999.636183] nvme_rdma: Unknown symbol rdma_set_service_type (err -22)
[ 6999.636187] nvme_rdma: disagrees about version of symbol ib_map_mr_sg_pi
[ 6999.636188] nvme_rdma: Unknown symbol ib_map_mr_sg_pi (err -22)
[ 6999.636195] nvme_rdma: disagrees about version of symbol ib_mr_pool_init
[ 6999.636195] nvme_rdma: Unknown symbol ib_mr_pool_init (err -22)
[ 6999.636199] nvme_rdma: disagrees about version of symbol ib_process_cq_direct
[ 6999.636200] nvme_rdma: Unknown symbol ib_process_cq_direct (err -22)
[ 6999.636212] nvme_rdma: disagrees about version of symbol ib_event_msg
[ 6999.636212] nvme_rdma: Unknown symbol ib_event_msg (err -22)
[ 6999.636220] nvme_rdma: disagrees about version of symbol rdma_disconnect
[ 6999.636220] nvme_rdma: Unknown symbol rdma_disconnect (err -22)
[ 6999.636240] nvme_rdma: disagrees about version of symbol __rdma_create_kernel_id
[ 6999.636240] nvme_rdma: Unknown symbol __rdma_create_kernel_id (err -22)
[ 6999.636253] nvme_rdma: disagrees about version of symbol rdma_resolve_route
[ 6999.636254] nvme_rdma: Unknown symbol rdma_resolve_route (err -22)
[ 6999.636260] nvme_rdma: disagrees about version of symbol ib_register_client
[ 6999.636261] nvme_rdma: Unknown symbol ib_register_client (err -22)
[ 6999.636266] nvme_rdma: disagrees about version of symbol rdma_create_qp
[ 6999.636267] nvme_rdma: Unknown symbol rdma_create_qp (err -22)
[ 6999.636272] nvme_rdma: disagrees about version of symbol ib_map_mr_sg
[ 6999.636273] nvme_rdma: Unknown symbol ib_map_mr_sg (err -22)
[ 6999.636278] nvme_rdma: disagrees about version of symbol ib_cq_pool_put
[ 6999.636279] nvme_rdma: Unknown symbol ib_cq_pool_put (err -22)
[ 6999.636283] nvme_rdma: disagrees about version of symbol __ib_alloc_cq
[ 6999.636284] nvme_rdma: Unknown symbol __ib_alloc_cq (err -22)
[ 6999.636290] nvme_rdma: disagrees about version of symbol rdma_destroy_qp
[ 6999.636290] nvme_rdma: Unknown symbol rdma_destroy_qp (err -22)
[ 6999.636292] nvme_rdma: disagrees about version of symbol ib_check_mr_status
[ 6999.636293] nvme_rdma: Unknown symbol ib_check_mr_status (err -22)
[ 6999.636304] nvme_rdma: disagrees about version of symbol ib_destroy_qp_user
[ 6999.636305] nvme_rdma: Unknown symbol ib_destroy_qp_user (err -22)
[ 6999.636311] nvme_rdma: disagrees about version of symbol ib_cq_pool_get
[ 6999.636312] nvme_rdma: Unknown symbol ib_cq_pool_get (err -22)
[ 6999.636314] nvme_rdma: disagrees about version of symbol rdma_connect_locked
[ 6999.636314] nvme_rdma: Unknown symbol rdma_connect_locked (err -22)
[ 6999.636317] nvme_rdma: disagrees about version of symbol ib_wc_status_msg
[ 6999.636318] nvme_rdma: Unknown symbol ib_wc_status_msg (err -22)
[ 6999.636324] nvme_rdma: disagrees about version of symbol ib_dma_virt_map_sg
[ 6999.636325] nvme_rdma: Unknown symbol ib_dma_virt_map_sg (err -22)
[ 6999.636328] nvme_rdma: disagrees about version of symbol ib_free_cq
[ 6999.636328] nvme_rdma: Unknown symbol ib_free_cq (err -22)
[ 6999.636331] nvme_rdma: disagrees about version of symbol rdma_destroy_id
[ 6999.636332] nvme_rdma: Unknown symbol rdma_destroy_id (err -22)
[ 6999.636354] nvme_rdma: disagrees about version of symbol ib_mr_pool_get
[ 6999.636355] nvme_rdma: Unknown symbol ib_mr_pool_get (err -22)
[ 6999.636358] nvme_rdma: disagrees about version of symbol ib_mr_pool_put
[ 6999.636359] nvme_rdma: Unknown symbol ib_mr_pool_put (err -22)
[ 6999.636372] nvme_rdma: disagrees about version of symbol ib_drain_qp
[ 6999.636373] nvme_rdma: Unknown symbol ib_drain_qp (err -22)
[ 6999.636376] nvme_rdma: disagrees about version of symbol ib_dealloc_pd_user
[ 6999.636376] nvme_rdma: Unknown symbol ib_dealloc_pd_user (err -22)
[ 6999.636379] nvme_rdma: disagrees about version of symbol rdma_consumer_reject_data
[ 6999.636380] nvme_rdma: Unknown symbol rdma_consumer_reject_data (err -22)

How can I install DOCA so that it replicates the effect of my earlier manual MLNX_OFED installation with custom flags (including support for GDS)? Is there a way to make the DOCA installer also compile and install MLNX_OFED’s version of the NVMe-related kernel modules? Since I’ve already installed DOCA using apt, do I now need to manually compile the NVMe-related modules to ensure compatibility with the rest of MLNX_OFED? If so, what’s the recommended way to do that?

Hi

Thank you for reaching out. This is a known configuration challenge when transitioning from MLNX_OFED to DOCA-OFED. Let’s resolve this step-by-step:


Issue Summary

After installing DOCA via apt, the default kernel NVMe modules (nvme-rdma.ko) are loaded instead of DOCA/OFED-compatible versions, causing symbol mismatches (e.g., ib_mr_pool_destroy errors).
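To see at a glance which symbols are affected, the kernel log can be reduced to a unique list. This is only a rough sketch: list_mismatched_symbols is a hypothetical helper name, and the two sample lines are copied from the log in the question.

```shell
#!/bin/sh
# list_mismatched_symbols (hypothetical helper): reads kernel log text on
# stdin and prints each symbol named in a "disagrees about version" line once.
list_mismatched_symbols() {
    sed -n 's/.*disagrees about version of symbol \([A-Za-z0-9_]*\).*/\1/p' |
        LC_ALL=C sort -u
}

# Demo on two lines taken from the log in the question.
list_mismatched_symbols <<'EOF'
[ 6999.636085] nvme_rdma: disagrees about version of symbol ib_mr_pool_destroy
[ 6999.636088] nvme_rdma: Unknown symbol ib_mr_pool_destroy (err -22)
EOF

# On the affected host: sudo dmesg | list_mismatched_symbols
```

If every listed symbol comes from the RDMA core (ib_* or rdma_*), that points to nvme_rdma having been built against a different ib_core than the one currently installed.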


Root Cause

DOCA’s default installation does not automatically rebuild kernel modules like nvme-rdma for your specific kernel. The mismatch arises because:

  • MLNX_OFED previously compiled modules in /lib/modules/$(uname -r)/updates/dkms/.
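  • With no DOCA-built NVMe modules installed, modprobe falls back to the kernel’s in-tree modules in /lib/modules/$(uname -r)/kernel/drivers/nvme/host/, which were not built against the DOCA-OFED ib_core.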
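A quick way to check which copy of each module modprobe would pick is to ask modinfo for the file path. The following is a minimal sketch, not an official tool: classify_module_path and report_modules are hypothetical helper names, and the module list is just the names mentioned in this thread.

```shell
#!/bin/sh
# classify_module_path (hypothetical helper): label a module file by the
# directory it lives in under /lib/modules/<kernel>/.
classify_module_path() {
    case "$1" in
        */updates/*) echo "out-of-tree (updates)" ;;
        */kernel/*)  echo "in-tree (kernel default)" ;;
        *)           echo "unknown" ;;
    esac
}

# report_modules (hypothetical helper): print where each named module
# resolves. modinfo -n prints the filename for a module name.
report_modules() {
    for mod in "$@"; do
        path=$(modinfo -n "$mod" 2>/dev/null) || { echo "$mod: not found"; continue; }
        echo "$mod: $(classify_module_path "$path") $path"
    done
}

report_modules nvme nvme-rdma ib_core
```

If ib_core resolves under updates/ while nvme-rdma resolves under kernel/, the two come from different builds, which matches the symbol errors in the question.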

Solution

To replicate MLNX_OFED’s --with-nvmf --enable-gds behavior in DOCA:

1. Install DOCA Extras and Kernel Support

sudo apt install -y doca-extra  # Includes kernel rebuild tools  
sudo /opt/mellanox/doca/tools/doca-kernel-support  # Rebuild modules for your kernel  

This generates a doca-kernel-repo*.deb package.

2. Install Rebuilt Modules

sudo dpkg -i /path/to/doca-kernel-repo*.deb  
sudo apt update  

3. Install NVMe-over-RDMA and GDS Packages

sudo apt install mlnx-nvme-dkms mlnx-nfsrdma-dkms  # DOCA-compatible NVMe modules  
sudo apt install doca-gds  # Enable GPU Direct Storage  

4. Verify Module Compatibility

modinfo nvme-rdma | grep "vermagic"  # Ensure version matches your kernel
sudo modprobe nvme-rdma  # Load the module; this should complete without symbol errors
lsmod | grep nvme_rdma  # Confirm the module is loaded
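The vermagic comparison can also be scripted. This is a minimal sketch: vermagic_matches_kernel is a hypothetical helper name, and it relies only on the kernel release being the first word of the vermagic string.

```shell
#!/bin/sh
# vermagic_matches_kernel (hypothetical helper): succeed when the first word
# of a vermagic string ($1) equals the given kernel release ($2).
vermagic_matches_kernel() {
    [ "${1%% *}" = "$2" ]
}

# modinfo -F prints a single field of the module's metadata.
vm=$(modinfo -F vermagic nvme-rdma 2>/dev/null) || vm=""
if vermagic_matches_kernel "$vm" "$(uname -r)"; then
    echo "nvme-rdma vermagic matches the running kernel"
else
    echo "nvme-rdma vermagic missing or built for another kernel: '$vm'"
fi
```

Note that a matching vermagic is necessary but not sufficient here. The in-tree nvme-rdma is also built for the running kernel, yet it fails with the symbol version errors shown in the question, so it is worth also checking with modinfo -n nvme-rdma that the file no longer resolves to kernel/drivers/nvme/host/.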

Additional Notes

  • Why This Happens: DOCA-OFED (post-Jan 2025) replaces standalone MLNX_OFED. The doca-kernel-support tool ensures module compatibility.
  • Profile Recommendation: Use doca-all for full NVMe/RDMA/GDS support.
  • Still Stuck? Manually compile modules from DOCA’s GitHub using --with-nvmf flags if needed.

For more details, refer to:

Best regards,
Ilan