How to Compile NVMe Modules with DOCA Installation

Hi,

I’m running into an issue related to NVMe-over-RDMA after installing DOCA. Previously, I installed MLNX_OFED manually using:

sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support --dkms

This method allowed me to include extra flags like --with-nvmf, which compiled the NVMe-related modules from MLNX_OFED sources. These were placed in:

/lib/modules/5.15.134/updates/dkms/

When I ran modprobe nvme-rdma, it would prioritize the modules from this directory, which were compatible with the rest of MLNX_OFED (e.g., ib_core).
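
That resolution can be confirmed from the shell; here is a guarded sketch (safe to run even on a machine where the module is not installed):

```shell
# Show which nvme-rdma binary modprobe would resolve; with the
# MLNX_OFED DKMS build in place this should point under
# /lib/modules/$(uname -r)/updates/dkms/. Guarded so the snippet
# degrades gracefully on machines without the module.
if command -v modinfo >/dev/null 2>&1; then
    resolved=$(modinfo -n nvme-rdma 2>/dev/null || echo "nvme-rdma: module not found")
else
    resolved="modinfo not available on this machine"
fi
echo "$resolved"
```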

However, I learned from this doc that

MLNX_OFED has transitioned into DOCA-Host, and now available as DOCA-OFED (learn about DOCA-Host profiles here).

MLNX_OFED last standalone release is October 2024 Long Term Support (3 years). Starting January 2025 all new features will be included in DOCA-OFED only.

So I decided to install DOCA. When I install it via:

sudo apt install doca-all
sudo apt install doca

ofed_info reports:

~$ ofed_info -s
OFED-internal-24.04-0.7.0:

I am not sure, but it seems the NVMe-related modules (i.e., nvme.ko, nvme-rdma.ko) are not built during this process. So when I run modprobe nvme-rdma, it loads the default kernel modules from:
/lib/modules/5.15.134/kernel/drivers/nvme/host/
These are built with the Linux kernel and are not compatible with the rest of MLNX_OFED, leading to symbol version and dependency mismatches.

~$ sudo modprobe nvme-rdma
modprobe: ERROR: could not insert 'nvme_rdma': Invalid argument
[ 6999.636085] nvme_rdma: disagrees about version of symbol ib_mr_pool_destroy
[ 6999.636088] nvme_rdma: Unknown symbol ib_mr_pool_destroy (err -22)
[ 6999.636095] nvme_rdma: disagrees about version of symbol ib_unregister_client
[ 6999.636096] nvme_rdma: Unknown symbol ib_unregister_client (err -22)
[ 6999.636114] nvme_rdma: disagrees about version of symbol rdma_reject_msg
[ 6999.636115] nvme_rdma: Unknown symbol rdma_reject_msg (err -22)
[ 6999.636142] nvme_rdma: disagrees about version of symbol __ib_alloc_pd
[ 6999.636144] nvme_rdma: Unknown symbol __ib_alloc_pd (err -22)
[ 6999.636175] nvme_rdma: disagrees about version of symbol rdma_resolve_addr
[ 6999.636176] nvme_rdma: Unknown symbol rdma_resolve_addr (err -22)
[ 6999.636182] nvme_rdma: disagrees about version of symbol rdma_set_service_type
[ 6999.636183] nvme_rdma: Unknown symbol rdma_set_service_type (err -22)
[ 6999.636187] nvme_rdma: disagrees about version of symbol ib_map_mr_sg_pi
[ 6999.636188] nvme_rdma: Unknown symbol ib_map_mr_sg_pi (err -22)
[ 6999.636195] nvme_rdma: disagrees about version of symbol ib_mr_pool_init
[ 6999.636195] nvme_rdma: Unknown symbol ib_mr_pool_init (err -22)
[ 6999.636199] nvme_rdma: disagrees about version of symbol ib_process_cq_direct
[ 6999.636200] nvme_rdma: Unknown symbol ib_process_cq_direct (err -22)
[ 6999.636212] nvme_rdma: disagrees about version of symbol ib_event_msg
[ 6999.636212] nvme_rdma: Unknown symbol ib_event_msg (err -22)
[ 6999.636220] nvme_rdma: disagrees about version of symbol rdma_disconnect
[ 6999.636220] nvme_rdma: Unknown symbol rdma_disconnect (err -22)
[ 6999.636240] nvme_rdma: disagrees about version of symbol __rdma_create_kernel_id
[ 6999.636240] nvme_rdma: Unknown symbol __rdma_create_kernel_id (err -22)
[ 6999.636253] nvme_rdma: disagrees about version of symbol rdma_resolve_route
[ 6999.636254] nvme_rdma: Unknown symbol rdma_resolve_route (err -22)
[ 6999.636260] nvme_rdma: disagrees about version of symbol ib_register_client
[ 6999.636261] nvme_rdma: Unknown symbol ib_register_client (err -22)
[ 6999.636266] nvme_rdma: disagrees about version of symbol rdma_create_qp
[ 6999.636267] nvme_rdma: Unknown symbol rdma_create_qp (err -22)
[ 6999.636272] nvme_rdma: disagrees about version of symbol ib_map_mr_sg
[ 6999.636273] nvme_rdma: Unknown symbol ib_map_mr_sg (err -22)
[ 6999.636278] nvme_rdma: disagrees about version of symbol ib_cq_pool_put
[ 6999.636279] nvme_rdma: Unknown symbol ib_cq_pool_put (err -22)
[ 6999.636283] nvme_rdma: disagrees about version of symbol __ib_alloc_cq
[ 6999.636284] nvme_rdma: Unknown symbol __ib_alloc_cq (err -22)
[ 6999.636290] nvme_rdma: disagrees about version of symbol rdma_destroy_qp
[ 6999.636290] nvme_rdma: Unknown symbol rdma_destroy_qp (err -22)
[ 6999.636292] nvme_rdma: disagrees about version of symbol ib_check_mr_status
[ 6999.636293] nvme_rdma: Unknown symbol ib_check_mr_status (err -22)
[ 6999.636304] nvme_rdma: disagrees about version of symbol ib_destroy_qp_user
[ 6999.636305] nvme_rdma: Unknown symbol ib_destroy_qp_user (err -22)
[ 6999.636311] nvme_rdma: disagrees about version of symbol ib_cq_pool_get
[ 6999.636312] nvme_rdma: Unknown symbol ib_cq_pool_get (err -22)
[ 6999.636314] nvme_rdma: disagrees about version of symbol rdma_connect_locked
[ 6999.636314] nvme_rdma: Unknown symbol rdma_connect_locked (err -22)
[ 6999.636317] nvme_rdma: disagrees about version of symbol ib_wc_status_msg
[ 6999.636318] nvme_rdma: Unknown symbol ib_wc_status_msg (err -22)
[ 6999.636324] nvme_rdma: disagrees about version of symbol ib_dma_virt_map_sg
[ 6999.636325] nvme_rdma: Unknown symbol ib_dma_virt_map_sg (err -22)
[ 6999.636328] nvme_rdma: disagrees about version of symbol ib_free_cq
[ 6999.636328] nvme_rdma: Unknown symbol ib_free_cq (err -22)
[ 6999.636331] nvme_rdma: disagrees about version of symbol rdma_destroy_id
[ 6999.636332] nvme_rdma: Unknown symbol rdma_destroy_id (err -22)
[ 6999.636354] nvme_rdma: disagrees about version of symbol ib_mr_pool_get
[ 6999.636355] nvme_rdma: Unknown symbol ib_mr_pool_get (err -22)
[ 6999.636358] nvme_rdma: disagrees about version of symbol ib_mr_pool_put
[ 6999.636359] nvme_rdma: Unknown symbol ib_mr_pool_put (err -22)
[ 6999.636372] nvme_rdma: disagrees about version of symbol ib_drain_qp
[ 6999.636373] nvme_rdma: Unknown symbol ib_drain_qp (err -22)
[ 6999.636376] nvme_rdma: disagrees about version of symbol ib_dealloc_pd_user
[ 6999.636376] nvme_rdma: Unknown symbol ib_dealloc_pd_user (err -22)
[ 6999.636379] nvme_rdma: disagrees about version of symbol rdma_consumer_reject_data
[ 6999.636380] nvme_rdma: Unknown symbol rdma_consumer_reject_data (err -22)
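
For context on what those dmesg lines mean: with CONFIG_MODVERSIONS enabled, every symbol a module imports carries a CRC that must match the CRC recorded by the exporting module's build (here, ib_core). The following self-contained mock illustrates the check with fabricated CRC values:

```shell
# Mock two Module.symvers entries for the same symbol: one from the
# distro kernel tree, one from the OFED/DOCA tree. The CRCs differ,
# which is exactly what "disagrees about version of symbol" reports.
tmp=$(mktemp -d)
printf '0x1a2b3c4d\tib_mr_pool_destroy\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\n' > "$tmp/symvers.kernel"
printf '0x9f8e7d6c\tib_mr_pool_destroy\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\n' > "$tmp/symvers.ofed"
crc_kernel=$(awk '$2 == "ib_mr_pool_destroy" {print $1}' "$tmp/symvers.kernel")
crc_ofed=$(awk '$2 == "ib_mr_pool_destroy" {print $1}' "$tmp/symvers.ofed")
if [ "$crc_kernel" = "$crc_ofed" ]; then
    result="CRCs match: module would load"
else
    result="CRC mismatch ($crc_kernel vs $crc_ofed): insmod fails with EINVAL"
fi
echo "$result"
```

This is why every symbol in the log fails with err -22 (EINVAL): nvme_rdma.ko was built against the in-tree ib_core, while the loaded ib_core comes from the OFED/DOCA build.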

How can I install DOCA in a way that replicates the effect of manually installing MLNX_OFED with custom flags, as I did before (including support for GDS)? Is there a way to make the DOCA installer also compile and install MLNX_OFED’s version of the NVMe-related kernel modules? Since I’ve already installed DOCA using apt, do I now need to compile the NVMe-related modules manually to ensure compatibility with the rest of MLNX_OFED? If so, what’s the recommended way to do that?


Hi

Thank you for reaching out. This is a known configuration challenge when transitioning from MLNX_OFED to DOCA-OFED. Let’s resolve this step-by-step:


Issue Summary

After installing DOCA via apt, the default kernel NVMe modules (nvme-rdma.ko) are loaded instead of DOCA/OFED-compatible versions, causing symbol mismatches (e.g., ib_mr_pool_destroy errors).


Root Cause

DOCA’s default installation does not automatically rebuild kernel modules like nvme-rdma for your specific kernel. The mismatch arises because:

  • MLNX_OFED previously compiled modules in /lib/modules/$(uname -r)/updates/dkms/.
  • DOCA uses the kernel’s default modules unless explicitly rebuilt.
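
The precedence between the two module locations comes from depmod's search order (on Debian/Ubuntu, see the search line in /etc/depmod.d/), where updates/ is listed before the kernel's own directory. A small mock of that first-match-wins behavior, using a throwaway directory tree:

```shell
# Emulate depmod's "search updates ... built-in" precedence: the copy
# under updates/dkms shadows the in-tree copy under kernel/.
base=$(mktemp -d)
mkdir -p "$base/updates/dkms" "$base/kernel/drivers/nvme/host"
touch "$base/updates/dkms/nvme-rdma.ko" "$base/kernel/drivers/nvme/host/nvme-rdma.ko"
chosen=""
for d in updates kernel; do   # first match wins, like depmod's search order
    hit=$(find "$base/$d" -name 'nvme-rdma.ko' 2>/dev/null | head -n 1)
    if [ -n "$hit" ]; then chosen=$hit; break; fi
done
echo "would load: $chosen"
```

So once an OFED-built nvme-rdma.ko lands in updates/dkms/, modprobe picks it up automatically; without it, the in-tree copy wins by default.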

Solution

To replicate MLNX_OFED’s --with-nvmf --enable-gds behavior in DOCA:

1. Install DOCA Extras and Kernel Support

sudo apt install -y doca-extra  # Includes kernel rebuild tools  
sudo /opt/mellanox/doca/tools/doca-kernel-support  # Rebuild modules for your kernel  

This generates a doca-kernel-repo*.deb package.

2. Install Rebuilt Modules

sudo dpkg -i /path/to/doca-kernel-repo*.deb  
sudo apt update  

3. Install NVMe-over-RDMA and GDS Packages

sudo apt install mlnx-nvme-dkms mlnx-nfsrdma-dkms  # DOCA-compatible NVMe modules  
sudo apt install doca-gds  # Enable GPU Direct Storage  

4. Verify Module Compatibility

modinfo nvme-rdma | grep "vermagic"  # Ensure version matches your kernel  
lsmod | grep nvme_rdma  # Confirm module loads without errors  
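
As a complementary check (guarded, safe to run anywhere): nvme_rdma and the ib_core it links against should both resolve from the same updates/dkms tree. If one resolves under kernel/ and the other under updates/, the symbol mismatch will return.

```shell
# Print where each module resolves from; after the DKMS install both
# paths should sit under /lib/modules/$(uname -r)/updates/dkms/.
report=""
for m in nvme_rdma ib_core; do
    if command -v modinfo >/dev/null 2>&1 && path=$(modinfo -F filename "$m" 2>/dev/null); then
        report="$report$m -> $path\n"
    else
        report="$report$m -> (not installed on this machine)\n"
    fi
done
printf '%b' "$report"
```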

Additional Notes

  • Why This Happens: DOCA-OFED (post-Jan 2025) replaces standalone MLNX_OFED. The doca-kernel-support tool ensures module compatibility.
  • Profile Recommendation: Use doca-all for full NVMe/RDMA/GDS support.
  • Still Stuck? Manually compile modules from DOCA’s GitHub using --with-nvmf flags if needed.

For more details, refer to:

Best regards,
Ilan

Hey @ipavis, apt says doca-gds was not found, and even the DOCA GitHub link you posted is unavailable. Can you let me know what to use instead?

Hi,

We are also looking into a similar issue with the NVIDIA Network Operator.

We have deployed Network Operator 25.7.0 with ENABLE_NFSRDMA set to true in the NICClusterPolicy expecting this would enable GPUDirect Storage for both NVMe and NFS.

However, it seems the DOCA driver installation no longer patches the NVMe drivers, leaving us with GDS support for NFS only.

Is this a known issue related to the change discussed above, or is there a workaround to fix it in a bare-metal K8s environment?
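
Not an answer, but a way to narrow it down on a worker node: check whether the driver container installed any OFED-built NVMe modules alongside the in-tree ones. The paths come from earlier in this thread; this is a diagnostic sketch, not an official procedure.

```shell
# Count nvme-related modules in the DKMS/updates tree vs the in-tree
# location; on an affected node the first directory has no nvme
# modules, matching "DOCA no longer patches the NVMe drivers".
summary=""
for d in "/lib/modules/$(uname -r)/updates/dkms" \
         "/lib/modules/$(uname -r)/kernel/drivers/nvme/host"; do
    found=$(ls "$d" 2>/dev/null | grep -ci nvme)
    summary="$summary$d: $found nvme module(s)\n"
done
printf '%b' "$summary"
```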