Kernel Modules from Mellanox OFED Stack Won't Load

I am running a freshly installed RHEL 6.4 server with kernel 2.6.32-358.11.1. I have installed the Mellanox OFED stack by downloading, running mlnx_add_kernel_support.sh -m ./ --make-tgz, and running mlnxofedinstall from the newly created .tgz. The resulting drivers will not load properly. Here is what I have so far:

[root@mdarisnfs01 tmp]# hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-2.0-2.0.5 (OFED-2.0-2.0.5): 2.6.32-358.11.1.el6.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 HCA … v2.7.0

Firmware Check on CA #0 (HCA) … NA

REASON: NO required fw version

Host Driver Initialization … FAIL

Number of CA Ports Active … NA

Error Counter Check … NA

Kernel Syslog Check … NA

Node GUID on CA #0 (HCA) … NA

------------------ DONE ---------------------

[root@mdarisnfs01 tmp]# cat hca_self_test_modprobe.output

WARNING: Error inserting ib_core (/lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko): Invalid module format

WARNING: Error inserting ib_mad (/lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_mad.ko): Invalid module format

WARNING: Error inserting ib_sa (/lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_sa.ko): Invalid module format

WARNING: Error inserting ib_cm (/lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_cm.ko): Invalid module format

FATAL: Error inserting ib_ipoib (/lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Invalid module format

[root@mdarisnfs01 tmp]# dmesg | tail -n 5

compat: exports duplicate symbol __pskb_copy (owned by kernel)

compat: exports duplicate symbol __pskb_copy (owned by kernel)

compat: exports duplicate symbol __pskb_copy (owned by kernel)

compat: exports duplicate symbol __pskb_copy (owned by kernel)

compat: exports duplicate symbol __pskb_copy (owned by kernel)

[root@mdarisnfs01 tmp]# uname -a

Linux mdarisnfs01.mdanderson.org 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@mdarisnfs01 tmp]# rpm -qa | grep -i mlnx

opensm-devel-4.0.0.MLNX20130311.156f5c0-0.1.x86_64

librdmacm-devel-1.0.17mlnx1-OFED.2.0.0.1.4.20130226.1156.g0c5d582.x86_64

libmlx4-1.0.4mlnx1-OFED.2.0.0.1.8.20130311.1052.g57dd6ea.x86_64

libibverbs-devel-static-1.1.6mlnx1-OFED.2.0.0.1.8.20130311.0904.g90c09c6.x86_64

libibmad-devel-1.3.9.MLNX20130311.0cae028-0.1.x86_64

libibumad-devel-1.3.8.MLNX20130311.0a67c01-0.1.x86_64

librdmacm-utils-1.0.17mlnx1-OFED.2.0.0.1.4.20130226.1156.g0c5d582.x86_64

infiniband-diags-1.6.1.MLNX20130311.21d799f-0.1.x86_64

libibcm-1.0.5mlnx1-OFED.2.0.0.0.9.20130210.1800.gc8011c5.x86_64

opensm-static-4.0.0.MLNX20130311.156f5c0-0.1.x86_64

libibverbs-devel-1.1.6mlnx1-OFED.2.0.0.1.8.20130311.0904.g90c09c6.x86_64

libibmad-1.3.9.MLNX20130311.0cae028-0.1.x86_64

opensm-libs-4.0.0.MLNX20130311.156f5c0-0.1.x86_64

libmlx4-devel-1.0.4mlnx1-OFED.2.0.0.1.8.20130311.1052.g57dd6ea.x86_64

libibumad-1.3.8.MLNX20130311.0a67c01-0.1.x86_64

librdmacm-1.0.17mlnx1-OFED.2.0.0.1.4.20130226.1156.g0c5d582.x86_64

srptools-0.0.4mlnx3-OFED.2.0.0.2.6.20130407.1400.g028ed29.x86_64

libibverbs-utils-1.1.6mlnx1-OFED.2.0.0.1.8.20130311.0904.g90c09c6.x86_64

libibmad-static-1.3.9.MLNX20130311.0cae028-0.1.x86_64

libibverbs-1.1.6mlnx1-OFED.2.0.0.1.8.20130311.0904.g90c09c6.x86_64

libibumad-static-1.3.8.MLNX20130311.0a67c01-0.1.x86_64

infiniband-diags-compat-1.6.1.MLNX20130311.21d799f-0.1.x86_64

libibcm-devel-1.0.5mlnx1-OFED.2.0.0.0.9.20130210.1800.gc8011c5.x86_64

opensm-4.0.0.MLNX20130311.156f5c0-0.1.x86_64

mlnxofed-docs-2.0-2.0.5.noarch

[root@mdarisnfs01 tmp]# rpm -qa | grep kernel

kernel-ib-2.0-2.6.32_358.11.1.el6.x86_64_OFED.2.0.2.0.5.g1593535.x86_64

libreport-plugin-kerneloops-2.0.9-15.el6.x86_64

kernel-headers-2.6.32-358.11.1.el6.x86_64

kernel-ib-devel-2.0-2.6.32_358.11.1.el6.x86_64_OFED.2.0.2.0.5.g1593535.x86_64

abrt-addon-kerneloops-2.0.8-15.el6.x86_64

kernel-devel-2.6.32-358.11.1.el6.x86_64

kernel-firmware-2.6.32-358.11.1.el6.noarch

dracut-kernel-004-303.el6.noarch

kernel-mft-3.0.0-2.6.32_358.11.1.el6.x86_64.x86_64

kernel-2.6.32-358.11.1.el6.x86_64

I am at a bit of a loss, and any help would be appreciated.

Not positive that this will solve your issue immediately, but you are running verrrrrrrrry old FW (2.7).

Try upgrading the FW first, see if that helps.

We had a similar issue here:

We are informing the teams. Stand by

Thanks. I was able to repeat this successfully.

I’m in the same situation with 2.6.32-358.11.1.el6_lustre.x86_64 for Lustre 2.1.6. Any ETA on when a workaround will be ready?

Yes, I have the same problem! My kernel version is: 2.6.32-358.11.1.el6.x86_64 and I am trying to install MLNX_OFED_LINUX-2.0-2.0.5-rhel6.4-x86_64.iso.

After running ./mlnx_add_kernel_support.sh -m /mnt and then ./mlnxofedinstall, I cannot use IB.

A duplicate of this one: Infrastructure & Networking - NVIDIA Developer Forums

Can we somehow notify Mellanox about the issue?

The problem is that the 2.0-2.0.5 installer for RHEL 6.4 is unusable on the slightly updated kernel 2.6.32-358.11.1. One could just downgrade the kernel to 2.6.32-358, but this won’t always work. My installation depends on the kernel version required by the Lustre 2.1.6 RPM package, which is 2.6.32-358, and I can’t easily downgrade.

I have the new IB cards in a brand new HP server, I don’t think that this is an old FW issue.

Just as an FYI, the infiniband drivers that come with RHEL/CentOS 6.4 should work perfectly with your cards:

$ sudo yum groupinstall "Infiniband Support"

Apparently they’re not as optimised for some things as the Mellanox OFED stack (I don’t know the details), but for just general day to day stuff they should be fine.

Note, I think you’ll need to uninstall the Mellanox OFED stack first, before installing the built in ones.

Hope that helps.

Bump

Mellanox OFED 2.0-2.0.5 with kernel 2.6.32-358.11.1 (especially for Lustre 2.1.6)

compat: exports duplicate symbol __pskb_copy (owned by kernel)

Any update on this?

You folks using Lustre 2.1.6 for version 2.6.32-358.11.1.el6 and 2.6.32-358.11.1.el6_lustre — I guess you must be using the “stock” RHEL6.4 OFED v1.5.4. How is that working out?

Solution: I have found a workaround by rebuilding compat.ko and commenting out the export of symbol __pskb_copy.

The module now loads. I have not done stress testing with the rest of the MLNX OFED stack yet.

Original error:

compat: exports duplicate symbol __pskb_copy (owned by kernel)

After workaround:

Compat-mlnx-ofed backport release: gcecc987

Backport based on git://beany.openfabrics.org/compat-rdma/compat.git 3d70f8c

compat.git: git://beany.openfabrics.org/compat-rdma/compat.git

HOWTO: comment out the line "EXPORT_SYMBOL_GPL(__pskb_copy);" in ofa_kernel-2.0/compat/compat-3.3.c

thusly: /* EXPORT_SYMBOL_GPL(__pskb_copy); */

and rebuild. Copy compat.ko to /lib/modules/2.6.32-358.11.1.el6.x86_64/extra/mlnx-ofa_kernel/compat and run depmod -a

  1. Install OFED normally; this should succeed on 2.6.32-358.11.1.el6.x86_64 and 2.6.32-358.11.1.el6_lustre.x86_64

  2. At the top of your ISO or tar file extract src/MLNX_OFED_SRC-2.0-2.0.5.tgz

  3. Look for SRPMS/ofa_kernel-2.0-OFED.2.0.2.0.5.g1593535.src.rpm

  4. rpm -ivh ofa_kernel-2.0-OFED.2.0.2.0.5.g1593535.src.rpm

  5. Go to where you do your RPM builds

  6. rpmbuild -bc SPECS/ofa_kernel.spec

  7. cd BUILD/ofa_kernel-2.0

  8. Edit compat/compat-3.3.c as above

  9. run make in BUILD/ofa_kernel-2.0
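For anyone scripting step 8, here is a minimal sketch of the edit. The function name is made up for illustration, and it assumes GNU sed and that the export sits on its own line exactly as quoted above. It only wraps that one line in a C comment and leaves everything else alone.

```shell
#!/bin/bash
# Hypothetical helper (not part of MLNX_OFED): comment out the duplicate
# __pskb_copy export in the compat source file given as $1.
# Assumes GNU sed (for -i) and that the export is on a line of its own.
comment_out_pskb_copy_export() {
    local src="$1"
    # Keep any leading indentation, wrap the statement in /* ... */.
    # Lines that are already commented out do not match and stay untouched.
    sed -i 's|^\([[:space:]]*\)\(EXPORT_SYMBOL_GPL(__pskb_copy);\)|\1/* \2 */|' "$src"
}

# Example, using the paths from the steps above:
#   cd BUILD/ofa_kernel-2.0
#   comment_out_pskb_copy_export compat/compat-3.3.c
#   make
```

After make finishes, copy compat.ko into place and run depmod -a as described in the HOWTO.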

I am attaching kernel modules for 358.11.1.el6 and 358.11.1.el6_lustre; of course you should never use a random binary from a forum stranger. Use at your own risk and only on a VM or sacrificial system.

[root@head01 ~]# uname -r

2.6.32-358.11.1.el6.x86_64

[root@head01 ~]# dmesg | grep compat

Backport based on git://beany.openfabrics.org/compat-rdma/compat.git 3d70f8c

compat.git: git://beany.openfabrics.org/compat-rdma/compat.git

[root@head01 ~]# lsmod | grep compat

compat 17872 0

Hi,

I’ve created a patch similar to aalba6675’s, and I’ve also added changes to the compat.mk file and compat-3.3.h header file in an effort to activate the changes only for EL6 kernels starting with 2.6.32-358.10.

Judging from the EL6 kernel RPM changelog, the addition of __pskb_copy was made in 2.6.32-358.10.1.el6 by the introduction of a patch similar to this one:

http://kernel.opensuse.org/cgit/kernel/commit/?id=117632e64d2a5f464e491fe221d7169a3814a77b

In order to use the existing mlnx_add_kernel_support.sh script to create patched kernel-ib RPMs, the patch needs to be integrated into the MLNX_OFED_SRC tarball. I’m using the following method to accomplish this:
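A rough way to check whether a given kernel already exports the symbol is to look it up in that kernel's Module.symvers. The sketch below assumes the matching kernel-devel package is installed and that the file lives at /usr/src/kernels/$(uname -r)/Module.symvers, which may differ on your system. The function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if symbol $1 is listed in the Module.symvers
# file $2. Each line there has the form: CRC <tab> symbol <tab> module <tab> type.
kernel_exports_symbol() {
    local sym="$1" symvers="$2"
    # Compare the second field exactly, so "__pskb_copy" does not match
    # longer names that merely contain it.
    awk -v sym="$sym" '$2 == sym { found = 1 } END { exit !found }' "$symvers"
}

# Example (the path is an assumption, see above):
#   if kernel_exports_symbol __pskb_copy "/usr/src/kernels/$(uname -r)/Module.symvers"; then
#       echo "kernel already exports __pskb_copy; compat must not export it again"
#   fi
```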

  1. Extract MLNX_OFED_SRC-2.0-2.0.5/SRPMS/ofa_kernel-2.0-OFED.2.0.2.0.5.g1593535.src.rpm from src/MLNX_OFED_SRC-2.0-2.0.5.tgz
  2. Apply the patch to the contents of ofa_kernel-2.0.tgz
  3. Rebuild ofa_kernel-2.0.tgz
  4. Rebuild the source RPM using the modified specfile, and place it at MLNX_OFED_SRC-2.0-2.0.5/SRPMS/ofa_kernel-2.0-OFED.2.0.2.0.5.g1593535.src.rpm
  5. Rebuild MLNX_OFED_SRC-2.0-2.0.5.tgz
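Steps 2 and 3 above (patch the contents of ofa_kernel-2.0.tgz, then rebuild the archive) could be scripted roughly as follows. This is only a sketch: repack_tgz_after is a hypothetical helper, not something shipped with MLNX_OFED, and it assumes the archive unpacks to a single top-level directory. The patch file name in the example is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: unpack the .tgz given as $1, run the remaining
# arguments as a command inside its top-level directory, then write the
# modified tree back over the original archive.
repack_tgz_after() {
    local tgz work top status=0
    tgz=$(readlink -f "$1") || return 1
    shift
    work=$(mktemp -d) || return 1

    # Assumes the archive contains exactly one top-level directory.
    tar -xzf "$tgz" -C "$work" &&
        top=$(ls "$work") &&
        ( cd "$work/$top" && "$@" ) &&
        tar -czf "$tgz" -C "$work" "$top" || status=1

    rm -rf "$work"
    return "$status"
}

# Example (patch path is a placeholder and must be absolute, because the
# command runs from inside the unpacked tree):
#   repack_tgz_after ofa_kernel-2.0.tgz patch -p1 -i /path/to/pskb_copy-fix.patch
```

If the command fails, the original archive is left untouched, since the tarball is only rewritten after the command succeeds.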

I’ve performed basic functionality tests with this patch on 2.6.32-358.6.2.el6 and 2.6.32-358.14.1.el6 kernels, and have not yet found any problems. I have not yet performed strenuous performance tests.

Larry

lpezzaglia - Nice. That’s probably similar to what needs to be done in the next release of Mellanox OFED.

Thomas Graf, the original author of the Red Hat patch that adds __pskb_copy to the EL6 kernel in 358.10.1, mentioned in private email a few minutes ago that it was added on purpose (i.e., it is not a bug in EL6).

So, Mellanox’s compat module will need to work with it, probably like you’ve done.

(note - edited for typo fixes)