Installing MLNX OFED 4.7 (with the "--upstream-libs" option) on CentOS 7.6 with kernel 4.20.0 brings the MLNX NIC down

We have a machine running CentOS 7.6 with a custom-built 4.20.0 kernel.

If we install MLNX OFED 4.7 with "--add-kernel-support --skip-repo", the MLNX NIC comes up and works.

However, if we install MLNX OFED 4.7 with the additional option "--upstream-libs", the NIC is shown as down after an openibd restart or even a reboot.

Any idea why this happens and how to diagnose it?

Let me provide more details.

We have several systems with CentOS 7.6 installed. We compiled the 4.20.0 kernel on one of them (with the Mellanox driver enabled) and packaged it as RPMs.

We then installed the kernel and kernel-devel RPMs on all machines. They all boot successfully, and they all recognize the MLNX NIC with the correct state:


4: enp94s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 50:6b:4b:aa:e0:72 brd ff:ff:ff:ff:ff:ff
    inet 10.2.2.25/24 brd 10.2.2.255 scope global noprefixroute enp94s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::75f7:872c:b059:a2a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
5: enp94s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 50:6b:4b:aa:e0:73 brd ff:ff:ff:ff:ff:ff
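
For comparing link state across several machines, the output above can be reduced to a per-interface summary. The following is only a rough sketch: the function name `parse_link_states` and the embedded sample text are illustrative assumptions, and it expects the text produced by `ip addr` to be passed in as a string.

```python
import re

# Header line of each interface in `ip addr` output, for example:
# "4: enp94s0f0: <BROADCAST,...> mtu 1500 qdisc mq state UP group default qlen 1000"
IFACE_RE = re.compile(
    r"^\s*(\d+):\s+([^:@\s]+)(?:@\S+)?:\s+<([^>]*)>.*?\bstate\s+(\S+)"
)


def parse_link_states(ip_addr_output):
    """Return {interface: (state, has_carrier)} parsed from `ip addr` text.

    Hypothetical helper for illustration; not part of iproute2 or MLNX OFED.
    """
    states = {}
    for line in ip_addr_output.splitlines():
        match = IFACE_RE.match(line)
        if not match:
            continue  # address lines, lifetimes, and blank lines are skipped
        _, name, flags, state = match.groups()
        flag_set = set(flags.split(","))
        has_carrier = "LOWER_UP" in flag_set and "NO-CARRIER" not in flag_set
        states[name] = (state, has_carrier)
    return states


# Sample text copied from the two interface header lines in this post.
SAMPLE = """\
4: enp94s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 50:6b:4b:aa:e0:72 brd ff:ff:ff:ff:ff:ff
5: enp94s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 50:6b:4b:aa:e0:73 brd ff:ff:ff:ff:ff:ff
"""

if __name__ == "__main__":
    for name, (state, carrier) in parse_link_states(SAMPLE).items():
        print(name, state, "carrier" if carrier else "no carrier")
```

An interface that is missing from the result entirely (as opposed to being reported DOWN) would suggest that the netdev was never created, which is a different symptom from a link that is simply down.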

As we'd like to run both RDMA and DPDK on our MLNX NIC, we chose to install MLNX OFED with "./mlnxofedinstall --add-kernel-support --upstream-libs". Compilation and installation succeeded, but after a reboot enp94s0f0 and enp94s0f1 disappeared.

If we check with "ibv_devinfo", both ports are displayed as down.

We also see the following logs in /var/log/dmesg:


Nov 09 18:29:37 kernel: Compat-mlnx-ofed backport release: 1c4bf42
Nov 09 18:29:37 kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git 1c4bf42
Nov 09 18:29:37 kernel: compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_core_create_qp
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_core_create_qp (err -22)
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_core_destroy_rq_tracked
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_core_destroy_rq_tracked (err -22)
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_eswitch_add_send_to_vport_rule
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_eswitch_add_send_to_vport_rule (err -22)
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_modify_header_alloc
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_modify_header_alloc (err -22)
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_db_free
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_db_free (err -22)
Nov 09 18:29:37 systemd-udevd[1341]: Error running install command for mlx5_ib
Nov 09 18:29:37 systemd-udevd[1294]: modprobe: ERROR: could not insert 'mlx5_ib': Invalid argument
Nov 09 18:31:37 root[2240]: openibd: start(): Detected loaded old version of module 'mlx5_core', calling stop...
Nov 09 18:31:37 systemd[1]: rdma.service: main process exited, code=exited, status=1/FAILURE
Nov 09 18:31:37 systemd[1]: Failed to start Initialize the iWARP/InfiniBand/RDMA stack in the kernel.
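
The "disagrees about version of symbol" lines generally mean that mlx5_ib was built against symbol checksums that differ from those exported by whatever provides the symbol at load time. Combined with the openibd message about an old mlx5_core being loaded, one possible reading (an assumption, not something confirmed in this thread) is that a different mlx5_core build was loaded than the one mlx5_ib was compiled against. As a first diagnostic step, the mismatches can be grouped per module. The sketch below is illustrative only: `find_symbol_mismatches` and `SAMPLE_LOG` are hypothetical names, and the regular expression assumes the log line format shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "... kernel: mlx5_ib: disagrees about version of symbol mlx5_core_create_qp"
MISMATCH_RE = re.compile(
    r"kernel: (\S+): disagrees about version of symbol (\S+)"
)


def find_symbol_mismatches(log_text):
    """Return {module: [symbols]} for every symbol-version mismatch found.

    Hypothetical helper for illustration; symbols keep their first-seen order
    and duplicates are dropped.
    """
    mismatches = defaultdict(list)
    for line in log_text.splitlines():
        match = MISMATCH_RE.search(line)
        if match:
            module, symbol = match.groups()
            if symbol not in mismatches[module]:
                mismatches[module].append(symbol)
    return dict(mismatches)


# A few lines copied from the log in this post.
SAMPLE_LOG = """\
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_core_create_qp
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_core_create_qp (err -22)
Nov 09 18:29:37 kernel: mlx5_ib: disagrees about version of symbol mlx5_db_free
Nov 09 18:29:37 kernel: mlx5_ib: Unknown symbol mlx5_db_free (err -22)
"""

if __name__ == "__main__":
    for module, symbols in find_symbol_mismatches(SAMPLE_LOG).items():
        print(f"{module}: {len(symbols)} mismatched symbol(s): {', '.join(symbols)}")
```

All of the mismatched symbols here start with mlx5_, which points at mlx5_core as the provider. A reasonable next check, under the same assumption, would be to compare the file reported by "modinfo -n mlx5_core" and its srcversion field with the value in /sys/module/mlx5_core/srcversion for the module that is actually loaded.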

Hello Jacky,

Hope your day is going well.

Do you still need help with this issue? If so, I can convert it into a support ticket and we can continue working on it there.

Thanks,

Abigail