dmesg flooded with OFED-related warnings

Hi everyone,

We have a dozen Dell servers with ConnectX-3-series InfiniBand adapters. The M630 servers are connected to a Dell M4001T InfiniBand switch, and the newly arrived R740XD servers are connected to an SX6025 switch. The two switches are connected to each other with two QSFP cables.

We use the R740XD machines as GlusterFS servers. However, vdbench tests using the M630 machines as clients constantly fail with brick-disconnection errors, which made us suspect the interconnect. The dmesg output is flooded with warnings that appear to be related to MLNX_OFED.

Can somebody advise whether something is actually wrong with the adapters, or whether this is a compatibility issue?

Thanks,

Wade

M630:

[7778551.705403] ------------[ cut here ]------------

[7778551.705407] WARNING: CPU: 12 PID: 145278 at /var/tmp/OFED_topdir/BUILD/mlnx-ofa_kernel-4.6/obj/default/drivers/infiniband/core/cma.c:689 cma_acquire_dev_by_src_ip+0x21e/0x230 [rdma_cm]

[7778551.705409] Modules linked in: socwatch2_11(OE) sep5(OE) socperf3(OE) pax(OE) nls_utf8 isofs loop nfsv3 nfs_acl vtsspp(OE) sep4_1(OE) socperf2_0(OE) fuse rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas ipmi_si ipmi_devintf joydev pcspkr ipmi_msghandler acpi_power_meter wmi shpchp mei_me mei lpc_ich sunrpc knem(OE) ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200

[7778551.705450] i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci mlx4_core(OE) libahci crct10dif_pclmul tg3 crct10dif_common crc32c_intel libata megaraid_sas i2c_core mlx_compat(OE) ptp devlink pps_core [last unloaded: pax]

[7778551.705466] CPU: 12 PID: 145278 Comm: glusterrdmaehan Tainted: G W OE ------------ 3.10.0-862.el7.x86_64 #1

[7778551.705468] Hardware name: Dell Inc. PowerEdge M630/0R10KJ, BIOS 2.8.0 05/23/2018

[7778551.705469] Call Trace:

[7778551.705472] [] dump_stack+0x19/0x1b

[7778551.705475] [] __warn+0xd8/0x100

[7778551.705478] [] warn_slowpath_null+0x1d/0x20

[7778551.705482] [] cma_acquire_dev_by_src_ip+0x21e/0x230 [rdma_cm]

[7778551.705486] [] rdma_bind_addr+0x91f/0x9e0 [rdma_cm]

[7778551.705489] [] ? path_openat+0x172/0x640

[7778551.705492] [] ? mutex_lock+0x12/0x2f

[7778551.705495] [] ucma_bind+0x93/0xe0 [rdma_ucm]

[7778551.705499] [] ucma_write+0xd8/0x160 [rdma_ucm]

[7778551.705502] [] vfs_write+0xc0/0x1f0

[7778551.705505] [] SyS_write+0x7f/0xf0

[7778551.705508] [] system_call_fastpath+0x1c/0x21

[7778551.705510] ---[ end trace f51c264d0e36b3f0 ]---

[7778551.705513] ------------[ cut here ]------------

R740XD:

OE ------------ 3.10.0-1160.el7.x86_64 #1

[863438.434467] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.12.2 07/09/2021

[863438.434471] Call Trace:

[863438.434473] [] dump_stack+0x19/0x1b

[863438.434476] [] __warn+0xd8/0x100

[863438.434479] [] warn_slowpath_null+0x1d/0x20

[863438.434482] [] cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]

[863438.434485] [] rdma_bind_addr+0x8fa/0x990 [rdma_cm]

[863438.434489] [] ? mutex_lock+0x12/0x2f

[863438.434492] [] ucma_bind+0xac/0x100 [rdma_ucm]

[863438.434494] [] ucma_write+0x101/0x180 [rdma_ucm]

[863438.434497] [] vfs_write+0xc0/0x1f0

[863438.434501] [] SyS_write+0x7f/0xf0

[863438.434503] [] system_call_fastpath+0x25/0x2a

[863438.434505] ---[ end trace 748a792a8e2e93f8 ]---

[863438.434512] ------------[ cut here ]------------

[863438.434515] WARNING: CPU: 12 PID: 210017 at /var/tmp/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/drivers/infiniband/core/cma.c:709 cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]

[863438.434518] Modules linked in: binfmt_misc iptable_filter ip_tables iscsi_target_mod dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio joydev target_core_user uio target_core_mod loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache tun rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) bonding mlx4_en(OE) bridge stp llc ip_set nfnetlink sunrpc vfat fat iTCO_wdt dell_smbios iTCO_vendor_support dell_wmi_descriptor dcdbas skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif i2c_i801 mei_me mei lpc_ich wmi ipmi_si ipmi_devintf ipmi_msghandler

[863438.434557] acpi_power_meter acpi_pad knem(OE) xfs libcrc32c mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ahci drm mlx4_core(OE) libahci megaraid_sas tg3 libata bnxt_en mlx_compat(OE) ptp devlink pps_core drm_panel_orientation_quirks nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ip_tables]

[863438.434578] CPU: 12 PID: 210017 Comm: glusterrdmaehan Kdump: loaded Tainted: G W OE ------------ 3.10.0-1160.el7.x86_64 #1

[863438.434579] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.12.2 07/09/2021

[863438.434580] Call Trace:

[863438.434583] [] dump_stack+0x19/0x1b

[863438.434588] [] __warn+0xd8/0x100

[863438.434591] [] warn_slowpath_null+0x1d/0x20

[863438.434595] [] cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]

[863438.434598] [] rdma_bind_addr+0x8fa/0x990 [rdma_cm]

[863438.434600] [] ? mutex_lock+0x12/0x2f

[863438.434605] [] ucma_bind+0xac/0x100 [rdma_ucm]

[863438.434607] [] ucma_write+0x101/0x180 [rdma_ucm]

[863438.434610] [] vfs_write+0xc0/0x1f0

[863438.434613] [] SyS_write+0x7f/0xf0

[863438.434619] [] system_call_fastpath+0x25/0x2a

[863438.434620] ---[ end trace 748a792a8e2e93f9 ]---

Hello,

Without more information about the environment, such as the HCA model, the installed MLNX_OFED version, the firmware version installed on the adapter, and other details, it is not possible to determine whether this is a compatibility issue.

Please be advised that we do not provide direct support for vdbench issues, and we recommend investigating the cause of any Mellanox adapter issues separately from the Gluster environment first, if possible.

We recommend ensuring that you have the latest supported MLNX_OFED and Mellanox ConnectX-3 firmware versions installed for your model.
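As a quick sanity check before (and after) upgrading, the currently installed versions can be read on each node. This is a minimal sketch; the PCI address below is a placeholder for your adapter's actual address, which you can find with `lspci`:

```shell
# Show the installed MLNX_OFED version (short form)
ofed_info -s

# Show adapter model, firmware version, and port state
# as reported by the verbs stack
ibv_devinfo | grep -E 'hca_id|fw_ver|state'

# Query the firmware image directly with the Mellanox firmware tools;
# replace 04:00.0 with your adapter's PCI address
# (e.g. from: lspci | grep -i mellanox)
mstflint -d 04:00.0 query
```

Comparing the `fw_ver` reported on the M630 and R740XD nodes against the latest release for your adapter model is usually the fastest way to rule out a firmware mismatch.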

For the latest Mellanox ConnectX-3 adapter firmware and release notes, please visit the following link:

https://www.mellanox.com/support/firmware/firmware-downloads

For the latest MLNX_OFED download links and User Manual, please visit the following link:

https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

Once these are installed, please test connectivity with the utilities documented within the MLNX_OFED User Manual. There are also a number of benchmark utilities included with MLNX_OFED that may be useful in your testing for network and RDMA performance:

https://community.mellanox.com/s/article/perftest-package
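For example, a minimal raw-RDMA bandwidth test between two nodes using the perftest tools would look like the following sketch (`server-ib` is a placeholder for the server node's IPoIB hostname or address):

```shell
# Verify the InfiniBand link is up and at the expected rate
ibstat | grep -E 'State|Rate'

# On the server node: start an ib_write_bw listener
ib_write_bw

# On the client node: run the bandwidth test against the server
ib_write_bw server-ib
```

If the perftest run completes cleanly between the same pair of nodes that show brick disconnections, the problem is more likely above the RDMA layer (e.g. in the Gluster transport configuration) than in the adapters themselves.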

If you find that you need further support in debugging a connectivity issue, please consider opening a case with our support team. If you do not have a current support contract, please email the team at Networking-contracts@nvidia.com to set up a valid support contract.

Thank you,

-Nvidia Network Support