Hi everyone,
We have a dozen Dell servers with CX3-series InfiniBand adapters. The M630 servers are connected to a Dell M4001T InfiniBand switch, and the newly arrived R740XD servers are connected to an SX6025 switch. The two switches are linked to each other with two QSFP cables.
We use the R740XD servers as GlusterFS servers. However, vdbench tests run with the M630 servers as clients consistently failed with brick-disconnection errors, which made us suspect the InfiniBand connection. The dmesg output on both server models is flooded with warnings that may be related to MLNX OFED.
Can somebody advise whether something is actually wrong with the adapters, or whether this is a compatibility issue?
Thanks,
Wade
M630:
[7778551.705403] ------------[ cut here ]------------
[7778551.705407] WARNING: CPU: 12 PID: 145278 at /var/tmp/OFED_topdir/BUILD/mlnx-ofa_kernel-4.6/obj/default/drivers/infiniband/core/cma.c:689 cma_acquire_dev_by_src_ip+0x21e/0x230 [rdma_cm]
[7778551.705409] Modules linked in: socwatch2_11(OE) sep5(OE) socperf3(OE) pax(OE) nls_utf8 isofs loop nfsv3 nfs_acl vtsspp(OE) sep4_1(OE) socperf2_0(OE) fuse rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas ipmi_si ipmi_devintf joydev pcspkr ipmi_msghandler acpi_power_meter wmi shpchp mei_me mei lpc_ich sunrpc knem(OE) ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200
[7778551.705450] i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci mlx4_core(OE) libahci crct10dif_pclmul tg3 crct10dif_common crc32c_intel libata megaraid_sas i2c_core mlx_compat(OE) ptp devlink pps_core [last unloaded: pax]
[7778551.705466] CPU: 12 PID: 145278 Comm: glusterrdmaehan Tainted: G W OE ------------ 3.10.0-862.el7.x86_64 #1
[7778551.705468] Hardware name: Dell Inc. PowerEdge M630/0R10KJ, BIOS 2.8.0 05/23/2018
[7778551.705469] Call Trace:
[7778551.705472] [] dump_stack+0x19/0x1b
[7778551.705475] [] __warn+0xd8/0x100
[7778551.705478] [] warn_slowpath_null+0x1d/0x20
[7778551.705482] [] cma_acquire_dev_by_src_ip+0x21e/0x230 [rdma_cm]
[7778551.705486] [] rdma_bind_addr+0x91f/0x9e0 [rdma_cm]
[7778551.705489] [] ? path_openat+0x172/0x640
[7778551.705492] [] ? mutex_lock+0x12/0x2f
[7778551.705495] [] ucma_bind+0x93/0xe0 [rdma_ucm]
[7778551.705499] [] ucma_write+0xd8/0x160 [rdma_ucm]
[7778551.705502] [] vfs_write+0xc0/0x1f0
[7778551.705505] [] SyS_write+0x7f/0xf0
[7778551.705508] [] system_call_fastpath+0x1c/0x21
[7778551.705510] ---[ end trace f51c264d0e36b3f0 ]---
[7778551.705513] ------------[ cut here ]------------
R740XD:
OE ------------ 3.10.0-1160.el7.x86_64 #1
[863438.434467] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.12.2 07/09/2021
[863438.434471] Call Trace:
[863438.434473] [] dump_stack+0x19/0x1b
[863438.434476] [] __warn+0xd8/0x100
[863438.434479] [] warn_slowpath_null+0x1d/0x20
[863438.434482] [] cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]
[863438.434485] [] rdma_bind_addr+0x8fa/0x990 [rdma_cm]
[863438.434489] [] ? mutex_lock+0x12/0x2f
[863438.434492] [] ucma_bind+0xac/0x100 [rdma_ucm]
[863438.434494] [] ucma_write+0x101/0x180 [rdma_ucm]
[863438.434497] [] vfs_write+0xc0/0x1f0
[863438.434501] [] SyS_write+0x7f/0xf0
[863438.434503] [] system_call_fastpath+0x25/0x2a
[863438.434505] ---[ end trace 748a792a8e2e93f8 ]---
[863438.434512] ------------[ cut here ]------------
[863438.434515] WARNING: CPU: 12 PID: 210017 at /var/tmp/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/drivers/infiniband/core/cma.c:709 cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]
[863438.434518] Modules linked in: binfmt_misc iptable_filter ip_tables iscsi_target_mod dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio joydev target_core_user uio target_core_mod loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache tun rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) bonding mlx4_en(OE) bridge stp llc ip_set nfnetlink sunrpc vfat fat iTCO_wdt dell_smbios iTCO_vendor_support dell_wmi_descriptor dcdbas skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif i2c_i801 mei_me mei lpc_ich wmi ipmi_si ipmi_devintf ipmi_msghandler
[863438.434557] acpi_power_meter acpi_pad knem(OE) xfs libcrc32c mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ahci drm mlx4_core(OE) libahci megaraid_sas tg3 libata bnxt_en mlx_compat(OE) ptp devlink pps_core drm_panel_orientation_quirks nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ip_tables]
[863438.434578] CPU: 12 PID: 210017 Comm: glusterrdmaehan Kdump: loaded Tainted: G W OE ------------ 3.10.0-1160.el7.x86_64 #1
[863438.434579] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.12.2 07/09/2021
[863438.434580] Call Trace:
[863438.434583] [] dump_stack+0x19/0x1b
[863438.434588] [] __warn+0xd8/0x100
[863438.434591] [] warn_slowpath_null+0x1d/0x20
[863438.434595] [] cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]
[863438.434598] [] rdma_bind_addr+0x8fa/0x990 [rdma_cm]
[863438.434600] [] ? mutex_lock+0x12/0x2f
[863438.434605] [] ucma_bind+0xac/0x100 [rdma_ucm]
[863438.434607] [] ucma_write+0x101/0x180 [rdma_ucm]
[863438.434610] [] vfs_write+0xc0/0x1f0
[863438.434613] [] SyS_write+0x7f/0xf0
[863438.434619] [] system_call_fastpath+0x25/0x2a
[863438.434620] ---[ end trace 748a792a8e2e93f9 ]---
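
Since the logs are flooded, it may help to check whether all the warnings come from the same place. Below is a minimal Python sketch that counts kernel WARNING lines by source file, line number, function and module. The function name `summarize_warnings` and the regular expression are my own assumptions, written against the WARNING lines pasted above, so they may not match other kernel log formats. It only summarizes the log and does not diagnose the cause.

```python
import re
from collections import Counter

# Assumed line shape, taken from the WARNING lines in the traces above:
# ... WARNING: CPU: 12 PID: 145278 at /path/cma.c:689 func+0x21e/0x230 [rdma_cm]
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<path>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def summarize_warnings(dmesg_text):
    """Count WARNING sites as (file, line, function, module) tuples."""
    counts = Counter()
    for log_line in dmesg_text.splitlines():
        match = WARNING_RE.search(log_line)
        if match is None:
            continue
        # Keep only the file name; the build directory prefix is noise.
        source_file = match.group("path").rsplit("/", 1)[-1]
        site = (
            source_file,
            int(match.group("line")),
            match.group("func"),
            match.group("module"),  # None if no module tag is present
        )
        counts[site] += 1
    return counts


# One WARNING line from each server model, copied from the logs above.
sample = (
    "[7778551.705407] WARNING: CPU: 12 PID: 145278 at /var/tmp/OFED_topdir/"
    "BUILD/mlnx-ofa_kernel-4.6/obj/default/drivers/infiniband/core/cma.c:689 "
    "cma_acquire_dev_by_src_ip+0x21e/0x230 [rdma_cm]\n"
    "[863438.434515] WARNING: CPU: 12 PID: 210017 at /var/tmp/OFED_topdir/"
    "BUILD/mlnx-ofa_kernel-4.9/obj/default/drivers/infiniband/core/cma.c:709 "
    "cma_acquire_dev_by_src_ip+0x215/0x220 [rdma_cm]\n"
)

for site, count in summarize_warnings(sample).most_common():
    print(count, site)
```

To run it against a full log, save the output of `dmesg` to a file and pass the file contents to `summarize_warnings`. In the excerpts above, both servers warn in `cma_acquire_dev_by_src_ip`, at `cma.c:689` on the M630 (mlnx-ofa_kernel-4.6) and at `cma.c:709` on the R740XD (mlnx-ofa_kernel-4.9).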