ConnectX-5 error: Failed to write to /dev/nvme-fabrics: Invalid cross-device link

I have 2 ConnectX-5 NICs in my PC (Ubuntu 18.04, kernel 4.15.0-36). They are on 2 different subnets (192.168.1.100/24, 192.168.2.100/24). I have 4 NVMe-oF targets and I try to connect to them from my PC:

sudo nvme connect -t rdma -a 192.168.2.52 -n nqn.2018-09.com.52 -s 4420

sudo nvme connect -t rdma -a 192.168.1.9 -n nqn.2018-09.com.9 -s 4420

sudo nvme connect -t rdma -a 192.168.2.54 -n nqn.2018-09.com.54 -s 4420

sudo nvme connect -t rdma -a 192.168.1.2 -n nqn.2018-09.com.2 -s 4420

Failed to write to /dev/nvme-fabrics: Invalid cross-device link

I disconnect all these targets and reboot the PC. Then I try to connect to these targets in a different order:

sudo nvme connect -t rdma -a 192.168.1.2 -n nqn.2018-09.com.2 -s 4420

sudo nvme connect -t rdma -a 192.168.1.9 -n nqn.2018-09.com.9 -s 4420

sudo nvme connect -t rdma -a 192.168.2.52 -n nqn.2018-09.com.52 -s 4420

Failed to write to /dev/nvme-fabrics: Invalid cross-device link
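
For reference, between the two attempts I disconnect every controller that did connect before rebooting, using nvme-cli's disconnect-by-NQN form (just a sketch, repeated for each NQN that connected above):

sudo nvme disconnect -n nqn.2018-09.com.52

sudo nvme disconnect -n nqn.2018-09.com.9

sudo nvme disconnect -n nqn.2018-09.com.54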

I have googled a bit. It seems there are two reported instances of this error message related to Mellanox NICs, but I don't understand the nature of the error and I don't see any workaround. Any suggestions? Here is some info from my PC.

yao@Host1:~$ lspci | grep Mellan

15:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

yao@Host1:~$ lspci -vvv -s 15:00.0

15:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]

Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-

Latency: 0, Cache Line Size: 32 bytes

Interrupt: pin A routed to IRQ 33

NUMA node: 0

Region 0: Memory at 387ffe000000 (64-bit, prefetchable) [size=32M]

Expansion ROM at 90500000 [disabled] [size=1M]

Capabilities:

Kernel driver in use: mlx5_core

Kernel modules: mlx5_core

yao@Host1:~$ sudo lsmod | grep mlx

mlx5_ib 196608 0

ib_core 225280 9 ib_cm,rdma_cm,ib_umad,nvme_rdma,ib_uverbs,iw_cm,mlx5_ib,ib_ucm,rdma_ucm

mlx5_core 544768 1 mlx5_ib

mlxfw 20480 1 mlx5_core

devlink 45056 1 mlx5_core

ptp 20480 2 e1000e,mlx5_core

yao@Host1:~$ modinfo mlx5_core

filename: /lib/modules/4.15.0-36-generic/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko

version: 5.0-0

license: Dual BSD/GPL

description: Mellanox Connect-IB, ConnectX-4 core driver

author: Eli Cohen eli@mellanox.com

srcversion: C271CE9036D77E924A8E038

alias: pci:v000015B3d0000A2D3sv*sd*bc*sc*i*

alias: pci:v000015B3d0000A2D2sv*sd*bc*sc*i*

alias: pci:v000015B3d0000101Csv*sd*bc*sc*i*

alias: pci:v000015B3d0000101Bsv*sd*bc*sc*i*

alias: pci:v000015B3d0000101Asv*sd*bc*sc*i*

alias: pci:v000015B3d00001019sv*sd*bc*sc*i*

alias: pci:v000015B3d00001018sv*sd*bc*sc*i*

alias: pci:v000015B3d00001017sv*sd*bc*sc*i*

alias: pci:v000015B3d00001016sv*sd*bc*sc*i*

alias: pci:v000015B3d00001015sv*sd*bc*sc*i*

alias: pci:v000015B3d00001014sv*sd*bc*sc*i*

alias: pci:v000015B3d00001013sv*sd*bc*sc*i*

alias: pci:v000015B3d00001012sv*sd*bc*sc*i*

alias: pci:v000015B3d00001011sv*sd*bc*sc*i*

depends: devlink,ptp,mlxfw

retpoline: Y

intree: Y

name: mlx5_core

vermagic: 4.15.0-36-generic SMP mod_unload

signat: PKCS#7

signer:

sig_key:

sig_hashalgo: md4

parm: debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)

parm: prof_sel:profile selector. Valid range 0 - 2 (uint)

yao@Host1:~$ dmesg

[ 78.772669] nvme nvme0: queue_size 128 > ctrl maxcmd 64, clamping down

[ 78.856378] nvme nvme0: creating 8 I/O queues.

[ 88.297468] nvme nvme0: new ctrl: NQN "nqn.2018-09.com.52", addr 192.168.2.52:4420

[ 101.561197] nvme nvme1: queue_size 128 > ctrl maxcmd 64, clamping down

[ 101.644852] nvme nvme1: creating 8 I/O queues.

[ 111.083806] nvme nvme1: new ctrl: NQN "nqn.2018-09.com.9", addr 192.168.1.9:4420

[ 151.368016] nvme nvme2: queue_size 128 > ctrl maxcmd 64, clamping down

[ 151.451717] nvme nvme2: creating 8 I/O queues.

[ 160.893710] nvme nvme2: new ctrl: NQN "nqn.2018-09.com.54", addr 192.168.2.54:4420

[ 169.789368] nvme nvme3: queue_size 128 > ctrl maxcmd 64, clamping down

[ 169.873068] nvme nvme3: creating 8 I/O queues.

[ 177.657661] nvme nvme3: Connect command failed, error wo/DNR bit: -16402

[ 177.657669] nvme nvme3: failed to connect queue: 4 ret=-18

[ 177.951379] nvme nvme3: Reconnecting in 10 seconds...

[ 188.138167] general protection fault: 0000 [#1] SMP PTI

[ 188.138172] Modules linked in: nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic usbhid hid

[ 188.138248] radeon i2c_algo_bit ttm mlx5_core drm_kms_helper syscopyarea e1000e sysfillrect mlxfw sysimgblt devlink ahci fb_sys_fops ptp psmouse drm pps_core libahci wmi

[ 188.138272] CPU: 0 PID: 390 Comm: kworker/u56:7 Not tainted 4.15.0-36-generic #39-Ubuntu

[ 188.138275] Hardware name: HP HP Z4 G4 Workstation/81C5, BIOS P62 v01.51 05/08/2018

[ 188.138283] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]

[ 188.138290] RIP: 0010:nvme_rdma_alloc_queue+0x3c/0x190 [nvme_rdma]

[ 188.138294] RSP: 0018:ffffc04c041e3e08 EFLAGS: 00010286

[ 188.138298] RAX: 0000000000000000 RBX: 890a8eecb83679a9 RCX: ffff9f9b5ec10820

[ 188.138301] RDX: ffffffffc0cd5600 RSI: ffffffffc0cd43ab RDI: ffff9f9ad037c000

[ 188.138304] RBP: ffffc04c041e3e28 R08: 000000000000020c R09: 0000000000000000

[ 188.138307] R10: 0000000000000000 R11: 000000000000020f R12: ffff9f9ad037c000

[ 188.138309] R13: 0000000000000000 R14: 0000000000000020 R15: 0000000000000000

[ 188.138313] FS: 0000000000000000(0000) GS:ffff9f9b5f200000(0000) knlGS:0000000000000000

[ 188.138316] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[ 188.138319] CR2: 00007f347e159fb8 CR3: 00000001a740a006 CR4: 00000000003606f0

[ 188.138323] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 188.138325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[ 188.138327] Call Trace:

[ 188.138335] nvme_rdma_configure_admin_queue+0x22/0x2d0 [nvme_rdma]

[ 188.138341] nvme_rdma_reconnect_ctrl_work+0x27/0xd0 [nvme_rdma]

[ 188.138349] process_one_work+0x1de/0x410

[ 188.138354] worker_thread+0x32/0x410

[ 188.138361] kthread+0x121/0x140

[ 188.138365] ? process_one_work+0x410/0x410

[ 188.138370] ? kthread_create_worker_on_cpu+0x70/0x70

[ 188.138378] ret_from_fork+0x35/0x40

[ 188.138381] Code: 89 e5 41 56 41 55 41 54 53 48 8d 1c c5 00 00 00 00 49 89 fc 49 89 c5 49 89 d6 48 29 c3 48 c7 c2 00 56 cd c0 48 c1 e3 04 48 03 1f <48> 89 7b 18 48 8d 7b 58 c7 43 50 00 00 00 00 e8 50 05 40 ce 45

[ 188.138443] RIP: nvme_rdma_alloc_queue+0x3c/0x190 [nvme_rdma] RSP: ffffc04c041e3e08

[ 188.138447] ---[ end trace c9efe5e9bc3591f2 ]---

yao@Host1:~$ dmesg | grep mlx

[ 2.510581] mlx5_core 0000:15:00.0: enabling device (0100 -> 0102)

[ 2.510732] mlx5_core 0000:15:00.0: firmware version: 16.21.2010

[ 4.055064] mlx5_core 0000:15:00.0: Port module event: module 0, Cable plugged

[ 4.061558] mlx5_core 0000:21:00.0: enabling device (0100 -> 0102)

[ 4.061775] mlx5_core 0000:21:00.0: firmware version: 16.21.2010

[ 4.966172] mlx5_core 0000:21:00.0: Port module event: module 0, Cable plugged

[ 4.972503] mlx5_core 0000:15:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)

[ 5.110943] mlx5_core 0000:21:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)

[ 5.247925] mlx5_core 0000:15:00.0 enp21s0: renamed from eth0

[ 5.248600] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0

[ 5.275912] mlx5_core 0000:21:00.0 enp33s0: renamed from eth1

[ 23.736990] mlx5_core 0000:21:00.0 enp33s0: Link up

[ 23.953415] mlx5_core 0000:15:00.0 enp21s0: Link up

[ 188.138172] Modules linked in: nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic usbhid hid

[ 188.138248] radeon i2c_algo_bit ttm mlx5_core drm_kms_helper syscopyarea e1000e sysfillrect mlxfw sysimgblt devlink ahci fb_sys_fops ptp psmouse drm pps_core libahci wmi

[ 662.506623] Modules linked in: cfg80211 nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic

Please run # dmesg | grep "enabling port" and check whether you get "...nvmet_rdma: enabling port...".

Was this solved? It seems like an issue with resource allocation. Can you try using 4 queues instead of 8?
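
For example, something like this should request 4 I/O queues per controller (just a sketch, assuming your nvme-cli supports the -i/--nr-io-queues option; the address and NQN here are your first target from above):

sudo nvme connect -t rdma -a 192.168.2.52 -n nqn.2018-09.com.52 -s 4420 -i 4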