CX-5 EN, error 881 in dmesg

Hello there,
I'm trying to run nvme discover against an NVMe target.

The nvme utility freezes for some time and then reports:
#nvme discover -t rdma -a 172.16.0.35

Failed to write to /dev/nvme-fabrics: I/O Error
failed to add controller, error failed to write to nvme-fabrics device

dmesg reports:
nvme nvme2: I/O tag 0 (0000) opcode 0x7f (Fabrics Cmd) QID 0 timeout
nvme nvme2: Connect command failed, error wo/DNR bit: 881
nvme nvme2: failed to connect queue: 0 ret=881
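
For reference, the 881 in these dmesg lines is a decimal NVMe status value. A minimal sketch of decoding it, assuming the usual layout of the value the kernel prints (status code in bits 0-7, status code type in bits 8-10); the function name `decode_nvme_status` is made up for illustration:

```shell
# Split a decimal NVMe status (as printed in dmesg) into its fields.
# Assumption: bits 0-7 = status code (SC), bits 8-10 = status code type (SCT).
decode_nvme_status() {
    status=$1
    printf 'status=0x%x sct=%d sc=0x%02x\n' \
        "$status" $(( (status >> 8) & 0x7 )) $(( status & 0xff ))
}

decode_nvme_status 881
```

Under that assumption 881 is 0x371, i.e. SCT 3 (path-related status) with SC 0x71, which the NVMe specification lists as "Command Aborted By Host". That would fit the Connect command timing out on the host side rather than being rejected by the target.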

The network works fine; the target responds to pings:

#ping 172.16.0.35
PING 172.16.0.35 (172.16.0.35) 56(84) bytes of data.
64 bytes from 172.16.0.35: icmp_seq=1 ttl=63 time=0.174 ms
64 bytes from 172.16.0.35: icmp_seq=2 ttl=63 time=0.122 ms
^C
--- 172.16.0.35 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1056ms
rtt min/avg/max/mdev = 0.122/0.148/0.174/0.026 ms
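
Note that ping only exercises plain IP connectivity; nvme-rdma additionally needs a working RDMA (RoCE) path on the same interface. A rough sketch for spotting RDMA links that are not up, assuming the iproute2 `rdma` tool is installed and prints lines of the form `link <dev>/<port> state <STATE> ...`; the helper name `rdma_links_not_active` is hypothetical:

```shell
# Print RDMA links whose state is not ACTIVE.
# Reads `rdma link show` output on stdin. Assumed line format:
#   link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens4f0np0
rdma_links_not_active() {
    awk '$1 == "link" {
        for (i = 2; i < NF; i++)
            if ($i == "state" && $(i + 1) != "ACTIVE")
                print $2
    }'
}

# Usage on the initiator (needs the rdma tool from iproute2):
#   rdma link show | rdma_links_not_active
```

If every link is ACTIVE, an RDMA-level test between the two hosts, for example with `rping` from rdma-core, would exercise the rdma_cm path that ping does not touch.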

I have two more identical boxes, and they work fine: nvme discover and nvme connect run with no errors.

lspci -v | grep Mellanox

01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
21:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
21:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

# ethtool -i ens4f0np0
driver: mlx5_core
version: 6.12.34-6.12-alt1
firmware-version: 16.35.4030 (MT_0000000013)
expansion-rom-version:
bus-info: 0000:21:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

lsmod | grep nvme
nvme_rdma 49152 0
nvme_fabrics 36864 1 nvme_rdma
rdma_cm 155648 6 rpcrdma,ib_srpt,nvme_rdma,ib_iser,ib_isert,rdma_ucm
ib_core 516096 13 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,nvme_rdma,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

#lsmod | grep rdma
nvme_rdma 49152 0
nvme_fabrics 36864 1 nvme_rdma
rpcrdma 454656 0
sunrpc 843776 1 rpcrdma
rdma_ucm 32768 0
rdma_cm 155648 6 rpcrdma,ib_srpt,nvme_rdma,ib_iser,ib_isert,rdma_ucm
iw_cm 61440 1 rdma_cm
ib_cm 155648 3 rdma_cm,ib_ipoib,ib_srpt
ib_uverbs 200704 2 rdma_ucm,mlx5_ib
ib_core 516096 13 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,nvme_rdma,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

lsmod | grep mlx5

mlx5_ib 491520 0
ib_uverbs 200704 2 rdma_ucm,mlx5_ib
ib_core 516096 13 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,nvme_rdma,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core 2666496 1 mlx5_ib
psample 16384 2 openvswitch,mlx5_core
tls 151552 2 bonding,mlx5_core
pci_hyperv_intf 12288 1 mlx5_core

Any ideas would be greatly appreciated, including hints for debugging.

A follow-up.

I believe there may be a hardware problem, since:

  1. I’ve swapped the CX-5 boards with another box, and they work fine there.

  2. I’ve moved the problem server to known working ports on the switch. Error 881 persists.

  3. I’ve installed plain Debian 12.12 (bookworm) instead of Alt Linux. Error 881 persists. Why not Trixie? Because of PVE.

  4. I’ve compared the PCI settings in BIOS with the other systems. They look the same, and the problem still persists.

Debian 12 has mstflint 4.21.0. Is it possible to detect a PCI misconfiguration, or to isolate a PCI problem from the Mellanox CX-5 point of view, using debug or trace facilities?
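
One low-level check that needs no vendor tooling is to compare the negotiated PCIe link against what the card advertises. A sketch, assuming `lspci -vv` (run as root) prints `LnkCap:` and `LnkSta:` lines for the device; the helper name `pcie_link_lines` is made up:

```shell
# Keep only the PCIe link capability/status lines from `lspci -vv` output
# and strip their leading indentation.
pcie_link_lines() {
    grep -E 'LnkCap:|LnkSta:' | sed 's/^[[:space:]]*//'
}

# Usage, with the bus address reported by ethtool -i:
#   lspci -vv -s 21:00.0 | pcie_link_lines
```

A `LnkSta` speed or width lower than `LnkCap` would point towards the slot, riser, or a BIOS link setting rather than the driver.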

#echo 1 > /sys/kernel/debug/tracing/events/enable

This seems to have had no effect.
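
Writing to `events/enable` only turns the events on; the records still have to be read from the `trace` file in the same directory. A narrower sketch that enables only selected event groups when they exist; the group names `nvme` and `rdma_cma` are assumptions about what this kernel provides, and the function name is hypothetical:

```shell
# Enable the given ftrace event groups under a tracing directory,
# reporting groups that this kernel does not expose.
enable_trace_groups() {
    tracing_dir=$1
    shift
    for group in "$@"; do
        f="$tracing_dir/events/$group/enable"
        if [ -e "$f" ]; then
            echo 1 > "$f"
            echo "enabled $group"
        else
            echo "missing $group"
        fi
    done
}

# Usage as root (group names are assumptions, list events/ first):
#   enable_trace_groups /sys/kernel/debug/tracing nvme rdma_cma
#   nvme discover -t rdma -a 172.16.0.35
#   cat /sys/kernel/debug/tracing/trace
```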

root@debian:~# mstfwtrace -d 01:00.0 -i all
Read old events:

There is no output at all, just an empty list.

After some investigation, we came to believe this condition may be related to a hardware PCI problem. We are now talking to the supplier about replacing the server motherboard.

UPD. The supplier delivered a new motherboard for the server. But after the replacement was done and the server was back in the rack, the problem still persists.

This should have been done first, but I ended up doing it last: I swapped the disk set from an identical, flawlessly working node into the problem node.

And, voila, the OS runs perfectly on the problem node. No errors in dmesg. So I have to conclude that this is definitely a software problem. Error 881 (I/O error on /dev/nvme-fabrics) also shows up on the fully functional hardware of the second node.

A new hypothesis to try:

  1. There may be a difference in hardware setup between the “failed” and “functional” nodes.
  2. The OS (Linux), while being installed, builds its (kernel?) configuration around these differences and uses it for all packages installed later.
  3. There could be differences in BIOS setup (many modern BIOSes have an enormous number of non-obvious, unclear settings with no embedded help).
  4. The BIOS may have hidden settings that vary among “identical” hardware but cannot be changed from setup (hello, AMI).
  5. Some other conspiracy-style speculation could go here, but I'll stop.

I believe comparing dmidecode output between the nodes may help.
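
A small sketch of how such a comparison could be done, assuming dmidecode is run as root on both nodes; the helper name `normalize_dmi` and the file names are placeholders, and the list of filtered fields is only a guess at what differs per machine:

```shell
# Drop per-machine identifiers from dmidecode output so that two nodes
# can be compared with diff. Reads stdin, writes stdout.
normalize_dmi() {
    grep -v -E '^[[:space:]]*(Serial Number|UUID|Asset Tag):'
}

# Usage (file names are placeholders):
#   dmidecode | normalize_dmi > good-node.txt    # on the working node
#   dmidecode | normalize_dmi > bad-node.txt     # on the failing node
#   diff -u good-node.txt bad-node.txt
```

Anything left in the diff (BIOS version, memory population, slot descriptions) is a candidate for the "identical but not quite" differences listed above.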

stay tuned…

After investigation, I have to believe this problem is closely related to /dev/nvme-fabrics and should be considered a software problem.

We have changed motherboards and disks (SATA SSDs from two vendors, Samsung and Intel, with no NVMe disks in the system, and also a Samsung M.2 NVMe), and even after that we still get error 881.

We have tried at least two OSes, Debian 12 and Ubuntu 18.04.6 (the latter because of the Samsung DC Toolkit for NVMe disks), and we still get the same result.

Any ideas are highly appreciated.