Error 881 in Mellanox CX-5, multiple errors in mlx5_poll_one()

Hello!

I have two identical nodes with Mellanox CX-5 (MCX516A-CDAT) NICs.
All network settings are identical, and I have already tried changing and swapping everything.
Environment:

# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.4
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"


# uname -r   # PVE kernel
6.17.13-2-pve


# ethtool -i ens4f0np0
driver: mlx5_core
version: 6.17.13-2-pve
firmware-version: 16.35.8008 (MT_0000000013)
expansion-rom-version:
bus-info: 0000:21:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# ethtool -i ens9f0np0
driver: mlx5_core
version: 6.17.13-2-pve
firmware-version: 16.35.8008 (MT_0000000013)
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

set up nvme kernel debug:
echo "module rdma_cm +p" > /sys/kernel/debug/dynamic_debug/control
echo "module nvme_rdma +p" > /sys/kernel/debug/dynamic_debug/control
echo "module mlx5_ib +p" > /sys/kernel/debug/dynamic_debug/control

In another session, start:

dmesg -w

Then try NVMe discovery by issuing:
nvme discover -t rdma -a 172.16.0.35

The dmesg session shows the following on the working node:
[ 309.553040] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x406
[ 309.553685] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x407
[ 309.554249] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x408
[ 309.554785] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x409
[ 309.555255] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40a
[ 309.555723] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40b
[ 309.556268] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40c
[ 309.556863] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40d
[ 309.557349] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40e
[ 309.557800] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x40f
[ 309.558289] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x410
[ 309.558781] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x411
[ 309.559267] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x412
[ 309.559748] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x413
[ 309.560279] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x414
[ 309.560977] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x415
[ 309.561469] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x416
[ 309.561957] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x417
[ 309.562483] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x418
[ 309.562974] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x419
[ 309.563459] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41a
[ 309.563941] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41b
[ 309.564475] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41c
[ 309.564971] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41d
[ 309.565457] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41e
[ 309.565932] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x41f
[ 309.566453] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x420
[ 309.566943] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x421
[ 309.567430] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x422
[ 309.567950] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x423
[ 309.568641] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x424
[ 309.569124] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x425
[ 309.569608] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x426
[ 309.570086] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x427
[ 309.570579] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x428
[ 309.571064] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x429
[ 309.571530] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42a
[ 309.572002] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42b
[ 309.572508] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42c
[ 309.572987] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42d
[ 309.573453] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42e
[ 309.573942] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x42f
[ 309.574481] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x430
[ 309.574966] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x431
[ 309.575432] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x432
[ 309.575905] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x433
[ 309.576437] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x434
[ 309.576916] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x435
[ 309.577421] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x436
[ 309.577893] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x437
[ 309.578401] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x438
[ 309.578880] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x439
[ 309.579399] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43a
[ 309.579863] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43b
[ 309.580372] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43c
[ 309.580853] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43d
[ 309.581324] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43e
[ 309.581806] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x43f
[ 309.582345] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x440
[ 309.582817] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x441
[ 309.583404] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x442
[ 309.583870] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x443
[ 309.584394] infiniband rocep33s0f0: mlx5_ib_create_cq:1034:(pid 2163): cqn 0x444
[ 309.584401] infiniband rocep33s0f0: calc_sq_size:601:(pid 2163): wqe_size 256
[ 309.585843] infiniband rocep33s0f0: create_qp:3149:(pid 2163): QP type 2, ib qpn 0x133D, mlx qpn 0x133d, rcqn 0x406, scqn 0x406, ece 0x0
[ 309.585854] infiniband rocep33s0f0: get_tx_affinity:4061:(pid 2163): Set tx affinity 0x2 to qpn 0x133d
[ 309.595898] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1327): polled software generated completion on CQ 0x402
[ 309.597544] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1327): polled software generated completion on CQ 0x402
[ 309.598285] nvme nvme0: queue_size 128 > ctrl sqsize 64, clamping down
[ 309.598296] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.0.35:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:cb8002e8-0929-4b34-aefa-6ec66ebfc1a4
[ 309.598891] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 309.625907] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1327): polled software generated completion on CQ 0x402
[ 309.626261] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626266] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf5
[ 309.626396] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626400] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626402] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626404] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626406] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626407] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626409] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626411] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626413] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626414] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626416] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626418] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626419] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626421] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626423] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626424] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626426] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626428] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626429] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626431] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626433] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626434] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626436] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626438] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626439] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626441] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626443] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626444] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626446] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626448] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626449] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626451] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626454] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626455] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626457] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626459] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626460] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626462] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626464] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626465] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626467] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626469] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626470] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626472] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626474] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626475] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626477] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626479] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626480] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626482] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626484] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626485] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626487] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626489] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626490] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626492] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626494] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626495] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626497] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626499] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626500] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626502] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 309.626504] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Requestor error cqe on cqn 0x406:
[ 309.626505] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf5
[ 309.626606] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 309.626608] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9

But everything works OK on this node:

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  rdma
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  0
trsvcid: 4420
subnqn:  nqn.2020-02.huawei.nvme:nvm-subsystem-sn-XXXXXXXXXXXXXXXXXXXXXX
traddr:  172.16.0.35
eflags:  none
rdma_prtype: roce-v2
rdma_qptype: connected
rdma_cms:    rdma-cm
rdma_pkey: 0000
root@node3:~#


The problem node (node1) does not discover the NVMe target:
root@node1:~# nvme discover -t rdma -a 172.16.0.35 -vv
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos keyring tls_key fast_io_fail_tmo discovery dhchap_secret dhchap_ctrl_secret tls concat recovery_delay
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=172.16.0.35,trsvcid=4420,hostnqn=nqn.2014-08.org.nvmexpress:uuid:717a9176-ac73-4ea3-829e-e4ccf0b5735f,hostid=717a9176-ac73-4ea3-829e-e4ccf0b5735f,ctrl_loss_tmo=600'
**Failed to write to /dev/nvme-fabrics: Input/output error**
**failed to add controller, error failed to write to nvme-fabrics device**

In dmesg:

[356612.263017] nvme nvme0: I/O tag 0 (0000) opcode 0x7f (Fabrics Cmd) QID 0 timeout
[356612.263054] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[356612.263463] nvme nvme0: failed to connect queue: 0 ret=881

Detailed dmesg log from the problem node:
[ 4922.196036] nvme nvme0: address resolved (0): status 0 id 000000002ea79baa
[ 4922.196396] infiniband rocep33s0f0: calc_sq_size:601:(pid 14440): wqe_size 256
[ 4922.196729] infiniband rocep33s0f0: create_qp:3149:(pid 14440): QP type 2, ib qpn 0x133F, mlx qpn 0x133f, rcqn 0x406, scqn 0x406, ece 0x0
[ 4922.196739] infiniband rocep33s0f0: get_tx_affinity:4061:(pid 14440): Set tx affinity 0x2 to qpn 0x133f
[ 4922.205568] nvme nvme0: route resolved (2): status 0 id 000000002ea79baa
[ 4922.205636] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1333): polled software generated completion on CQ 0x402
[ 4922.206434] nvme nvme0: established (9): status 0 id 000000002ea79baa
[ 4922.206449] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1333): polled software generated completion on CQ 0x402
[ 4929.694236] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Requestor error cqe on cqn 0x406:
[ 4929.694245] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x15, vendor syndrome 0x81
[ 4929.694475] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694479] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf4
[ 4929.694552] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694556] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694558] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694559] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694561] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694563] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694565] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694566] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694568] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694570] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694571] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694573] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694575] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694576] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694578] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694580] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694581] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694583] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694585] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694586] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694588] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694590] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694591] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694593] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694595] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694597] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694598] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694600] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694602] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694603] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694605] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694607] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694612] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694614] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694615] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694618] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694619] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694621] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694623] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694625] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694627] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694628] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694630] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694632] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694634] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694636] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694638] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694640] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694642] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694644] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694646] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694648] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694650] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694652] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694654] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694656] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694658] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694659] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694662] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694663] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4929.694665] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4929.694667] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4983.306731] nvme nvme0: I/O tag 0 (0000) opcode 0x7f (Fabrics Cmd) QID 0 timeout
[ 4983.306795] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[ 4983.306978] infiniband rocep33s0f0: poll_soft_wc:595:(pid 1333): polled software generated completion on CQ 0x402
[ 4983.307079] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Requestor error cqe on cqn 0x406:
[ 4983.307084] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4983.307124] nvme nvme0: disconnected (10): status 0 id 000000002ea79baa
[ 4983.307128] nvme nvme0: disconnect received - connection closed
[ 4983.307285] infiniband rocep33s0f0: mlx5_poll_one:527:(pid 0): Responder error cqe on cqn 0x406:
[ 4983.307287] infiniband rocep33s0f0: mlx5_poll_one:530:(pid 0): syndrome 0x5, vendor syndrome 0xf9
[ 4983.307295] nvme nvme0: failed to connect queue: 0 ret=881

As a result, neither nvme discover nor nvme connect works on node1.

Hello @lookin,

Thank you for posting your query on NVIDIA Community.

From the logs you shared, the messages

  • mlx5_poll_one: Requestor/Responder error cqe on cqn ..., and

  • nvme nvme0: Connect command failed, error wo/DNR bit: 881

show that the NVMe-oF connect is failing at the RDMA transport layer on the problematic node, which prevents nvme discover / nvme connect from completing.
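
To compare the two nodes at a glance, the completion-error syndromes in the posted logs can be tallied. The following is only a minimal sketch: the function name tally_syndromes is made up for illustration, and it assumes the mlx5_ib dynamic debug output is enabled as shown in the question. As far as the upstream kernel's mlx5 headers define them, syndrome 0x05 is a work-request flush error, which is expected whenever a queue pair is torn down, and 0x15 is "transport retry counter exceeded", which appears only in the failing node's log. Please verify both against your own kernel source. Vendor syndromes are firmware-specific and are not decoded here.

```shell
#!/bin/sh
# Tally mlx5 CQE error syndromes from kernel log text read on stdin.
# Prints one line per distinct pair, most frequent first:
#   <count> syndrome 0xNN, vendor syndrome 0xNN
# Assumed meanings (from the upstream mlx5 headers, verify locally):
#   0x05 = work request flushed, 0x15 = transport retry counter exceeded
tally_syndromes() {
    grep -o 'syndrome 0x[0-9a-fA-F]*, vendor syndrome 0x[0-9a-fA-F]*' \
        | sort \
        | uniq -c \
        | sort -rn \
        | sed 's/^ *//'
}

# Usage on a live system:
#   dmesg | tally_syndromes
# Or on a saved log file:
#   tally_syndromes < /tmp/node1-dmesg.txt
```

Run against the failing node's log, this should surface the single 0x15 / 0x81 entry among the many flush errors, which makes it easier to see where the two nodes actually differ.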

Please check the following:

  1. Confirm that the failing node is using a supported software stack combination for ConnectX-5 and NVMe-oF over RoCE:

    • OS and kernel version

    • RDMA / NVMe-oF stack (mlx5_core, nvme_rdma, rdma_cm)

    • NVIDIA driver version

    • HCA firmware version

    • Switch firmware and cabling

  2. Match configuration with the working node
    Even if the nodes are intended to be identical, carefully compare all settings between the working and failing host, including:

    • MTU (NICs and switch ports)

    • VLAN / priority / PFC / ECN (if used)

    • IP addressing, routing, and any firewall/security rules

    • RoCE- or RDMA-related sysctls and driver tunables

  3. Basic RDMA connectivity & target documentation
    Ensure basic RDMA connectivity is healthy between the failing host and the NVMe-oF target using standard RDMA test tools, and review the NVMe-oF target vendor’s documentation for any specific initiator requirements (supported kernel/driver/FW levels, queue parameters, etc.).
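
One way to carry out the comparison in step 2 is to capture the same set of command outputs on both hosts and diff them. The script below is a rough sketch rather than a supported tool: the helper names run_section and snapshot_node are hypothetical, the interface name is taken from the ethtool output in the question and may differ on your hosts, and the command list is only a starting point that assumes ethtool and iproute2 (including the rdma utility) are installed.

```shell
#!/bin/sh
# Print a labelled section: the command line, then its output (stderr included).
# A failing or missing command is reported but does not abort the snapshot.
run_section() {
    printf '### %s\n' "$*"
    "$@" 2>&1 || printf '(command failed or not available)\n'
}

# Collect link, driver and RDMA settings for one interface.
# $1 is the network interface name on this host (an assumption per node).
snapshot_node() {
    iface=$1
    run_section uname -r
    run_section ethtool -i "$iface"                # driver and firmware version
    run_section ethtool -a "$iface"                # pause (flow control) settings
    run_section ethtool --show-priv-flags "$iface" # driver private flags
    run_section ip -d link show dev "$iface"       # MTU, VLAN and link details
    run_section ip addr show dev "$iface"
    run_section ip route show
    run_section rdma link show                     # RDMA device to netdev mapping
}

# Usage (run on each node, then compare the two files on one machine):
#   snapshot_node ens4f0np0 > /tmp/node3.txt    # working node
#   snapshot_node ens4f0np0 > /tmp/node1.txt    # failing node
#   diff -u /tmp/node3.txt /tmp/node1.txt
```

MAC and IP addresses will naturally differ between the two files; any other difference (MTU, pause settings, private flags, routes) is worth a closer look.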

If you still experience issues after confirming that you are using a fully supported and aligned software/firmware stack and configuration, a valid support entitlement for the HCA in use will be needed to perform additional troubleshooting.

If there is an active entitlement/support contract in place, please do not hesitate to open a support ticket by logging into the ESP Portal and submitting a new case.

For contracts, please reach out to Networking-Contracts@nvidia.com.

Thanks,
NVEX Networking Technical Support Team