Hi everyone,
I’m facing a strange issue with NVMe-oF over RDMA that I haven’t been able to fix. TL;DR of the errors: "rdma connection establishment failed (-104)" in dmesg (-104 is ECONNRESET) and "Failed to write to /dev/nvme-fabrics: Connection reset by peer" on the host side.
I’m generally familiar with NVMe-oF setup and have been following this guide: EnterpriseSupport. My setup is a single node with multiple RDMA NICs and SSDs. With this configuration, I’ve been able to set up an NVMe-oF target on one NIC and connect to it as a host from another NIC on the same node. The setup and all my commands have worked multiple times before.
However, after a reboot today, the host can no longer connect with the same setup and scripts. The target setup completes without issues and the host can still discover the subsystem, but the actual connect fails, with error messages both from nvme-cli and in dmesg.
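For reference, the target side uses the usual nvmet configfs steps from the guide; here is a trimmed-down sketch of what the setup amounts to (the backing device path is just a placeholder; subsystem name, nsid, address and port are the ones from the logs below):
modprobe nvmet-rdma
# subsystem "zero" with one namespace (nsid 10; device path is only an example)
mkdir /sys/kernel/config/nvmet/subsystems/zero
echo 1 > /sys/kernel/config/nvmet/subsystems/zero/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/zero/namespaces/10
echo /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/zero/namespaces/10/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/zero/namespaces/10/enable
# RDMA port 0 bound to the target NIC
mkdir /sys/kernel/config/nvmet/ports/0
echo rdma > /sys/kernel/config/nvmet/ports/0/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/0/addr_adrfam
echo 192.168.1.20 > /sys/kernel/config/nvmet/ports/0/addr_traddr
echo 4420 > /sys/kernel/config/nvmet/ports/0/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/zero /sys/kernel/config/nvmet/ports/0/subsystems/zero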
$ nvme discover -t rdma -a 192.168.1.20 -s 4420
Discovery Log Number of Records 2, Generation counter 2
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: unrecognized
treq: not specified, sq flow control disable supported
portid: 0
trsvcid: 4420
subnqn: nqn.2014-08.org.nvmexpress.discovery
traddr: 192.168.1.20
rdma_prtype: not specified
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
=====Discovery Log Entry 1======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 0
trsvcid: 4420
subnqn: zero
traddr: 192.168.1.20
rdma_prtype: not specified
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
$ bash nvmeof_client.sh zero 192.168.1.20 4420 192.168.1.21 # my own connect script
Failed to write to /dev/nvme-fabrics: Connection reset by peer
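(The script is a thin wrapper around nvme connect; its arguments are subsystem NQN, target address, port, and initiator address, so the failing call is roughly equivalent to this, with -w being --host-traddr:)
$ nvme connect -t rdma -n zero -a 192.168.1.20 -s 4420 -w 192.168.1.21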
$ dmesg
[ 2322.783144] nvmet: adding nsid 10 to subsystem zero
[ 2322.785800] nvmet_rdma: enabling port 0 (192.168.1.20:4420)
[ 2327.757810] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:c6028c49-2d50-4692-8cbd-2679629e5a0a.
[ 2327.758099] nvme nvme10: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.1.20:4420
[ 2327.758248] nvme nvme10: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 2331.064689] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:c6028c49-2d50-4692-8cbd-2679629e5a0a.
[ 2331.064957] nvme nvme10: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.1.20:4420
[ 2331.065117] nvme nvme10: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 2334.104665] nvme nvme10: rdma connection establishment failed (-104)
I found a related issue (https://stackoverflow.com/questions/78981318/nvme-discover-failed-failed-fo-write-to-dev-nvme-fabrics-connection-reset-by), but in my case the RDMA interfaces sit on their own subnet (192.168.1.0/24), separate from the regular Ethernet addresses, so I don’t think it applies here. For reference, here is the address list (this node has 8 RDMA NICs; see also the rdma link check after the list):
$ ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp226s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 5c:ff:35:fb:ae:93 brd ff:ff:ff:ff:ff:ff
3: enp97s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:ec:ef:b4:74:52 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.51/24 brd 10.0.2.255 scope global enp97s0f0
valid_lft forever preferred_lft forever
inet6 2001:41b8:830:16e2:3eec:efff:feb4:7452/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 86399sec preferred_lft 14399sec
inet6 fe80::3eec:efff:feb4:7452/64 scope link
valid_lft forever preferred_lft forever
4: enp97s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 3c:ec:ef:b4:74:53 brd ff:ff:ff:ff:ff:ff
5: enx060d30559abe: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 06:0d:30:55:9a:be brd ff:ff:ff:ff:ff:ff
6: enp12s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:d1:ba brd ff:ff:ff:ff:ff:ff
inet 192.168.1.20/24 scope global enp12s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:d1ba/64 scope link
valid_lft forever preferred_lft forever
7: enp18s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:e4:22 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.21/24 scope global enp18s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:e422/64 scope link
valid_lft forever preferred_lft forever
8: enp75s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:d7:aa brd ff:ff:ff:ff:ff:ff
inet 192.168.1.22/24 scope global enp75s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:d7aa/64 scope link
valid_lft forever preferred_lft forever
9: enp84s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:d9:46 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.23/24 scope global enp84s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:d946/64 scope link
valid_lft forever preferred_lft forever
10: enp141s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:e4:3a brd ff:ff:ff:ff:ff:ff
inet 192.168.1.24/24 scope global enp141s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:e43a/64 scope link
valid_lft forever preferred_lft forever
11: enp148s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:e4:12 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.25/24 scope global enp148s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:e412/64 scope link
valid_lft forever preferred_lft forever
12: enp186s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:da:12 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.26/24 scope global enp186s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:da12/64 scope link
valid_lft forever preferred_lft forever
13: enp204s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc mq state UP group default qlen 1000
link/ether b8:ce:f6:16:d9:5e brd ff:ff:ff:ff:ff:ff
inet 192.168.1.27/24 scope global enp204s0np0
valid_lft forever preferred_lft forever
inet6 fe80::bace:f6ff:fe16:d95e/64 scope link
valid_lft forever preferred_lft forever
14: enp225s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:2d:08:06 brd ff:ff:ff:ff:ff:ff
15: enp225s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:2d:08:07 brd ff:ff:ff:ff:ff:ff
16: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:50:6d:46:cd brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:50ff:fe6d:46cd/64 scope link
valid_lft forever preferred_lft forever
18: veth953756b@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether b6:f5:f5:21:b7:2b brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::b4f5:f5ff:fe21:b72b/64 scope link
valid_lft forever preferred_lft forever
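For completeness, the mapping between the RDMA devices and these netdevs can be double-checked with the iproute2 rdma tool and libibverbs (I can post my output if it helps; the device names reported there, e.g. mlx5_*, are whatever the driver assigns):
$ rdma link show
$ ibv_devinfo | grep -E 'hca_id|state'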
I also checked the firewall:
$ systemctl status ufw
● ufw.service - Uncomplicated firewall
Loaded: loaded (/lib/systemd/system/ufw.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2025-12-15 17:50:45 UTC; 1h 1min ago
Docs: man:ufw(8)
Process: 7770 ExecStart=/lib/ufw/ufw-init start quiet (code=exited, status=0/SUCCESS)
Main PID: 7770 (code=exited, status=0/SUCCESS)
CPU: 2ms
Dec 15 17:50:45 dgx01.lab.<...>.de systemd[1]: Starting Uncomplicated firewall...
Dec 15 17:50:45 dgx01.lab.<...>.de systemd[1]: Finished Uncomplicated firewall.
$ ufw status verbose
Status: inactive
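The fabrics modules are clearly loaded (see the nvmet/nvmet_rdma messages in the dmesg output above), but for completeness, a quick host-side sanity check looks like this:
$ lsmod | grep -E 'nvme_rdma|nvmet_rdma|rdma_cm'
$ ls -l /dev/nvme-fabrics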
I can provide more details about the setup if necessary. Thanks!