I can only connect 9 NVMe devices; when I try to connect the 10th device, it fails

Hi all,
I have a problem with my NVMe configuration. I have one dual-port adapter:
3b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
and I am using it to connect a J2000 JBOF storage enclosure. OFED is installed on the system, and here is the output of

lsmod | grep nvme
nvme_tcp 45056 0
nvme 61440 2
nvme_rdma 49152 0
rdma_cm 139264 4 beegfs,rpcrdma,nvme_rdma,rdma_ucm
nvme_fabrics 28672 2 nvme_tcp,nvme_rdma
nvme_core 143360 6 nvme_tcp,nvme,nvme_rdma,nvme_fabrics
ib_core 479232 11 beegfs,rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat 69632 19 beegfs,rdma_cm,ib_ipoib,mlxdevm,nvme_tcp,rpcrdma,nvme,nvme_rdma,iw_cm,nvme_core,svcrdma,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
t10_pi 16384 2 sd_mod,nvme_core

My problem is that on the storage side there are 16 NVMe disks, and I can connect any 9 of the 16 without any issue:
nvme connect -t rdma -a 192.168.XXX.XXX -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.01
When I try to attach the 10th NVMe disk, I get:

[root@headnode ~]# nvme connect -t rdma -a 192.168.XXX.XXX -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.01
Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device

Why can't I connect the 10th disk? Is there any restriction?

Does dmesg also report “Connect command failed, error wo/DNR bit: 6”?

Unload the nvme module:
modprobe -rv nvme

Reload it with this parameter:
modprobe -v nvme num_p2p_queues=1

Note: If you are planning to configure high availability (e.g. using multipath), you’ll need to set this parameter to 2 (one for each NVMe-oF port + subsystem pair).

Verify:
cat /sys/module/nvme/parameters/num_p2p_queues
cat /sys/block/nvme#n#/device/num_p2p_queues

Attempt to re-connect
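
For reference, the whole sequence would look roughly like this (a minimal sketch; <target-ip> and <subsystem-nqn> are placeholders for your own target address and subsystem NQN):

modprobe -rv nvme                                        # unload the local NVMe driver
modprobe -v nvme num_p2p_queues=1                        # reload with one peer-to-peer queue reserved
cat /sys/module/nvme/parameters/num_p2p_queues           # expect: 1
nvme connect -t rdma -a <target-ip> -n <subsystem-nqn>   # retry the failing connect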

Hi spruitt,

Here are the last few lines from my dmesg output. There are already 4 NVMe devices on the headnode, and they are attached automatically while the system boots. Then I start adding the remote disks. While adding disk 7 and disk 8, the dmesg output shows the queue mappings. When I try to add the 10th disk, it fails with:
[ 361.301022] nvme nvme13: Connect command failed: controller is busy or not available
[ 361.303074] nvme nvme13: failed to connect queue: 0 ret=385

The whole response is here:

[ 332.487653] nvme nvme10: queue_size 128 > ctrl sqsize 15, clamping down
[ 332.488177] nvme nvme10: creating 63 I/O queues.
[ 333.836460] nvme nvme10: mapped 63/0/0 default/read/poll queues.
[ 340.125677] nvme nvme10: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.07", addr 192.168.18.252:4420
[ 344.453864] nvme nvme11: Shutdown timeout set to 16 seconds
[ 344.528021] nvme nvme11: queue_size 128 > ctrl sqsize 15, clamping down
[ 344.528423] nvme nvme11: creating 63 I/O queues.
[ 345.878678] nvme nvme11: mapped 63/0/0 default/read/poll queues.
[ 352.165793] nvme nvme11: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.08", addr 192.168.18.252:4420
[ 356.264408] nvme nvme12: Shutdown timeout set to 16 seconds
[ 356.358576] nvme nvme12: queue_size 128 > ctrl sqsize 15, clamping down
[ 356.359210] nvme nvme12: creating 8 I/O queues.
[ 356.530318] nvme nvme12: mapped 8/0/0 default/read/poll queues.
[ 357.327363] nvme nvme12: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.09", addr 192.168.18.252:4420
[ 361.301022] nvme nvme13: Connect command failed: controller is busy or not available
[ 361.303074] nvme nvme13: failed to connect queue: 0 ret=385

By the way,
I cannot unload the first 4 NVMe devices because they are on board, and when I unmount them they are immediately mounted again automatically. For this reason I cannot apply the
Unload the nvme module:
modprobe -rv nvme

Reload it with this parameter:
modprobe -v nvme num_p2p_queues=1

steps that you pointed out. Is there any way to push this parameter to the kernel and make it available during boot?

Create /etc/modprobe.d/nvme.conf with the line options nvme num_p2p_queues=1, rebuild the initramfs with dracut -f, and reboot.
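
In practice, something like this (a minimal sketch, assuming a dracut-based distribution):

echo "options nvme num_p2p_queues=1" > /etc/modprobe.d/nvme.conf   # module option applied whenever nvme loads
dracut -f                                                          # rebuild the initramfs so the option is picked up at boot
reboot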

After the reboot, the parameter is in place.

Hi spruitt,

Now the parameter is in place:

[root@headnode ~]# cat /sys/module/nvme/parameters/num_p2p_queues
1

However, when I try to add the 10th disk, it fails again.

[root@headnode ~]# nvme connect -t rdma -a 192.168.18.252 -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.10
Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device

Here is the /var/log/messages output:

Mar 9 13:37:10 headnode kernel: nvme nvme13: Connect command failed: controller is busy or not available
Mar 9 13:37:10 headnode kernel: nvme nvme13: failed to connect queue: 0 ret=385

What should I do next?

Hi there. I would like to bump this thread. I have exactly the same issue with a J2000 connected directly via an HPE IB HDR100/EN 100G 2p 940QSFP56 adapter:
I can connect to 3 drives from my Proxmox host, but slots 4, 5, 6, 7 and 8 are “unreachable”, with exactly the same error output.

Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device
…
[17799.415021] nvme nvme11: Connect command failed: controller is busy or not available
[17799.416805] nvme nvme11: failed to connect queue: 0 ret=385

Any suggestions?
@user88899, did you solve this problem?

Hi,

Can you attach the full dmesg from the client and from the target on which you have the issue?
Thanks

After a bit of research, I found that the approved J2000 drives can each support 63 I/O queues, while the IOM supports only 128 I/O queues in total (probably per QSFP port). That means that to connect all 8 drives on the same IOM port (slot), nvme connect should be run with the flag “-i 16”, limiting each connected NVMe-oF device to 16 I/O queues (128 / 8 = 16) so that all 8 drives can connect successfully. So my final config, connected to both IOMs with 16 queues per disk, now looks like this, and it works great.

Hope this helps other users experiencing the same problem.

ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.01 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.02 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.03 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.04 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.05 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.06 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.07 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.254.188.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.08 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.01 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.02 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.03 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.04 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.05 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.06 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.07 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
ExecStart=/usr/sbin/nvme connect -t rdma -a 169.10.187.44 -s 4420 -n nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.08 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 -i 16
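
For anyone scripting this outside a systemd unit, a loop like the one below should generate the same connect calls (a sketch only; the IOM addresses, the enclosure serial in the NQNs, and the host NQN are the ones from my setup and need to be replaced with yours):

# connect all 8 drive slots through both IOM ports, limited to 16 I/O queues each
for addr in 169.254.188.44 169.10.187.44; do
    for slot in 01 02 03 04 05 06 07 08; do
        /usr/sbin/nvme connect -t rdma -a "$addr" -s 4420 \
            -n "nqn.2015-11.com.hpe:nvme.j2000.mx6138001b.$slot" \
            --hostnqn=nqn.2014-08.org.nvmexpress:uuid:35323550-3236-5a43-3233-343530394835 \
            -i 16
    done
done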