I can only connect 9 NVMe devices; when I try to connect the 10th device, it fails

Hi all,
I have a problem with my NVMe configuration. I have one
3b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
with 2 ports, and I am using it to connect a J2000 JBOF storage enclosure. OFED is installed on the system, and here is the result of

lsmod | grep nvme
nvme_tcp 45056 0
nvme 61440 2
nvme_rdma 49152 0
rdma_cm 139264 4 beegfs,rpcrdma,nvme_rdma,rdma_ucm
nvme_fabrics 28672 2 nvme_tcp,nvme_rdma
nvme_core 143360 6 nvme_tcp,nvme,nvme_rdma,nvme_fabrics
ib_core 479232 11 beegfs,rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat 69632 19 beegfs,rdma_cm,ib_ipoib,mlxdevm,nvme_tcp,rpcrdma,nvme,nvme_rdma,iw_cm,nvme_core,svcrdma,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
t10_pi 16384 2 sd_mod,nvme_core

My problem is that on the storage side there are 16 NVMe disks, and I can connect any 9 of the 16 without issue:
nvme connect -t rdma -a 192.168.XXX.XXX -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.01
When I try to attach the 10th NVMe disk, I get:

[root@headnode ~]# nvme connect -t rdma -a 192.168.XXX.XXX -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.01
Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device

Why can I not connect the 10th disk? Is there any restriction?

Does dmesg also report "Connect command failed, error wo/DNR bit: 6"?
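For example, a quick way to check the recent NVMe kernel messages (a generic command, nothing setup-specific):

dmesg | grep -i nvme | tail -n 20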

Unload the nvme module:
modprobe -rv nvme

Reload it with this parameter:
modprobe -v nvme num_p2p_queues=1

Note: If you are planning to configure high availability (e.g. using multipath), you'll need to set this parameter to 2 (1 for each NVMe-oF port + subsystem pair).
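For example, the reload in that HA case would be (a sketch using the same commands as above, just with the higher queue count):

modprobe -rv nvme
modprobe -v nvme num_p2p_queues=2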

Verify:
cat /sys/module/nvme/parameters/num_p2p_queues
cat /sys/block/nvme#n#/device/num_p2p_queues

Attempt to re-connect
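For example (reusing the connect command from your post; nvme list is the standard nvme-cli command to confirm the new controller appears):

nvme connect -t rdma -a 192.168.XXX.XXX -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.10
nvme list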

Hi spruitt

Here are the last few lines from my dmesg output. There are already 4 NVMe devices on the headnode; they are mounted automatically while the system boots. Then I start adding the remote disks. While adding disk 7 and disk 8, the dmesg output shows the queue mappings. When I try to add the 10th disk, it fails with:
[ 361.301022] nvme nvme13: Connect command failed: controller is busy or not available
[ 361.303074] nvme nvme13: failed to connect queue: 0 ret=385

The whole response is here:

[ 332.487653] nvme nvme10: queue_size 128 > ctrl sqsize 15, clamping down
[ 332.488177] nvme nvme10: creating 63 I/O queues.
[ 333.836460] nvme nvme10: mapped 63/0/0 default/read/poll queues.
[ 340.125677] nvme nvme10: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.07", addr 192.168.18.252:4420
[ 344.453864] nvme nvme11: Shutdown timeout set to 16 seconds
[ 344.528021] nvme nvme11: queue_size 128 > ctrl sqsize 15, clamping down
[ 344.528423] nvme nvme11: creating 63 I/O queues.
[ 345.878678] nvme nvme11: mapped 63/0/0 default/read/poll queues.
[ 352.165793] nvme nvme11: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.08", addr 192.168.18.252:4420
[ 356.264408] nvme nvme12: Shutdown timeout set to 16 seconds
[ 356.358576] nvme nvme12: queue_size 128 > ctrl sqsize 15, clamping down
[ 356.359210] nvme nvme12: creating 8 I/O queues.
[ 356.530318] nvme nvme12: mapped 8/0/0 default/read/poll queues.
[ 357.327363] nvme nvme12: new ctrl: NQN "nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.09", addr 192.168.18.252:4420
[ 361.301022] nvme nvme13: Connect command failed: controller is busy or not available
[ 361.303074] nvme nvme13: failed to connect queue: 0 ret=385

By the way,
I cannot unmount the first 4 NVMe devices because they are on board, and when I unmount them they are immediately mounted again automatically. Because of this, I cannot apply the
Unload the nvme module:
modprobe -rv nvme

Reload it with this parameter:
modprobe -v nvme num_p2p_queues=1

steps that you pointed out. Is there any way to push this parameter to the kernel and make it available during boot?

Create /etc/modprobe.d/nvme.conf, add options nvme num_p2p_queues=1, run dracut -f, and reboot.
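As a concrete sketch of those steps (the file path and option come from the line above; echo is just one way to create the file):

echo "options nvme num_p2p_queues=1" > /etc/modprobe.d/nvme.conf
dracut -f    # rebuild the initramfs so the option is picked up at boot
reboot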

After the reboot, the parameter is in place.


Hi Spruitt

Now the parameter is in place.

[root@headnode ~]# cat /sys/module/nvme/parameters/num_p2p_queues
1

However, when I try to add the 10th disk, it fails again.

[root@headnode ~]# nvme connect -t rdma -a 192.168.18.252 -n nqn.2015-11.com.hpe:nvme.j2000.mx6109008c.10
Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device

Here is the /var/log/messages output:

Mar 9 13:37:10 headnode kernel: nvme nvme13: Connect command failed: controller is busy or not available
Mar 9 13:37:10 headnode kernel: nvme nvme13: failed to connect queue: 0 ret=385

What should I do next?