NVMeOF SLES 12 SP3 :  Initiator with 36 cores unable to discover/connect to target

Hi,

I am trying NVMeOF with RoCE on SLES 12 SP3 using the document

HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics

I am noticing that whenever the initiator has more than 32 cores, it is unable to discover or connect to the target. The same procedure works fine when the number of cores is 32 or fewer.

The dmesg output on failure:

kernel: [ 373.418811] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

For a successful connection:

[51354.292021] nvme nvme0: creating 32 I/O queues.
[51354.879684] nvme nvme0: new ctrl: NQN "mcx", addr 192.168.0.1:4420

Is there a parameter that limits the number of cores the mlx5_core/nvme_rdma/nvmet_rdma drivers use, so that fewer I/O queues are created and discovery/connection succeeds? I can't disable cores or hyperthreading in the BIOS/UEFI because other applications are running on the host.

Appreciate any pointers/help!

We hit the same problem on SLES 12 SP3. We found that the SP3 release version has two issues that affect the NVMe-oF initiator.

First, kernel 4.4.73-5-default does not recognize the hostid argument, which causes the error message you observe. This was fixed in later updates; 4.4.92-6.18-default does not have the issue.
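A quick way to see which side of that fix a host is on is to compare the running kernel release against 4.4.92. This is only a rough sketch: the helper name kernel_at_least is made up for illustration, it relies on GNU sort -V, and it ignores the SUSE patch level after the first dash. The 4.4.92 threshold is simply the known-good version mentioned above, not necessarily the first fixed one.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if kernel release $1 is at least version $2.
# Only the part before the first "-" is compared (4.4.73-5-default -> 4.4.73).
kernel_at_least() {
    have=${1%%-*}
    want=$2
    # sort -V (GNU coreutils) orders version strings; the first line is the lowest.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

if kernel_at_least "$(uname -r)" 4.4.92; then
    echo "kernel is 4.4.92 or newer"
else
    echo "kernel is older than 4.4.92"
fi
```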

The second issue is in nvme-cli. As you may notice, the last character of the hostid is truncated: 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' has 35 characters instead of the 36 of a full UUID, and this causes the kernel module to reject the hostid argument. The root cause is in the nvme-cli patch that adds hostid support. It can be fixed with this simple patch to the nvme-cli source RPM:

diff -crB nvme-cli-v1.2/linux/nvme.h nvme-cli-v1.2.patched/linux/nvme.h
*** nvme-cli-v1.2/linux/nvme.h Thu Dec 7 09:42:00 2017
--- nvme-cli-v1.2.patched/linux/nvme.h Thu Dec 7 09:50:32 2017
***************
*** 23,29 ****
  /* However the max length of a qualified name is another size */
  #define NVMF_NQN_SIZE 223
! #define NVMF_HOSTID_SIZE 36
  #define NVMF_TRSVCID_SIZE 32
  #define NVMF_TRADDR_SIZE 256
  #define NVMF_TSAS_SIZE 256
--- 23,29 ----
  /* However the max length of a qualified name is another size */
  #define NVMF_NQN_SIZE 223
! #define NVMF_HOSTID_SIZE 37
  #define NVMF_TRSVCID_SIZE 32
  #define NVMF_TRADDR_SIZE 256
  #define NVMF_TSAS_SIZE 256
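The size bump makes sense if the ID is read into a C buffer of NVMF_HOSTID_SIZE bytes: a 36-byte buffer holds at most 35 characters plus the terminating NUL, while a UUID string is 36 characters long. To check whether the hostid that reaches the kernel is complete, something like the sketch below may help. The function name is_full_uuid is invented for illustration, and the id value is the one copied from the dmesg line in this thread.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if $1 is a full 36-character UUID
# in the usual 8-4-4-4-12 hex layout.
is_full_uuid() {
    printf '%s\n' "$1" | grep -Eq \
        '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# Value copied from the dmesg error message.
id='a61ecf3f-2925-49a7-9304-cea147f61ae'

echo "length: ${#id}"          # prints "length: 35"
if is_full_uuid "$id"; then
    echo "hostid is complete"
else
    echo "hostid is truncated or malformed"
fi
```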

Hi,

Adding the parameter didn’t help. It still gives the same error:

athena:~ # nvme discover -t rdma -a 192.168.0.1 -s 4420
Failed to write to /dev/nvme-fabrics: Invalid argument

athena:~ # dmesg |tail -1
[ 1408.720843] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

athena:~ # nvme connect -t rdma --nr-io-queues=32 -a 192.168.0.1 -s 4420 -n mcx
Failed to write to /dev/nvme-fabrics: Invalid argument

athena:~ # !dm
dmesg |tail -1
[ 1437.914081] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

Based on the error output you posted, I suggest adding an nvme connect option that is missing from your command: "--nr-io-queues"

This option specifies the number of I/O queues to allocate.

Have you tried this option?

For example: # nvme connect --transport=rdma --nr-io-queues=36 --trsvcid=4420 --traddr=10.0.1.14 --nqn=test-nvm

Otherwise, you will hit the default, which is "num_online_cpus" (the number of controller I/O queues that will be established), and this may explain the error you got:

"nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request"

Read more on that in the patch posting: [PATCH v2] nvme-cli/fabrics: Add nr_io_queues parameter to connect command
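As a rough illustration of that default, the sketch below prints the online CPU count and builds a connect command with the queue count capped. The helper name capped_queues is made up, and the cap of 32 is only an assumption taken from the observation at the top of the thread that 32 cores or fewer work. The address and NQN are the ones used earlier in the thread. The command is echoed rather than executed, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the smaller of the online CPU count and the cap $1.
capped_queues() {
    cpus=$(getconf _NPROCESSORS_ONLN)   # number of CPUs currently online
    if [ "$cpus" -gt "$1" ]; then
        echo "$1"
    else
        echo "$cpus"
    fi
}

q=$(capped_queues 32)   # 32 is an assumed cap, not a documented limit

# Only print the command; run it manually as root once it looks right.
echo nvme connect --transport=rdma --nr-io-queues="$q" \
    --trsvcid=4420 --traddr=192.168.0.1 --nqn=mcx
```

Whether capping the queue count avoids the error depends on the kernel and nvme-cli versions in use, so treat this as a diagnostic aid rather than a fix.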

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

default:
    pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
            p);
    ret = -EINVAL;
    goto out;

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Hope this helps