Getting vender_err 87

Hi,

In one of our setups, after changing the number of jobs from 4 to 8, the host crashes with the error below. Any inputs on this?

[ 145.717020] 3cq completion failed with wr_id 0 status 13 opcode 1 vender_err 87

[ 145.717505] ERROR EXIT nvmeof_rdma_cq_event_handler

[ 145.718038] 3cq completion in ERROR state
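
For reference, status 13 in a work completion corresponds to IBV_WC_RNR_RETRY_EXC_ERR (IB_WC_RNR_RETRY_EXC_ERR in the kernel), which matches the RNR explanation later in this thread. A minimal userspace sketch of decoding such a completion, purely for illustration (the kernel handler named above reports the equivalent fields from struct ib_wc):

```c
/* Illustrative only: decode a failed completion with libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* status 13 == IBV_WC_RNR_RETRY_EXC_ERR: the remote receive
             * queue had no buffer posted and the RNR retry count ran out. */
            fprintf(stderr,
                    "cq completion failed: wr_id %llu status %d (%s) opcode %d vendor_err 0x%x\n",
                    (unsigned long long)wc.wr_id, wc.status,
                    ibv_wc_status_str(wc.status), (int)wc.opcode,
                    (unsigned)wc.vendor_err);
        }
    }
}
```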

Hi Rama,

So if you are not using iser, srp, or NVMe-oF, then what protocol are you using to talk to your SSDs?

Thank you,

Sophie.

Hi Rama,

What OS, Kernel and driver version are you using? (modinfo mlx4_core | grep -i version).

Have you seen and followed these documents:

HowTo Compile Linux Kernel for NVMe over Fabrics https://community.mellanox.com/s/article/howto-compile-linux-kernel-for-nvme-over-fabrics

HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics

What is the last trace generated in the messages file prior to the crash?

Are you getting the same result with any number of jobs above 4? (i.e., 5, 6, 7)

vender_err 87 indicates that the RNR NAK retry count was exceeded, which terminates the QP (receiver not ready (RNR) error).
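
For reference, the RNR retry behaviour is a QP attribute: with librdmacm it is the rnr_retry_count field of struct rdma_conn_param, and with plain libibverbs it is set during the QP state transitions. A minimal, illustrative sketch of the verbs side (not the configuration in use here):

```c
/* Illustrative only: where the RNR retry knob lives when a QP is set up
 * by hand with libibverbs (a kernel RDMA ULP does the equivalent with
 * ib_modify_qp). Values shown are common defaults, not a recommendation. */
#include <infiniband/verbs.h>

static int move_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,
        .retry_cnt     = 7,
        .rnr_retry     = 7,   /* 7 means retry indefinitely on RNR NAK; a
                                 smaller value lets the sender give up and
                                 complete with IBV_WC_RNR_RETRY_EXC_ERR */
        .sq_psn        = 0,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```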

Regards,

Sophie.

We are using the mlx4 driver; the kernel version is 3.17 on RHEL.

I am not using the NVMe-oF drivers mentioned in the documents below.

HowTo Compile Linux Kernel for NVMe over Fabrics https://community.mellanox.com/s/article/howto-compile-linux-kernel-for-nvme-over-fabrics

HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics

Hi Rama,

I am a little confused. What type of HCA cards are installed on the initiator/target, and are you using the inbox mlx* driver from RHEL?

You posted kernel version 3.17, but what OS version? (more /etc/issue or /etc/redhat-release).

This error correlates to buffer/memory allocation, which could possibly be a FW issue on the HCA cards.

Based on the HCA cards, what FW are they running?
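
Besides tools such as ibstat or ibv_devinfo, the FW version can also be read through the verbs API; a small sketch, for illustration only:

```c
/* Print the FW version reported by each RDMA device on the host
 * (the same information is shown by ibstat / ibv_devinfo). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list)
        return 1;

    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;

        if (ctx && !ibv_query_device(ctx, &attr))
            printf("%s: fw_ver %s\n",
                   ibv_get_device_name(list[i]), attr.fw_ver);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```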

Thank you,

Sophie.

Hi Rama,

You posted kernel version 3.17, but what OS version? (more /etc/issue or /etc/redhat-release).

Are you then using iser or srp for your configuration?

Thank you,

Sophie.

Hi Sophie,

Please find my answers inline:

What OS, Kernel and driver version are you using? (modinfo mlx4_core | grep -i version).

RHEL, kernel version 3.17

Have you seen and followed these documents:

HowTo Compile Linux Kernel for NVMe over Fabrics https://community.mellanox.com/s/article/howto-compile-linux-kernel-for-nvme-over-fabrics

HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics

[We are not referring to these docs.] (We are not using the standard Linux NVMe-oF drivers.)

What is the last trace generated in the messages file prior to the crash?

Are you getting the same result with any number of jobs above 4? (i.e., 5, 6, 7)

We are running 4 or more threads/jobs and getting into this situation.

vender_err 87 indicates that the RNR NAK retry count was exceeded, which terminates the QP (receiver not ready (RNR) error).

In which situation do we expect the receiver to flag RNR?

Is there an OFED / mlx4 driver dependency on this?

Or does the receiver not have sufficient CPU cycles?

Hi Rama,

Please disregard my statement about the type of HCAs, as it is not related here.

Also, what did you mean by "we are not working NVMe OF standard Linux drivers"?

Thank you,

Sophie.

Hi Rama,

Then can you please describe in detail your current configuration and which mlx* driver you are using?

Regards,

Sophie.

[root@xhdipsnvme1 ~]# cat /etc/redhat-release

Red Hat Enterprise Linux Workstation release 7.0 (Maipo)

Are you then using iser or srp for your configuration?

In our kernel configuration we made iser and srp loadable modules, but we are not using them in our testing.