Hi,
In one of our setups, after changing the number of jobs from 4 to 8, the host crashes with the error below. Any inputs on this?
[ 145.717020] 3cq completion failed with wr_id 0 status 13 opcode 1 vender_err 87
[ 145.717505] ERROR EXIT nvmeof_rdma_cq_event_handler
[ 145.718038] 3cq completion in ERROR state
Hi Rama,
So if you are not using iSER, SRP, nor NVMe-oF, then what protocol are you using to talk to your SSDs?
Thank you,
Sophie.
Hi Rama,
What OS, Kernel and driver version are you using? (modinfo mlx4_core | grep -i version).
Have you seen and followed these documents:
HowTo Compile Linux Kernel for NVMe over Fabrics https://community.mellanox.com/s/article/howto-compile-linux-kernel-for-nvme-over-fabrics
HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics
What is the last trace generated in the messages file prior to crash?
Are you getting the same result with any number of jobs above 4? (i.e. 5, 6, 7)
vender_err 87 indicates that the number of RNR NAKs exceeded the retry limit, terminating the QP (receiver not ready (RNR) error). Completion status 13 corresponds to IBV_WC_RNR_RETRY_EXC_ERR in the verbs status codes, which is consistent with this.
Regards,
Sophie.
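To pull the last kernel trace before the crash on RHEL 7, something like the following should work (a sketch; `journalctl -b -1` only works if the journal is persistent, otherwise fall back to /var/log/messages):

```shell
# Kernel messages from the previous boot, last lines before the crash
# (requires persistent journald storage)
journalctl -k -b -1 | tail -n 50
# Or search the classic messages file for the driver's error strings
grep -i 'cq completion' /var/log/messages | tail -n 20
```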
We are using the mlx4 driver; the kernel version is 3.17 on RHEL.
Hi Rama,
I am a little confused: what type of HCA cards are installed on the initiator/target, and are you using the mlx* inbox driver from RHEL?
You posted Kernel version 3.17 but what OS version? (more /etc/issue or /etc/redhat-release).
This error correlates to buffer/memory allocation, which could possibly be a FW issue on the HCA cards.
Based on the HCA cards, what is the FW running on them?
Thank you,
Sophie.
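For reference, on a typical RHEL setup the HCA firmware version can be read with the standard InfiniBand tools or straight from sysfs (a sketch; `mlx4_0` is a placeholder device name, and the tools come from the libibverbs-utils / infiniband-diags packages):

```shell
# Print the firmware version reported by the HCA
ibv_devinfo | grep -i fw_ver
ibstat | grep -i firmware
# Or read it directly from sysfs (replace mlx4_0 with your device)
cat /sys/class/infiniband/mlx4_0/fw_ver
```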
Hi Rama,
You posted Kernel version 3.17 but what OS version? (more /etc/issue or /etc/redhat-release).
Are you then using iser or srp for your configuration?
Thank you,
Sophie.
Hi Sophie,
Please find my answers inline
What OS, Kernel and driver version are you using? (modinfo mlx4_core | grep -i version).
RHEL, kernel version 3.17
Have you seen and followed these documents:
HowTo Compile Linux Kernel for NVMe over Fabrics https://community.mellanox.com/s/article/howto-compile-linux-kernel-for-nvme-over-fabrics
HowTo Configure NVMe over Fabrics https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics
[We are not referring to these docs.] (We are not using the standard Linux NVMe-oF drivers.)
What is the last trace generated in the messages file prior to crash?
Are you getting the same result with any number of jobs above 4? (i.e. 5, 6, 7)
We are running 4 or more threads/jobs and getting into this situation.
vender_err 87 indicates that the number of RNR NAKs exceeded the retry limit, terminating the QP (receiver not ready (RNR) error).
In which situations would we expect the receiver to flag RNR?
Is there an OFED / mlx4 driver dependency on this?
Or does the receiver not have sufficient CPU cycles?
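If CPU starvation on the receiver is a suspect, a quick way to check (a sketch using standard tools; mpstat comes from the sysstat package) is to watch per-CPU load and the HCA's interrupt distribution on the target while the jobs run:

```shell
# Per-CPU utilization at 1-second intervals; look for cores pinned at 100%
mpstat -P ALL 1
# Interrupt counts for the mlx4 HCA, refreshed every second
watch -n1 "grep -i mlx4 /proc/interrupts"
```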
Hi Rama,
Please disregard my statement about the type of HCA’s as it is not related here.
Also, what did you mean by "we are not working of standard Linux drivers"?
Thank you,
Sophie.
Hi Rama,
Then can you please describe in detail your current configuration and which mlx* driver you are using?
Regards,
Sophie.
[root@xhdipsnvme1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 7.0 (Maipo)
Are you then using iser or srp for your configuration?
In our kernel configuration we built iser and srp as loadable modules, but we are not using them in our testing.
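As a sanity check that iser/srp are built but not in use, something like the following should show the modules on disk while lsmod returns nothing (a sketch; the module names assume the standard ib_iser/ib_srp naming):

```shell
# Should print nothing if the modules are not loaded
lsmod | grep -E 'ib_iser|ib_srp'
# Confirm the modules were built and are available on disk
modinfo ib_iser ib_srp | grep -i '^filename'
```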