IPoIB iscsi disconnects. CQE retry exceeded

Hi guys,

Our new environment is having some issues. The architecture is as follows:

Compute - XenServer 6.2 (HP DL360 G7s)

Storage - OmniOS Comstar Target

Voltaire 4036 switching

ConnectX-2 cards

OFED 1.5.4

The system has been relatively stable; however, when it is placed under load it generates iSCSI connection errors (1020) on the compute side and CQE retry counter exceeded errors across the storage nodes.

I’ve searched high and low for similar issues, but cannot find any further information about what may be causing this.

Is there any information I am missing, or has anyone else experienced similar issues?

Where would one start to debug such an issue?

Thanks in advance.

Hi Andre,

I updated to 2.2; however, my XenServer host now kernel panics when loading openibd:

Unloading HCA driver: [ OK ]

Message from syslogd@ at Mon Jul 7 19:24:45 2014 …

129 kernel: [ 186.038024] NMI: IOCK error (debug interrupt?)

Message from syslogd@ at Mon Jul 7 19:24:45 2014 …

129 kernel: [ 186.038085] Process swapper (pid: 0, ti=c04fe000 task=c05a88a0 task.ti=c04fe000)

Message from syslogd@ at Mon Jul 7 19:24:45 2014 …

129 kernel: [ 186.038089] Stack:

Message from syslogd@ at Mon Jul 7 19:24:45 2014 …

129 kernel: [ 186.038102] Call Trace:

Message from syslogd@ at Mon Jul 7 19:24:45 2014 …

129 kernel: [ 186.038126] Code: cc cc cc cc b8 1c 00 00 00 cd 82 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 cd 82 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc

Have you checked the known issues in the release notes? I see items 58 and 59 for Xen: http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_2_2-1_0_1.pdf

Please share more details: screenshot, kernel version, HCA FW version and type. You can also run the mstflint utility (http://www.mellanox.com/downloads/MFT/mft-3.6.0-24.tgz), e.g. “mstflint -d 83:00.0 -qq q”, to get more details on the card (where 83:00.0 is the PCI device address of the Mellanox card, which you can get with “lspci | grep Mellanox”).
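To avoid copying the PCI address by hand, the two commands above can be chained. This is only a rough sketch: the function names mlx_pci_addrs and query_all_hcas are made up for illustration, and it assumes lspci prints the PCI address as the first field of each line.

```shell
#!/bin/sh
# Hypothetical helper: read lspci-style lines on stdin and print the PCI
# address (first field) of every line that mentions Mellanox.
mlx_pci_addrs() {
    grep -i 'Mellanox' | awk '{print $1}'
}

# Hypothetical helper: run the mstflint query from the post above against
# every Mellanox device found on this host.
query_all_hcas() {
    lspci | mlx_pci_addrs | while read -r addr; do
        echo "== $addr =="
        mstflint -d "$addr" -qq q
    done
}
```

Running query_all_hcas as root on the XenServer host should print one query block per card, which can be pasted into a reply.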

Thanks

If you don’t have any limitations on which driver you can use, try installing the latest MLNX OFED (see link below). There have been a number of significant improvements between the 1.5.x and 2.x OFED versions, specifically for IPoIB. Note that for ConnectX family adapters the default IPoIB operating mode is UD with 2.x drivers, whereas it was CM with 1.5.x driver versions. With 2.x drivers UD is the recommended mode, so unless there are other reasons to use CM in your setup, leave the default UD.
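To see which mode an interface is actually running on the Linux (XenServer) side, the IPoIB driver exposes it in sysfs as "datagram" (UD) or "connected" (CM). A minimal sketch, assuming the interface is called ib0; the function name ipoib_mode and the overridable sysfs root are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: print the IPoIB mode of an interface.
#   $1: interface name (ib0 is an assumed default, adjust to your setup)
#   $2: sysfs root, only overridden when testing the function off-host
# Prints "datagram" for UD or "connected" for CM.
ipoib_mode() {
    iface=${1:-ib0}
    sysroot=${2:-/sys}
    cat "$sysroot/class/net/$iface/mode"
}
```

Calling ipoib_mode ib0 before and after the driver upgrade shows whether the default changed. This does not apply to the OmniOS storage nodes, which have no Linux sysfs.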

http://www.mellanox.com/downloads/ofed/MLNX_OFED-2.2-1.0.1/MLNX_OFED_LINUX-2.2-1.0.1-xenserver6.x-i686.iso

The issue appears to be with the HP ConnectX cards. We were using two of these in our cluster with fw 2.7. We downloaded and burnt 2.8 (sourced from the HP site) after carefully checking that the PSID matched the documentation; however, it appears that after the firmware update the HCA is now bricked.

lspci and mstflint report the new firmware, but this card kernel panics the host.

We are not using PCI passthrough, just using the IB fabric for the storage network.

I’m assuming you’ve rebooted the host after the f/w upgrade, since mstflint will report the new fw without a reboot, but the card will not use it until you reboot…
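On a host where the driver still loads, one way to check this is to compare the version in flash (from mstflint) with the version the running driver reports in sysfs. This is a sketch under assumptions: the device name mlx4_0 and both function names are illustrative, and the parsing assumes the query output contains a line starting with "FW Version:".

```shell
#!/bin/sh
# Hypothetical helper: pull the firmware version out of mstflint query
# output supplied on stdin (assumes a line of the form "FW Version: x.y.z").
flash_fw_ver() {
    awk '/^FW Version:/ {print $3}'
}

# Hypothetical helper: print the firmware version the loaded driver reports.
#   $1: IB device name (mlx4_0 is an assumed default)
#   $2: sysfs root, only overridden when testing the function off-host
running_fw_ver() {
    dev=${1:-mlx4_0}
    sysroot=${2:-/sys}
    cat "$sysroot/class/infiniband/$dev/fw_ver"
}
```

For example, compare the output of "mstflint -d 83:00.0 q | flash_fw_ver" with "running_fw_ver mlx4_0". If the two differ, the card is most likely still running the old image and needs a reboot.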

If you can still access the HCA with mstflint, please get and send me its config/ini file. Here is an example: “mstflint -d 83:00.0 dc > myhca.ini”. I’ll see if I can get a binary for it.

Andre