mlx4_core: communication channel command 0x5 (op=0x24) timed out

One of our Linux servers, running SUSE Linux on x86_64, logged the following errors, and its network interfaces went down.

Looking for any guidance on what this could be and how to address it:

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.515819] mlx4_core 0000:00:14.0: communication channel command 0x5 (op=0x24) timed out

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.515825] mlx4_core 0000:00:14.0: device is going to be reset

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.515829] mlx4_core 0000:00:14.0: VF is sending reset request to Firmware

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516494] mlx4_core 0000:00:14.0: VF Reset succeed

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516495] mlx4_core 0000:00:14.0: device was reset successfully

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516496] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516507] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516508] mlx4_en 0000:00:14.0: Internal error detected, restarting device

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516512] infiniband mlx4_0: ib_query_pkey failed (-5) for index 18

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.516516] infiniband mlx4_0: ib_query_port failed (-5)

Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.795281] ib1: post_send_rss failed, error -5
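
For anyone who wants to check other hosts for the same event, here is a minimal Python sketch that scans kernel log lines for the timeout message shown above and pulls out the PCI address, command, and opcode. The regular expression is derived only from the single log line in this post, and the names `TIMEOUT_RE` and `find_timeouts` are hypothetical, so treat it as a starting point rather than a complete parser for mlx4_core messages.

```python
import re

# Pattern modeled on the one timeout line quoted in this post (an assumption:
# other driver versions may word the message differently).
TIMEOUT_RE = re.compile(
    r"mlx4_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"communication channel command (?P<cmd>0x[0-9a-f]+) "
    r"\(op=(?P<op>0x[0-9a-f]+)\) timed out"
)


def find_timeouts(lines):
    """Return (pci_address, command, opcode) for every matching timeout line."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            # Command and opcode are logged in hex; convert them to integers.
            hits.append((m.group("pci"), int(m.group("cmd"), 16), int(m.group("op"), 16)))
    return hits


# Two lines copied from the log above: one timeout, one unrelated reset notice.
sample = [
    "Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.515819] mlx4_core 0000:00:14.0: "
    "communication channel command 0x5 (op=0x24) timed out",
    "Oct 10 21:36:32 mtb0120qpr5 kernel: [4238431.515825] mlx4_core 0000:00:14.0: "
    "device is going to be reset",
]

for pci, cmd, op in find_timeouts(sample):
    print(f"{pci}: command {cmd:#x}, opcode {op:#x} timed out")
```

In practice the `lines` argument could come from a saved copy of the kernel log opened as a text file, which is iterated line by line.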

The following is the lspci output for the IB card:

00:14.0 0280: 15b3:1004

Subsystem: 15b3:61b0

Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-

Latency: 0

Interrupt: pin A routed to IRQ 25

Region 2: Memory at fa000000 (64-bit, prefetchable) [size=8M]

Capabilities: [60] Express (v2) Endpoint, MSI 00

DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us

ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+

DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-

MaxPayload 128 bytes, MaxReadReq 128 bytes

DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us

ClockPM- Surprise- LLActRep- BwNot-

LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-

ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

DevCap2: Completion Timeout: Range ABCD, TimeoutDis+

DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-

LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB

Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

Compliance De-emphasis: -6dB

LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-

EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-

Vector table: BAR=2 offset=00002000

PBA: BAR=2 offset=00003000

Capabilities: [40] Power Management version 0

Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

Kernel driver in use: mlx4_core

Kernel modules: mlx4_core

The following are a few of the Mellanox RPMs we have installed:

mlnx-ofa_kernel-4.7-OFED.4.7.3.2.9.1.g457f064.sles11sp4

mlnx-ofa_kernel-devel-4.7-OFED.4.7.3.2.9.1.g457f064.sles11sp4

mlnx-ofa_kernel-modules-4.7-OFED.4.7.3.2.9.1.g457f064.kver.3.0.101_107_default

Thanks.

Hello Greg,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the opcode, we researched this internally and found that the issue was resolved a long time ago in a f/w and driver update.

Please update the f/w and driver to the latest available versions. For ConnectX-3 adapters, this is MLNX_OFED 4LTS 4.9 and, depending on the PSID of the adapter, f/w version 2.4x.xxxx.
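
As a rough illustration of the driver side of that check, the Python sketch below extracts the OFED version from one of the RPM names listed in the question and compares it with the 4.9 release mentioned above. The helper `ofed_version` and the constant `RECOMMENDED` are hypothetical names, and the parsing assumes the `OFED.<major>.<minor>` pattern seen in those package names. It does not check the f/w version, which has to be read from the adapter itself.

```python
import re


def ofed_version(rpm_name):
    """Extract (major, minor) from the 'OFED.<major>.<minor>...' part of an RPM name."""
    m = re.search(r"OFED\.(\d+)\.(\d+)", rpm_name)
    if m is None:
        raise ValueError(f"no OFED version found in {rpm_name!r}")
    return int(m.group(1)), int(m.group(2))


# Driver release named in this reply, written as a (major, minor) tuple.
RECOMMENDED = (4, 9)

# Package name copied from the RPM list in the question.
installed = ofed_version("mlnx-ofa_kernel-4.7-OFED.4.7.3.2.9.1.g457f064.sles11sp4")

# Tuples compare element by element, so (4, 7) < (4, 9) holds.
needs_update = installed < RECOMMENDED
print(f"installed OFED {installed[0]}.{installed[1]}, update needed: {needs_update}")
```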

If you are still experiencing this issue after updating the driver and f/w, please do not hesitate to open an NVIDIA Networking Support Ticket by sending an email to networking-support@nvidia.com

We will gladly assist you through the support ticket.

Thank you and regards,

~NVIDIA Networking Technical Support