OS: CentOS 7.8.2003
kernel: 3.10.0-1160.42.2.el7.x86_64
CPU: AMD Epyc 7302
IB: MCX456A-FCAT
MOFED: 4.9-3.1.5.0-LTS
Seeing a weird issue. The system boots fine with no hardware errors detected, but the MLX IB card will not link, and several infiniband-diagnostics commands hang or time out. Under strace, every hang occurs while reading sysfs files, either /sys/class/infiniband/mlx5_0/ports/1/pkeys/* or other files under /sys/class/infiniband/mlx5_0/ports/1/.
I can see the HCA via lspci with the proper speed and width detected, and no PCIe errors. Loading mlx5_core reports a good PCIe bandwidth heuristic value. I can pull VPD data from the HCA with no problems, and mlxconfig query and flint query both run fine. The card reports, via mlx5_core messages in dmesg, when a cable is connected or disconnected. If I run mlx_cables I get cable data. The card is in IB mode (not Ethernet).
Every time an IB command hangs (ibaddr, ibswitches, ibstatus), the hang or timeout shows up at a random point, and strace shows weird errors like -EEXIST (file already exists, when loading a kernel module) or read errors on sysfs files or pointers.
More output/diag data below. Ideas? It seems the hardware works, but the sysfs hangs cause the problems.
Below, strace ibaddr -L hangs while reading from sysfs, after several other sysfs files were opened and read successfully. It hangs on reading ports/1/pkeys/10:
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/4", O_RDONLY) = 4
read(4, "0x8436\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/5", O_RDONLY) = 4
read(4, "0x9336\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/6", O_RDONLY) = 4
read(4, "0x8437\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/7", O_RDONLY) = 4
read(4, "0x8438\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/8", O_RDONLY) = 4
read(4, "0x9338\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/9", O_RDONLY) = 4
read(4, "0x9340\n", 32) = 7
close(4) = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/10", O_RDONLY) = 4
read(4,
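To narrow down which pkey entries stall without tying up the shell, each sysfs file can be read in a separate child process with a deadline. This is only a rough sketch: the helper name probe_pkeys and the 5 second timeout are my own choices, and a child that is blocked inside the kernel may ignore the kill until the driver command itself times out.

```python
import os
import subprocess


def probe_pkeys(pkey_dir, timeout=5.0):
    """Read every numeric entry in pkey_dir through a child `cat` process.

    Returns {index: value}, where value is the stripped file content, or
    None if the read failed or did not finish within `timeout` seconds.
    """
    results = {}
    # Sort numerically so that 10 comes after 9, matching the strace order.
    indexes = sorted(int(n) for n in os.listdir(pkey_dir) if n.isdigit())
    for idx in indexes:
        path = os.path.join(pkey_dir, str(idx))
        proc = subprocess.Popen(
            ["cat", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
        try:
            out, _ = proc.communicate(timeout=timeout)
            results[idx] = out.strip() if proc.returncode == 0 else None
        except subprocess.TimeoutExpired:
            # The kill may not take effect while the child is stuck in the kernel.
            proc.kill()
            proc.stdout.close()
            results[idx] = None
    return results


if __name__ == "__main__":
    # Path taken from the strace output in this post.
    pkey_dir = "/sys/class/infiniband/mlx5_0/ports/1/pkeys"
    if os.path.isdir(pkey_dir):
        for idx, value in probe_pkeys(pkey_dir).items():
            print(idx, value if value is not None else "NO RESPONSE")
```

Running it a few times would show whether the stall is tied to one index or moves around, which is what the random hangs suggest.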
Non-IB commands like mlxvpd or lspci work fine.
# mlxvpd -d 21:00.0
VPD-KEYWORD DESCRIPTION VALUE
----------- ----------- -----
Read Only Section:
PN Part Number MCX456A-FCAT
EC Revision AF
SN Serial Number MT1827K06622
V0 Misc Info PCIeGen3 x16
RV Checksum Complement 0x18
IDTAG Board Id CX456A - ConnectX-4 QSFP28
# lspci -s 21:00.0 -vvv
21:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies Device 0012
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 120
NUMA node: 0
Region 0: Memory at 3007e000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at a6400000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX456A - ConnectX-4 QSFP28
Read-only fields:
[PN] Part number: MCX456A-FCAT
[EC] Engineering changes: AF
[SN] Serial number: MT1827K06622
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 04, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [1c0 v1] #19
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
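For comparing the negotiated link against the advertised capability without eyeballing the dump, a small parser along these lines could help. It is a sketch that assumes the lspci -vvv layout shown above; link_summary is a made-up helper name. It also lists the DevSta bits that are set, since the output above shows CorrErr+ and UnsuppReq+ even though UESta and CESta are clean. Those bits can get set during normal bus enumeration, so they are not necessarily a fault.

```python
import re

# Three lines copied from the lspci output in this post.
SAMPLE = """\
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""


def link_summary(lspci_text):
    """Pull link capability, link status and raised DevSta flags from lspci -vvv text."""
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+GT/s).*?Width (x\d+)", lspci_text)
    sta = re.search(r"LnkSta:.*?Speed ([\d.]+GT/s).*?Width (x\d+)", lspci_text)
    devsta = re.search(r"DevSta:(.*)", lspci_text)
    return {
        "cap": cap.groups() if cap else None,   # (speed, width) the device supports
        "sta": sta.groups() if sta else None,   # (speed, width) actually negotiated
        # Flags printed with a trailing "+" are currently set.
        "devsta_set": re.findall(r"(\w+)\+", devsta.group(1)) if devsta else [],
    }


if __name__ == "__main__":
    info = link_summary(SAMPLE)
    print("capability:", info["cap"])
    print("status:    ", info["sta"])
    print("link matches capability:", info["cap"] == info["sta"])
    print("DevSta bits set:", info["devsta_set"])
```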
# dmesg -T | grep mlx
mlx5_core 0000:21:00.0: firmware version: 12.24.1000
mlx5_core 0000:21:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
mlx5_core 0000:21:00.0: irq 188 for MSI/MSI-X
(repeats in sequence to irq 220)
mlx5_core 0000:21:00.0: irq 220 for MSI/MSI-X
mlx5_core 0000:21:00.0: Port module event: module 0, Cable unplugged
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 466): Recovered 1 EQEs on cmd_eq
mlx5_core 0000:21:00.0: mlx5_fw_tracer_start:810:(pid 466): FWTracer: Ownership granted and active
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 466): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.1: firmware version: 12.24.1000
mlx5_core 0000:21:00.1: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
mlx5_core 0000:21:00.1: irq 222 for MSI/MSI-X
(repeats in sequence to irq 254)
mlx5_core 0000:21:00.1: irq 254 for MSI/MSI-X
mlx5_core 0000:21:00.1: Port module event: module 1, Cable plugged
mlx5_ib: Mellanox Connect-IB Infiniband driver v4.9-3.1.5
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1054:(pid 781): Recovered 0 EQEs on cmd_eq, no done completion for ent (1)
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func:1081:(pid 781): QUERY_HCA_VPORT_PKEY(0x765) timeout. Will cause a leak of a command resource
infiniband mlx5_0: ib_query_pkey failed (-110) for index 31
infiniband mlx5_0: mlx5_ib_enable_driver:7341:(pid 781): Testing write-combining support failed with error=-2
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 781): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func:1085:(pid 781): CREATE_MKEY(0x200) canceled on out of queue timeout.
infiniband mlx5_0: init_driver_cnak:1251:(pid 781): failed to create dc DMA MR
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 756): Recovered 43 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 759): Recovered 32 EQEs on cmd_eq
mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 759): Recovered 32 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 756): Recovered 32 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func:1085:(pid 2758): CREATE_MKEY(0x200) canceled on out of queue timeout.
mlx5_core 0000:21:00.1: mlx5e_create_mdev_resources:111:(pid 2758): create mkey failed, -125
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 221): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
infiniband mlx5_0: create_mkey_callback:206:(pid 221): async reg mr failed. status -110
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
mlx5_0, 1: ipoib_intf_alloc failed -125
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
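The status codes in the log are negative Linux errno values: -110 is ETIMEDOUT, -125 is ECANCELED, -11 is EAGAIN and -2 is ENOENT. To see at a glance which firmware commands time out on which PCI function, the dmesg text can be tallied with something like the sketch below. The regular expressions are my own guess at the message layout, based only on the lines pasted above, and summarize is a made-up helper name.

```python
import re
from collections import Counter

# Linux errno numbers that appear in the log above.
ERRNO_NAMES = {2: "ENOENT", 11: "EAGAIN", 110: "ETIMEDOUT", 125: "ECANCELED"}

# e.g. "mlx5_core 0000:21:00.1: wait_func:1085:(pid 781): CREATE_MKEY(0x200) canceled ..."
CMD_RE = re.compile(
    r"mlx5_core (\S+): .*?: (\w+)\((0x[0-9a-fA-F]+)\) (timeout|canceled)"
)
# A negative number preceded by a space, "(" or "=", e.g. "(-110)", "error=-2".
ERR_RE = re.compile(r"(?<=[ (=])-(\d+)\b")

# A few lines copied from the dmesg output in this post.
SAMPLE = """\
mlx5_core 0000:21:00.0: wait_func:1081:(pid 781): QUERY_HCA_VPORT_PKEY(0x765) timeout. Will cause a leak of a command resource
infiniband mlx5_0: ib_query_pkey failed (-110) for index 31
mlx5_core 0000:21:00.1: wait_func:1085:(pid 781): CREATE_MKEY(0x200) canceled on out of queue timeout.
mlx5_core 0000:21:00.1: mlx5e_create_mdev_resources:111:(pid 2758): create mkey failed, -125
mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 221): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
"""


def summarize(lines):
    """Count failed firmware commands per device and errno codes per value."""
    commands = Counter()
    errnos = Counter()
    for line in lines:
        m = CMD_RE.search(line)
        if m:
            commands[m.groups()] += 1  # (device, command, opcode, outcome)
        for code in ERR_RE.findall(line):
            errnos[int(code)] += 1
    return commands, errnos


if __name__ == "__main__":
    cmds, errs = summarize(SAMPLE.splitlines())
    for (dev, name, opcode, outcome), n in sorted(cmds.items()):
        print(f"{dev} {name}({opcode}) {outcome}: {n}")
    for code, n in sorted(errs.items()):
        print(f"-{code} {ERRNO_NAMES.get(code, 'unknown')}: {n}")
```

Feeding it the full output of dmesg (for example via sys.stdin) instead of SAMPLE would give the complete picture across both functions, 21:00.0 and 21:00.1.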