AMD 7302, MCX456A-FCAT, CentOS 7.8 - IB won't link, ibutils hang, reads of sysfs time out

OS: CentOS 7.8.2003
kernel: 3.10.0-1160.42.2.el7.x86_64
CPU: AMD Epyc 7302
IB: MCX456A-FCAT
MOFED: 4.9-3.1.5.0-LTS

I'm seeing a strange issue. The system boots fine with no hardware errors detected, but the MLX IB card will not link, and several infiniband-diagnostics commands hang or time out. Under strace, every hang occurs while reading sysfs files, either /sys/class/infiniband/mlx5_0/ports/1/pkeys/* or other files under /sys/class/infiniband/mlx5_0/ports/1/.

I am able to see the HCA via lspci with the proper speed and width detected. No PCIe errors. Loading mlx5_core reports a good PCIe bandwidth heuristic value. I can pull VPD data from the HCA with no problems. I can run mlxconfig query and flint query with no problem. The card reports, via mlx5_core dmesg messages, when a cable is connected or disconnected. If I run mlx_cables I get cable data. The card is in IB mode (not Ethernet).

Every time an IB command hangs (ibaddr, ibswitches, ibstatus), the hang or timeout occurs at a seemingly random point, and strace shows odd errors such as -EEXIST (file already exists, when loading a kernel module) or read errors on sysfs files or pointers.

More output/diag data below. Ideas? It seems the hardware works, but the sysfs hangs cause the problems.

Below, strace ibaddr -L hangs while reading from sysfs after several other sysfs files were successfully opened and read. It hangs on reading ports/1/pkeys/10:

open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/4", O_RDONLY) = 4
read(4, "0x8436\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/5", O_RDONLY) = 4
read(4, "0x9336\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/6", O_RDONLY) = 4
read(4, "0x8437\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/7", O_RDONLY) = 4
read(4, "0x8438\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/8", O_RDONLY) = 4
read(4, "0x9338\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/9", O_RDONLY) = 4
read(4, "0x9340\n", 32)                 = 7
close(4)                                = 0
open("/sys/class/infiniband/mlx5_0/ports/1/pkeys/10", O_RDONLY) = 4
read(4,
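
To find out which pkey index stalls without tying up the terminal, the same reads can be driven from a script with a per-file timeout. The following is only a sketch: the function names and the 5 second default are my own choices, and it assumes every entry in the pkeys directory has a numeric name (as in the strace above). The read runs in a daemon thread, so a read that is stuck in the kernel does not block the caller.

```python
import os
import threading


def read_with_timeout(path, timeout_s=5.0):
    """Return the file's text, or None if the read did not finish in time."""
    result = {}

    def worker():
        try:
            with open(path) as fh:
                result["text"] = fh.read().strip()
        except OSError as exc:
            # Report read errors as text instead of raising in the thread.
            result["text"] = f"error: {exc.strerror}"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    # If the thread is still alive, the read is presumed hung.
    return None if thread.is_alive() else result.get("text")


def probe_pkeys(pkey_dir, timeout_s=5.0):
    """Read each entry of pkey_dir in numeric order; stop at the first hang."""
    report = []
    for name in sorted(os.listdir(pkey_dir), key=int):
        value = read_with_timeout(os.path.join(pkey_dir, name), timeout_s)
        report.append((int(name), value))
        if value is None:
            break
    return report
```

Calling probe_pkeys("/sys/class/infiniband/mlx5_0/ports/1/pkeys") would return a list of (index, value) pairs, with None as the value for the first index that did not answer in time.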

Non-IB commands like mlxvpd or lspci work fine.

# mlxvpd -d 21:00.0

  VPD-KEYWORD    DESCRIPTION             VALUE
  -----------    -----------             -----
Read Only Section:

  PN             Part Number             MCX456A-FCAT
  EC             Revision                AF
  SN             Serial Number           MT1827K06622
  V0             Misc Info               PCIeGen3 x16
  RV             Checksum Complement     0x18
  IDTAG          Board Id                CX456A - ConnectX-4 QSFP28
# lspci -s 21:00.0 -vvv
21:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: Mellanox Technologies Device 0012
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 120
	NUMA node: 0
	Region 0: Memory at 3007e000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at a6400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: CX456A - ConnectX-4 QSFP28
		Read-only fields:
			[PN] Part number: MCX456A-FCAT
			[EC] Engineering changes: AF
			[SN] Serial number: MT1827K06622
			[V0] Vendor specific: PCIeGen3 x16
			[RV] Reserved: checksum good, 2 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Capabilities: [230 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

dmesg -T | grep mlx

mlx5_core 0000:21:00.0: firmware version: 12.24.1000
 mlx5_core 0000:21:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
 mlx5_core 0000:21:00.0: irq 188 for MSI/MSI-X
(repeats in sequence to irq 220)
 mlx5_core 0000:21:00.0: irq 220 for MSI/MSI-X
 mlx5_core 0000:21:00.0: Port module event: module 0, Cable unplugged
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 466): Recovered 1 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: mlx5_fw_tracer_start:810:(pid 466): FWTracer: Ownership granted and active
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 466): Recovered 2 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: firmware version: 12.24.1000
 mlx5_core 0000:21:00.1: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
 mlx5_core 0000:21:00.1: irq 222 for MSI/MSI-X
(repeats in sequence to irq 254)
 mlx5_core 0000:21:00.1: irq 254 for MSI/MSI-X
 mlx5_core 0000:21:00.1: Port module event: module 1, Cable plugged
 mlx5_ib: Mellanox Connect-IB Infiniband driver v4.9-3.1.5
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1054:(pid 781): Recovered 0 EQEs on cmd_eq, no done completion for ent (1)
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 2 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func:1081:(pid 781): QUERY_HCA_VPORT_PKEY(0x765) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: ib_query_pkey failed (-110) for index 31
 infiniband mlx5_0: mlx5_ib_enable_driver:7341:(pid 781): Testing write-combining support failed with error=-2
 mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 781): Recovered 2 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: wait_func:1085:(pid 781): CREATE_MKEY(0x200) canceled on out of queue timeout.
 infiniband mlx5_0: init_driver_cnak:1251:(pid 781): failed to create dc DMA MR
 mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 756): Recovered 43 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 759): Recovered 32 EQEs on cmd_eq
 mlx5_core 0000:21:00.0: wait_func_handle_exec_timeout:1049:(pid 557): Recovered 1 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 759): Recovered 32 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 756): Recovered 32 EQEs on cmd_eq
 mlx5_core 0000:21:00.1: wait_func:1085:(pid 2758): CREATE_MKEY(0x200) canceled on out of queue timeout.
 mlx5_core 0000:21:00.1: mlx5e_create_mdev_resources:111:(pid 2758): create mkey failed, -125
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 221): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
 infiniband mlx5_0: create_mkey_callback:206:(pid 221): async reg mr failed. status -110
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 infiniband mlx5_0: create_mkey_callback:206:(pid 2857): async reg mr failed. status -11
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 mlx5_core 0000:21:00.1: cmd_work_handler:921:(pid 2857): failed to allocate command entry
 mlx5_0, 1: ipoib_intf_alloc failed -125
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
 mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 220): CREATE_MKEY(0x200) timeout. Will cause a leak of a command resource
 infiniband mlx5_0: create_mkey_callback:206:(pid 220): async reg mr failed. status -110
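
Since the log is long, it can help to tally which firmware commands time out and on which PCI function. This is a rough sketch, not a tool shipped with MOFED: the regular expression is written against the message shapes in the dmesg output above only, and the names CMD_RE and summarize are mine.

```python
import re
from collections import Counter

# Matches lines such as:
#   mlx5_core 0000:21:00.1: wait_func:1085:(pid 2758): CREATE_MKEY(0x200) canceled on out of queue timeout.
#   mlx5_core 0000:21:00.1: cb_timeout_handler:882:(pid 221): CREATE_MKEY(0x200) timeout. Will cause ...
CMD_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f:.]+): \w+:\d+:\(pid \d+\): "
    r"(?P<cmd>[A-Z_]+)\((?P<opcode>0x[0-9a-f]+)\) (?P<what>timeout|canceled)"
)


def summarize(lines):
    """Count (PCI function, command name, outcome) triples in dmesg lines."""
    counts = Counter()
    for line in lines:
        match = CMD_RE.search(line)
        if match:
            counts[(match["bdf"], match["cmd"], match["what"])] += 1
    return counts
```

Feeding it the output of dmesg line by line gives a Counter keyed by (PCI address, command, "timeout" or "canceled"). The "Recovered N EQEs" lines are deliberately not counted.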

Hi,

Can you please upgrade the HCA firmware?

The MOFED 4.9-3.1.5.0-LTS Release Notes list 12.28.2006 as the recommended firmware version for this driver.
The current FW on the HCA is 12.24.1000.

Firmware download location: https://network.nvidia.com/support/firmware/connectx4ib/

Thanks!

Firmware updated to 12.28.2006. Failure still occurs with MOFED 4.9-3.1.5.0-LTS.

I did a full uninstall of MOFED 4.9-3.1.5.0-LTS, removed all kernel modules from the initramfs, rebooted clean, and installed MOFED 4.9-7.1.0.0-LTS. Same failures.

I ran additional diagnostics when starting openibd; the results are below. It appears the mlx5_core driver fails to reserve BAR 0 memory, and the rest of the failures follow from that.

Command run: /etc/init.d/openibd start

/* cat /var/log/messages */

root[8158]: openibd: running in manual mode
root[8255]: openibd: running in manual mode
kernel: Compat-mlnx-ofed backport release: 382c630
kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git 382c630
kernel: compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
kernel: mlx5_core 0000:21:00.0: BAR 0: can't reserve [mem 0x3007e000000-0x3007fffffff 64bit pref]
kernel: mlx5_core 0000:21:00.0: Couldn't get PCI resources, aborting
kernel: mlx5_core 0000:21:00.0: mlx5_pci_init:1055:(pid 8347): error requesting BARs, aborting
kernel: mlx5_core 0000:21:00.0: init_one:2142:(pid 8347): mlx5_pci_init failed with error code -16
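
For reference, the negative numbers in these kernel messages are Linux errno values. Below is a small sketch that decodes the ones appearing in this thread. The table is hard-coded with the generic Linux numbering (so it gives the same answer on any host), and the helper name decode_errnos is my own. Error -16 is EBUSY, which for a BAR reservation usually suggests the region is already claimed by something else, though that is an interpretation rather than something the log states.

```python
import re

# Generic Linux errno numbering for the codes seen in the logs in this thread.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    11: ("EAGAIN", "Resource temporarily unavailable"),
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation canceled"),
}

# A negative integer that is not part of a version string such as "4.9-3.1.5".
NEG_INT_RE = re.compile(r"(?<![\w.])-(\d+)\b")


def decode_errnos(line):
    """Return (code, name, description) for each known errno found in line."""
    found = []
    for match in NEG_INT_RE.finditer(line):
        code = int(match.group(1))
        if code in LINUX_ERRNO:
            name, text = LINUX_ERRNO[code]
            found.append((-code, name, text))
    return found
```

Codes that are not in the table are skipped rather than guessed at.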

No errors other than the mlx5_core errors. No CPU, memory, PCIe, or MCE errors reported.

Complete diagnostic output

/* mlxfwmanager -d 21:00.0 --query */

Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4
  Part Number:      MCX456A-FCA_Ax
  Description:      ConnectX-4 VPI adapter card; FDR IB (56Gb/s) and 40GbE; dual-port QSFP28; PCIe3.0 x16; ROHS R6
  PSID:             MT_2170111021
  PCI Device Name:  21:00.0
  Base GUID:        98039b03000525f2
  Versions:         Current        Available
     FW             12.28.2006     N/A
     PXE            3.6.0102       N/A
     UEFI           14.21.0017     N/A

  Status:           No matching image found

-------------------------

/* cat /var/log/messages */

root[8158]: openibd: running in manual mode
root[8255]: openibd: running in manual mode
kernel: Compat-mlnx-ofed backport release: 382c630
kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git 382c630
kernel: compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
kernel: mlx5_core 0000:21:00.0: BAR 0: can't reserve [mem 0x3007e000000-0x3007fffffff 64bit pref]
kernel: mlx5_core 0000:21:00.0: Couldn't get PCI resources, aborting
kernel: mlx5_core 0000:21:00.0: mlx5_pci_init:1055:(pid 8347): error requesting BARs, aborting
kernel: mlx5_core 0000:21:00.0: init_one:2142:(pid 8347): mlx5_pci_init failed with error code -16
kernel: mlx5_core: probe of 0000:21:00.0 failed with error -16
kernel: mlx5_core 0000:21:00.1: firmware version: 12.28.2006
kernel: mlx5_core 0000:21:00.1: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
kernel: mlx5_core 0000:21:00.1: Port module event: module 1, Cable unplugged
kernel: mlx5_core 0000:21:00.1: mlx5_fw_tracer_start:810:(pid 8347): FWTracer: Ownership granted and active
kernel: mlx5_ib: Mellanox Connect-IB Infiniband driver v4.9-7.1.0
systemd: Started Session 25 of user root.
kernel: mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8365): Recovered 2 EQEs on cmd_eq
kernel: mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8354): Recovered 2 EQEs on cmd_eq
kernel: mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1054:(pid 8365): Recovered 0 EQEs on cmd_eq, no done completion for ent (1)
kernel: mlx5_core 0000:21:00.1: wait_func:1081:(pid 8365): QUERY_HCA_VPORT_PKEY(0x765) timeout. Will cause a leak of a command resource
kernel: infiniband mlx5_0: ib_query_pkey failed (-110) for index 1
kernel: infiniband mlx5_0: mlx5_ib_enable_driver:7343:(pid 8365): Testing write-combining support failed with error=-2
kernel: mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8354): Recovered 1 EQEs on cmd_eq

/* cat /proc/iomem */

# cat /proc/iomem
00000000-00000fff : reserved
00001000-0009ffff : System RAM
000a0000-000fffff : reserved
  000a0000-000bffff : PCI Bus 0000:60
  000c0000-000dffff : PCI Bus 0000:00
  000f0000-000fffff : System ROM
00100000-2fffffff : System RAM
  25000000-2fffffff : Crash kernel
30000000-30041fff : ACPI Non-volatile Storage
30042000-75daffff : System RAM      <---BAR 0 trying to allocate in this range
75db0000-75ffffff : reserved
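
The BAR in the log is 0x3007e000000-0x3007fffffff, which has eleven hex digits, so matching it by eye against eight-digit /proc/iomem ranges is easy to get wrong. A mechanical overlap check is safer. This is a sketch under my own naming (parse_iomem, overlapping), and it expects unannotated /proc/iomem text. Note that none of the ranges in the excerpt above reach that address, so whichever entry covers the BAR would be further down in the full file.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" with hex addresses.
IOMEM_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def parse_iomem(text):
    """Parse /proc/iomem text into (start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        match = IOMEM_RE.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            entries.append((start, end, match.group(3).strip()))
    return entries


def overlapping(entries, start, end):
    """Return every entry whose inclusive range intersects [start, end]."""
    return [e for e in entries if e[0] <= end and start <= e[1]]
```

For example, overlapping(parse_iomem(open("/proc/iomem").read()), 0x3007e000000, 0x3007fffffff) would list whatever the kernel has registered across the BAR 0 window.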

/* dmesg */

Compat-mlnx-ofed backport release: 382c630
Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git 382c630
compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
mlx5_core 0000:21:00.0: BAR 0: can't reserve [mem 0x3007e000000-0x3007fffffff 64bit pref]
mlx5_core 0000:21:00.0: Couldn't get PCI resources, aborting
mlx5_core 0000:21:00.0: mlx5_pci_init:1055:(pid 8347): error requesting BARs, aborting
mlx5_core 0000:21:00.0: init_one:2142:(pid 8347): mlx5_pci_init failed with error code -16
mlx5_core: probe of 0000:21:00.0 failed with error -16
mlx5_core 0000:21:00.1: firmware version: 12.28.2006
mlx5_core 0000:21:00.1: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
mlx5_core 0000:21:00.1: irq 189 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 190 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 191 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 192 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 193 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 194 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 195 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 196 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 197 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 198 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 199 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 200 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 201 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 202 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 203 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 204 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 205 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 206 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 207 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 208 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 209 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 210 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 211 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 212 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 213 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 214 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 215 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 216 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 217 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 218 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 219 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 220 for MSI/MSI-X
mlx5_core 0000:21:00.1: irq 221 for MSI/MSI-X
mlx5_core 0000:21:00.1: Port module event: module 1, Cable unplugged
mlx5_core 0000:21:00.1: mlx5_fw_tracer_start:810:(pid 8347): FWTracer: Ownership granted and active
mlx5_ib: Mellanox Connect-IB Infiniband driver v4.9-7.1.0
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8365): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8354): Recovered 2 EQEs on cmd_eq
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1054:(pid 8365): Recovered 0 EQEs on cmd_eq, no done completion for ent (1)
mlx5_core 0000:21:00.1: wait_func:1081:(pid 8365): QUERY_HCA_VPORT_PKEY(0x765) timeout. Will cause a leak of a command resource
infiniband mlx5_0: ib_query_pkey failed (-110) for index 1
infiniband mlx5_0: mlx5_ib_enable_driver:7343:(pid 8365): Testing write-combining support failed with error=-2
mlx5_core 0000:21:00.1: wait_func_handle_exec_timeout:1049:(pid 8354): Recovered 1 EQEs on cmd_eq

-------------------------

lspci -s $MLX_ADDR -vvv

21:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: Mellanox Technologies Device 0012
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 137
	NUMA node: 0
	Region 0: Memory at 3007e000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at a6400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: CX456A - ConnectX-4 QSFP28
		Read-only fields:
			[PN] Part number: MCX456A-FCAT
			[EC] Engineering changes: AF
			[SN] Serial number: MT1827K06622
			[V0] Vendor specific: PCIeGen3 x16
			[RV] Reserved: checksum good, 2 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable- Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Capabilities: [230 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel modules: mlx5_core

21:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: Mellanox Technologies Device 0012
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 188
	NUMA node: 0
	Region 0: Memory at 3007c000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at a6300000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: CX456A - ConnectX-4 QSFP28
		Read-only fields:
			[PN] Part number: MCX456A-FCAT
			[EC] Engineering changes: AF
			[SN] Serial number: MT1827K06622
			[V0] Vendor specific: PCIeGen3 x16
			[RV] Reserved: checksum good, 2 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [230 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core