ENABLE_HCA timeout when enabling more than 7 VFs on ConnectX-5

Hello,

I am using an MCX515A-CCAT NIC in an AMD-based system with an X570-A PRO motherboard.

I want to use 64 VFs, and I enable them by writing that number to a sysfs file:

echo 64 > /sys/class/net/eth1/device/mlx5_num_vfs
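For comparison, here is a minimal sketch of the same request made through the generic kernel SR-IOV attributes, with a bounds check first. It assumes that sriov_totalvfs and sriov_numvfs are exposed in the same device directory as mlx5_num_vfs; whether this particular driver build honours them is an assumption, and request_vfs is a hypothetical helper name.

```shell
# request_vfs DEVICE_DIR COUNT
# Checks COUNT against the device's advertised maximum, then asks the
# kernel to create that many VFs. DEVICE_DIR is assumed to be something
# like /sys/class/net/eth1/device.
request_vfs() {
    local dev_dir="$1"
    local count="$2"
    local total

    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$count" -gt "$total" ]; then
        echo "requested $count VFs, device supports at most $total" >&2
        return 1
    fi

    # The kernel rejects a change from one non-zero VF count to another,
    # so reset to 0 before writing the new value.
    echo 0 > "$dev_dir/sriov_numvfs" || return 1
    echo "$count" > "$dev_dir/sriov_numvfs"
}

# Example call (needs root):
#   request_vfs /sys/class/net/eth1/device 64
```

The check only confirms that the NIC advertises enough VFs. It says nothing about whether the platform can address all of them.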

The card appears at PCIe address 0a:00.0.

In dmesg, I found that only 7 VFs were successfully enabled, and they were assigned addresses 0a:00.1 to 0a:00.7. When the mlx5_core module tries to enable VF number 8, it fails with the following error:

mlx5_core 0000:0a:01.0: firmware version: 16.26.1040

mlx5_core 0000:0a:01.0: wait_func:1033:(pid 2261): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource

mlx5_core 0000:0a:01.0: mlx5_function_setup:1266:(pid 2261): enable hca failed

mlx5_core 0000:0a:01.0: init_one:2116:(pid 2261): mlx5_load_one failed with error code -110

mlx5_core: probe of 0000:0a:01.0 failed with error -110

This happens both with and without the pci=assign-busses kernel boot option.
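The jump from 0a:00.7 to 0a:01.0 follows from the SR-IOV address arithmetic. The sketch below illustrates it, using the "VF offset: 1, stride: 1" values from the lspci output further down. It assumes the traditional split of the 16-bit routing ID (8-bit bus, 5-bit device, 3-bit function) and a PF address in the short bus:device.function form without a domain prefix. vf_bdf is a hypothetical helper, and this is an illustration of the numbering only, not a diagnosis of the timeout.

```shell
# vf_bdf PF_BDF VF_OFFSET VF_STRIDE VF_INDEX
# Prints the bus:device.function of VF number VF_INDEX (1-based):
#   VF routing ID = PF routing ID + offset + (index - 1) * stride
vf_bdf() {
    local pf="$1"
    local offset="$2"
    local stride="$3"
    local index="$4"
    local bus="${pf%%:*}"       # e.g. "0a"
    local devfn="${pf#*:}"      # e.g. "00.0"
    local dev="${devfn%%.*}"    # e.g. "00"
    local fn="${devfn##*.}"     # e.g. "0"

    # Routing ID layout without ARI: bus[15:8] device[7:3] function[2:0]
    local pf_rid=$(( (0x$bus << 8) | (0x$dev << 3) | 0x$fn ))
    local rid=$(( pf_rid + offset + (index - 1) * stride ))

    printf '%02x:%02x.%x\n' \
        $(( rid >> 8 )) $(( (rid >> 3) & 0x1f )) $(( rid & 0x7 ))
}

vf_bdf 0a:00.0 1 1 7   # prints 0a:00.7
vf_bdf 0a:00.0 1 1 8   # prints 0a:01.0
```

With a 3-bit function field, device 0 holds only functions 0 to 7, so the PF plus seven VFs fill it. VF 8 is the first one whose routing ID lands on device 1, and it is also the first one that fails here. That is consistent with the failure being related to how the platform handles functions beyond device 0, although the log alone does not prove it.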

System information:

CPU: AMD Ryzen 9 3900X

Kernel: 5.0.0-37-generic

OS: Ubuntu 18.04.3 LTS

Kernel module: mlx5_core, version 4.7-1.0.0

BIOS vendor: American Megatrends Inc.

BIOS version: H.60

BIOS revision: 5.14

lspci output:

0a:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]

Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-

Latency: 0, Cache Line Size: 64 bytes

Interrupt: pin A routed to IRQ 61

Region 0: Memory at 1020000000 (64-bit, prefetchable) [size=32M]

Expansion ROM at fce00000 [disabled] [size=1M]

Capabilities: [60] Express (v2) Endpoint, MSI 00

DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited

ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W

DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-

MaxPayload 512 bytes, MaxReadReq 512 bytes

DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited

ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+

LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+

ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported

DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled

LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-

Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

Compliance De-emphasis: -6dB

LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+

EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

Capabilities: [48] Vital Product Data

Product Name: CX515A - ConnectX-5 QSFP28

Read-only fields:

[PN] Part number: MCX515A-CCAT

[EC] Engineering changes: AA

[V2] Vendor specific: MCX515A-CCAT

[SN] Serial number: MT1936J02431

[V3] Vendor specific: f207701f11cfe9118000b8599fcc3578

[VA] Vendor specific: MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0

[V0] Vendor specific: PCIeGen3 x16

[RV] Reserved: checksum good, 2 byte(s) reserved

End

Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-

Vector table: BAR=0 offset=00002000

PBA: BAR=0 offset=00003000

Capabilities: [c0] Vendor Specific Information: Len=18 <?>

Capabilities: [40] Power Management version 3

Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)

Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Capabilities: [100 v1] Advanced Error Reporting

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+

AERCap: First Error Pointer: 04, GenCap+ CGenEn- ChkCap+ ChkEn-

Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)

ARICap: MFVC- ACS-, Next Function: 0

ARICtl: MFVC- ACS-, Function Group: 0

Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)

IOVCap: Migration-, Interrupt Message Number: 000

IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-

IOVSta: Migration-

Initial VFs: 127, Total VFs: 127, Number of VFs: 1, Function Dependency Link: 00

VF offset: 1, stride: 1, Device ID: 1018

Supported Page Size: 000007ff, System Page Size: 00000001

Region 0: Memory at 0000001022000000 (64-bit, prefetchable)

VF Migration: offset: 00000000, BIR: 0

Capabilities: [1c0 v1] #19

Kernel driver in use: mlx5_core

Kernel modules: mlx5_core

Would it be possible to use more than 7 VFs on my system?

Hi Alexey,

Can you please check whether your BIOS has the ARI (Alternative Routing-ID Interpretation) setting enabled? If it is not enabled, please enable it for Virtual Functions and check again.

Thanks,

Namrata.

Hi Namrata,

I did not find an ARI option in the BIOS of the machine I used initially.

However, I plugged the NIC into another server, and it worked there with up to 127 VFs, even though the second machine did not have an ARI option either; it only had an SR-IOV enable option, which was turned on.

Could you provide any guidelines on how to check whether the NIC and its features are compatible with the system it is plugged into?

Thanks,

Alexey
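One check that does not depend on BIOS menus is to look at the PCIe port directly above the NIC, since ARI has to be forwarded by that port. The sketch below is a starting point under stated assumptions: it assumes a pciutils version whose `lspci -vv` output prints an ARIFwd flag in the DevCap2 and DevCtl2 sections of root and downstream ports, and ari_fwd_state is a hypothetical helper name. It only reports the flags and does not establish that ARI is the cause of the ENABLE_HCA timeout.

```shell
# ari_fwd_state
# Reads `lspci -vv` output for ONE port on stdin and prints the ARI
# Forwarding flags as "cap=<flag> ctl=<flag>", where <flag> is "+", "-",
# or "?" when lspci did not report it. "cap" comes from DevCap2 (the port
# supports ARI forwarding), "ctl" from DevCtl2 (forwarding is enabled).
ari_fwd_state() {
    awk '
        /DevCap2:/ { section = "cap" }
        /DevCtl2:/ { section = "ctl" }
        match($0, /ARIFwd[+-]/) {
            # "ARIFwd" is 6 characters long; the flag follows it.
            flag = substr($0, RSTART + 6, 1)
            if (section == "cap") cap = flag
            if (section == "ctl") ctl = flag
        }
        END {
            printf "cap=%s ctl=%s\n", (cap == "" ? "?" : cap), (ctl == "" ? "?" : ctl)
        }
    '
}

# Usage sketch (needs root so that lspci can read the capability registers).
# The parent directory of the NIC in sysfs is the port it sits behind:
#   port=$(basename "$(readlink -f /sys/bus/pci/devices/0000:0a:00.0/..)")
#   lspci -vv -s "$port" | ari_fwd_state
```

The NIC's own SR-IOV capability reports a related flag: the lspci output above shows "ARIHierarchy-" on the IOVCtl line. Comparing both values between the failing machine and the one that reaches 127 VFs would show whether the two platforms actually differ in ARI handling.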