Trouble with ConnectX-3 VPI VFs with SR-IOV

Hi,

I am trying to get VFs working on the IB card to pass through to KVM guests. Following the steps in "HowTo Configure SR-IOV for ConnectX-3 with KVM (InfiniBand)" (https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-3-with-kvm--infiniband-x), I run into trouble after restarting openibd in the step “Enable SR-IOV on the MLNX_OFED Driver”. Here are the relevant snippets from the dmesg output (see attachment for further detail):

[ 37.547412] mlx4_core: device is working in RoCE mode: Roce V1

[ 37.572033] mlx4_core: gid_type 1 for UD QPs is not supported by the devicegid_type 0 was chosen instead

[ 37.623776] mlx4_core: UD QP Gid type is: V1

[ 39.430768] mlx4_core 0000:41:00.0: Enabling SR-IOV with 4 VFs

[ 39.562398] pci 0000:41:00.1: [15b3:1004] type 00 class 0x028000

[ 39.569757] mlx4_core: Initializing 0000:41:00.1

[ 39.597827] mlx4_core 0000:41:00.1: enabling device (0000 → 0002)

[ 39.627547] mlx4_core 0000:41:00.1: Detected virtual function - running in slave mode

[ 39.684547] mlx4_core 0000:41:00.1: PF is not ready - Deferring probe

[ 39.714917] pci 0000:41:00.1: Driver mlx4_core requests probe deferral

[ 39.744881] pci 0000:41:00.2: [15b3:1004] type 00 class 0x028000

[ 39.752156] mlx4_core: Initializing 0000:41:00.2

[ 39.782028] mlx4_core 0000:41:00.2: enabling device (0000 → 0002)

[ 39.813140] mlx4_core 0000:41:00.2: Skipping virtual function:2

[ 39.843525] pci 0000:41:00.3: [15b3:1004] type 00 class 0x028000

[ 39.850805] mlx4_core: Initializing 0000:41:00.3

[ 39.879927] mlx4_core 0000:41:00.3: enabling device (0000 → 0002)

[ 39.909787] mlx4_core 0000:41:00.3: Skipping virtual function:3

[ 39.939078] pci 0000:41:00.4: [15b3:1004] type 00 class 0x028000

[ 39.946361] mlx4_core: Initializing 0000:41:00.4

[ 39.974914] mlx4_core 0000:41:00.4: enabling device (0000 → 0002)

[ 40.004714] mlx4_core 0000:41:00.4: Skipping virtual function:4

[ 40.033411] mlx4_core 0000:41:00.0: Running in master mode

— Stacks of MSI/MSI-X messages later —

[ 40.582243] mlx4_core: Initializing 0000:41:00.1

[ 40.610237] mlx4_core 0000:41:00.1: enabling device (0000 → 0002)

[ 40.639442] mlx4_core 0000:41:00.1: Detected virtual function - running in slave mode

[ 40.694489] mlx4_core 0000:41:00.1: Sending reset

[ 40.722845] mlx4_core 0000:41:00.0: Received reset from slave:1

[ 40.750438] mlx4_core 0000:41:00.1: Sending vhcr0

[ 40.777898] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde000 flags=0x0050]

[ 40.833233] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde040 flags=0x0050]

[ 40.890985] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde080 flags=0x0050]

[ 40.949797] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde0c0 flags=0x0050]

[ 46.047238] mlx4_core 0000:41:00.0: command 0x2e failed: fw status = 0x1

[ 46.077884] mlx4_core 0000:41:00.0: mlx4_master_process_vhcr: Failed reading vhcr ret: 0xfffffffb

[ 46.139267] mlx4_core 0000:41:00.0: Failed processing vhcr for slave:1, resetting slave

[ 46.203088] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

[ 46.268572] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

[ 46.336826] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

[ 46.406515] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

[ 46.476511] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

[ 46.546482] mlx4_core 0000:41:00.1: HCA minimum page size:1

[ 46.582122] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

[ 46.653173] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

[ 46.725318] mlx4_core 0000:41:00.1: The host supports neither eth nor rdma interfaces

[ 46.799557] mlx4_core 0000:41:00.1: QUERY_FUNC_CAP general command failed, aborting (-93)

[ 46.873709] mlx4_core 0000:41:00.1: Failed to obtain slave caps

[ 46.911030] mlx4_core 0000:41:00.0: Received reset from slave:1

[ 46.948493] mlx4_core: probe of 0000:41:00.1 failed with error -93

I am concerned about the AMD-Vi messages; googling doesn’t turn up many relevant answers. I am running Ubuntu Trusty 14.04 (3.16 kernel, also tried 4.2) with the latest OFED 3.3 (also tried 3.2).
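
For triage, the AMD-Vi events can be pulled out of a saved dmesg dump and grouped by PCI device. The following is a minimal Python sketch that assumes the exact line format shown in the excerpt above; the function name `page_faults_by_device` is made up for illustration and is not part of any Mellanox or kernel tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
# AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x... flags=0x0050]
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT device=(?P<dev>[0-9a-f:.]+) "
    r"domain=(?P<domain>0x[0-9a-f]+) address=(?P<addr>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def page_faults_by_device(dmesg_text):
    """Return {device: [fault addresses as int]} for every IO_PAGE_FAULT line."""
    faults = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            # int() with base 16 accepts the leading "0x" prefix.
            faults[match.group("dev")].append(int(match.group("addr"), 16))
    return dict(faults)
```

Called on the excerpt above, this would attribute all four faults to device 41:00.1, at addresses 0x40 bytes apart, which may be useful detail for a support case.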

The card is a dual-port CX3 VPI with port 1 connected at FDR:

PSID: MT_1090120019

The hypervisor is a Dell C6145 sled with the latest firmware. SR-IOV is enabled in the BIOS, and the IOMMU is enabled in GRUB. I’m coming from Intel land and am not too familiar with AMD. Does the following look right, or do I need something additional for IOMMU/HW virt/SR-IOV?

[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-71-generic root=UUID=bc67403d-a8e1-4e30-bf48-36ffeecd04e0 ro iommu=pt

[ 4.167159] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40

[ 4.167163] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40

[ 4.167166] AMD-Vi: Interrupt remapping enabled

[ 4.167664] AMD-Vi: Initialized for Passthrough Mode
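
To make the IOMMU settings easy to compare between hosts, the relevant parameters can be extracted from the kernel command line. This is a small Python sketch; the helper name `iommu_params` is hypothetical, and the set of parameter names it looks for (`iommu`, `amd_iommu`, `intel_iommu`) is an assumption about which ones matter here, not a statement of what this setup requires.

```python
def iommu_params(cmdline):
    """Return the IOMMU-related key=value parameters from a kernel command line."""
    keys = ("iommu", "amd_iommu", "intel_iommu")  # assumed set of interest
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # Only keep tokens of the form key=value whose key is in the list.
        if sep and name in keys:
            found[name] = value
    return found
```

On a running system the input would typically come from `/proc/cmdline`, for example `iommu_params(open("/proc/cmdline").read())`.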

The VFs do show up in lspci, but they seem non-functional:

41:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

41:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

41:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

41:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

41:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

modprobe options for mlx4_core:

options mlx4_core num_vfs=4 port_type_array=1,1 probe_vf=1

(Changing to probe_vf=0 doesn’t help; with probe_vf=1 no interfaces come up.)
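
When trying several combinations of these options, it can help to parse the modprobe line into a dictionary so that each attempt can be logged alongside its dmesg output. This is a minimal Python sketch; `parse_options_line` is a hypothetical helper, and it only handles the simple `options <module> key=value ...` form shown above.

```python
def parse_options_line(line):
    """Parse 'options <module> k=v ...' into (module, {k: v}); None if not an options line."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "options":
        return None
    module = tokens[1]
    params = {}
    for token in tokens[2:]:
        name, sep, value = token.partition("=")
        if sep:
            # Values stay as strings because some are lists, e.g. port_type_array=1,1.
            params[name] = value
    return module, params
```

For the line above this yields the module name `mlx4_core` with `num_vfs`, `port_type_array` and `probe_vf` as keys. Whether the loaded driver actually picked up those values would still have to be checked separately, for instance under `/sys/module/mlx4_core/parameters/` if the driver exposes them there.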

Thanks for any suggestions!

Cheers

vc3-vfs.txt.zip (3.77 KB)

Hello Lasse,

I’m not familiar with this error. While looking into it internally, I noticed that you have already opened a support case with us for the same issue.

We will continue to assist you on that case.

Thanks.

.R