Mellanox card disappeared from PCI bus

Hello,

I have two computers with Mellanox ConnectX-3 InfiniBand cards connected directly to each other. I configured several VMs on each node with SR-IOV passthrough of the InfiniBand cards. When I was mostly done, I also tried to configure IB so that it would be usable on the host. I rebooted the hosts and saw that the IB cards had completely disappeared from the PCI bus. I rebooted the systems several more times, and one of the IB cards reappeared, but the other one is still missing. I disconnected all cables from that host and even removed and reseated the card, but this had no effect.

An important detail: when I boot either node, one of the first screens in the boot process shows a message from the IB firmware. From there I can enter a menu to enable or disable SR-IOV, set the maximum number of physical functions, and change some other settings. When the IB card is missing from lspci, this firmware boot screen does not appear.
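
As a cross-check that does not depend on lspci, the PCI bus can also be scanned through sysfs for functions carrying the Mellanox vendor ID (0x15b3). The following is only a minimal sketch, assuming the standard Linux sysfs layout; the function name find_mellanox_devices and its sysfs_root parameter are hypothetical and exist only for illustration and testing.

```python
from pathlib import Path

MELLANOX_VENDOR_ID = "0x15b3"  # PCI vendor ID assigned to Mellanox Technologies


def find_mellanox_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return sorted PCI addresses whose vendor attribute matches Mellanox.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real host the default path is the one to use.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # entry without a readable vendor attribute
        if vendor == MELLANOX_VENDOR_ID:
            found.append(dev.name)
    return sorted(found)


if __name__ == "__main__":
    devices = find_mellanox_devices()
    if devices:
        print("Mellanox PCI functions:", ", ".join(devices))
    else:
        print("No Mellanox device visible on the PCI bus")
```

An empty result on the affected host would match what lspci shows: the card is not enumerated at all, so the problem sits below the driver level.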

I will now describe my system and outline the steps I took to configure IB passthrough. The hosts run Debian 9 with the IB drivers from the Debian repository. The guests run CentOS 7.3 with the Mellanox OFED distribution for CentOS 7.3. For virtualization I use QEMU/KVM with libvirt.

My card shows on the host as:

05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

05:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

05:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

05:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

05:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

Both host and guest use the mlx4_core driver. Here is a list of some of the modules loaded on the host:

Module Size Used by

mlx4_ib 163840 0

mlx4_en 114688 0

mlx4_core 303104 2 mlx4_en,mlx4_ib

kvm_intel 192512 0

kvm 589824 1 kvm_intel

irqbypass 16384 1 kvm

ib_umad 24576 0

ib_core 208896 2 ib_umad,mlx4_ib

I was also loading ib_ipoib on both the host and the guests, but on the guests it crashed the kernel.

Another suspicious thing happened when I attached virtual functions to the guest systems (sudo virsh attach-device …). The following messages appeared in the kernel log:

Jul 6 16:07:04 ib1 kernel: [ 281.707448] vfio-pci 0000:05:00.4: enabling device (0000 -> 0002)

Jul 6 16:07:06 ib1 kernel: [ 283.475412] virbr1: port 5(vnet3) entered learning state

Jul 6 16:07:08 ib1 kernel: [ 285.491419] virbr1: port 5(vnet3) entered forwarding state

Jul 6 16:07:08 ib1 kernel: [ 285.491424] virbr1: topology change detected, propagating

Jul 6 16:07:13 ib1 kernel: [ 290.895918] kvm [2264]: vcpu0, guest rIP: 0xffffffff81060d78 disabled perfctr wrmsr: 0xc2 data 0xffff

Jul 6 16:07:13 ib1 kernel: [ 290.933587] kvm: zapping shadow pages for mmio generation wraparound

Jul 6 16:07:13 ib1 kernel: [ 290.939149] kvm: zapping shadow pages for mmio generation wraparound

Jul 6 16:07:14 ib1 kernel: [ 291.721929] mlx4_core 0000:05:00.0: Received reset from slave:4

Jul 6 16:07:14 ib1 kernel: [ 291.767436] mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4

Jul 7 07:52:13 ib1 kernel: [56990.799006] mlx4_core 0000:05:00.0: mlx4_eq_int: slave:2, srq_no:0x41, event: 14(00)

Jul 7 07:52:13 ib1 kernel: [56990.799009] mlx4_core 0000:05:00.0: mlx4_eq_int: sending event 14(00) to slave:2

Jul 7 08:39:31 ib1 kernel: [59828.975516] mlx4_core 0000:05:00.0: Received reset from slave:4

Jul 7 08:39:31 ib1 kernel: [59829.044683] virbr1: port 5(vnet3) entered disabled state

Jul 7 08:39:31 ib1 kernel: [59829.044752] device vnet3 left promiscuous mode

Note the line with “Unknown command”.
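
To pick such lines out of a longer log, the mlx4_core messages can be filtered and the "Unknown command" ones flagged. This is a rough sketch only: the function name summarize_mlx4 and the regular expressions are assumptions based on the message format shown above, not on any documented log format.

```python
import re

# Matches lines such as:
#   "... mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4"
MLX4_LINE = re.compile(
    r"mlx4_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): (?P<msg>.*)$"
)
SLAVE = re.compile(r"slave:\s*(\d+)")


def summarize_mlx4(log_text):
    """Collect mlx4_core messages and flag the 'Unknown command' ones."""
    events = []
    for line in log_text.splitlines():
        m = MLX4_LINE.search(line)
        if not m:
            continue  # not an mlx4_core message
        msg = m.group("msg").strip()
        slave = SLAVE.search(msg)
        events.append({
            "pci": m.group("pci"),
            "slave": int(slave.group(1)) if slave else None,
            "message": msg,
            "suspicious": "Unknown command" in msg,
        })
    return events


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for event in summarize_mlx4(sys.stdin.read()):
        mark = "!!" if event["suspicious"] else "  "
        print(mark, event["pci"], "slave", event["slave"], event["message"])
```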

I did not update the firmware, at least not recently.

ibstat on the working system reports the following:

CA 'mlx4_0'

CA type: MT4099

Number of ports: 2

Firmware version: 2.34.5000

Hardware version: 0

Node GUID: 0xf45214030010a4a0

System image GUID: 0xf45214030010a4a3

Port 1:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x0250486a

Port GUID: 0xf45214030010a4a1

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x0250486a

Port GUID: 0xf45214030010a4a2

Link layer: InfiniBand
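
Both ports above report State: Down and Physical state: Polling. As a minimal sketch for pulling those two fields out of saved ibstat output, one could use something like the following; the function name port_states is hypothetical, and the parser assumes the text layout shown above rather than any guaranteed output format.

```python
def port_states(ibstat_text):
    """Return {port_number: {"State": ..., "Physical state": ...}}."""
    ports = {}
    current = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Port ") and line.endswith(":"):
            # "Port 1:" opens a port section; "Port GUID: ..." has a value
            # after the colon, so it does not end with ":" and is skipped here.
            current = int(line[len("Port "):-1])
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in ("State", "Physical state"):
                ports[current][key] = value.strip()
    return ports


if __name__ == "__main__":
    import sys

    # Usage: ibstat | python3 this_script.py
    for number, fields in sorted(port_states(sys.stdin.read()).items()):
        print(number, fields.get("State"), fields.get("Physical state"))
```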

Could you help me get my card back?

Hi,

I will explain in a few lines:

Debian 9 is not yet supported. To investigate this kind of module loading and adapter initialization problem, it would be preferable to install the latest official driver version and check whether the symptom also occurs there.

This distribution does not allow us to check that.

Does the problem also occur without SR-IOV?

Do you also have a problem loading ib_ipoib without SR-IOV?

Can you send me your dmesg output (with SR-IOV disabled) so that I can see the driver init phase?

Thanks

Marc

Debian 9 is not supported

Hi,

FYI:

MOFED 4.2 is planned for the end of October 2017 and will support Debian 9.

BR

Marc