Hello,
I have two computers with Mellanox ConnectX-3 InfiniBand cards connected directly to each other. I configured several VMs on each node with SR-IOV passthrough of the InfiniBand cards. When I was mostly done, I also tried to configure IB so that it would be usable on the host. I rebooted the hosts and saw that the IB cards had completely disappeared from the PCI bus. I rebooted the systems several more times and one of the IB cards reappeared, but the other one is still missing. I disconnected all cables from that host and even unplugged and reseated the card, but this had no effect.
An important detail: when I boot either node, one of the first screens during the boot process shows a message from the IB firmware. From there I can enter a menu and enable or disable SR-IOV, set the maximum number of physical functions, and change some other settings. When the IB card is missing from lspci, this firmware boot screen does not appear.
Here is a description of my system and the steps I took to configure IB passthrough. The hosts run Debian 9 with the IB drivers from the Debian repository. The guests run CentOS 7.3 with the Mellanox OFED distribution for CentOS 7.3. For virtualization I use QEMU/KVM with libvirt.
The card shows up in lspci on the host as:
05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
05:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
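For anyone who wants to check a listing like this automatically, here is a minimal sketch that splits lspci output into the physical function and the virtual functions. The sample text is copied from the listing above; the function name split_functions is only a placeholder for illustration, and on a live host the text would come from running lspci rather than from a string.

```python
import re

# Sample copied from the lspci listing in this post.
LSPCI_OUTPUT = """\
05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
05:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
"""

# bus:device.function followed by the device description
LINE_RE = re.compile(r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?P<desc>.*)$")


def split_functions(text):
    """Return (physical, virtual) lists of PCI addresses of Mellanox devices."""
    physical, virtual = [], []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or "Mellanox" not in m.group("desc"):
            continue  # not a Mellanox device line
        if "Virtual Function" in m.group("desc"):
            virtual.append(m.group("addr"))
        else:
            physical.append(m.group("addr"))
    return physical, virtual


physical, virtual = split_functions(LSPCI_OUTPUT)
print("physical functions:", physical)
print("virtual functions:", virtual)
```

An empty physical list would correspond to the situation on the broken host, where the card no longer appears on the PCI bus at all.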
Both host and guest used the mlx4_core driver. Here is a list of some of the modules loaded on the host:
Module Size Used by
mlx4_ib 163840 0
mlx4_en 114688 0
mlx4_core 303104 2 mlx4_en,mlx4_ib
kvm_intel 192512 0
kvm 589824 1 kvm_intel
irqbypass 16384 1 kvm
ib_umad 24576 0
ib_core 208896 2 ib_umad,mlx4_ib
I was also loading ib_ipoib on both the host and the guest, but on the guest it crashed the kernel.
Another suspicious thing happened when I attached virtual functions to the guest systems (sudo virsh attach-device …). The following messages appeared in the kernel log:
Jul 6 16:07:04 ib1 kernel: [ 281.707448] vfio-pci 0000:05:00.4: enabling device (0000 -> 0002)
Jul 6 16:07:06 ib1 kernel: [ 283.475412] virbr1: port 5(vnet3) entered learning state
Jul 6 16:07:08 ib1 kernel: [ 285.491419] virbr1: port 5(vnet3) entered forwarding state
Jul 6 16:07:08 ib1 kernel: [ 285.491424] virbr1: topology change detected, propagating
Jul 6 16:07:13 ib1 kernel: [ 290.895918] kvm [2264]: vcpu0, guest rIP: 0xffffffff81060d78 disabled perfctr wrmsr: 0xc2 data 0xffff
Jul 6 16:07:13 ib1 kernel: [ 290.933587] kvm: zapping shadow pages for mmio generation wraparound
Jul 6 16:07:13 ib1 kernel: [ 290.939149] kvm: zapping shadow pages for mmio generation wraparound
Jul 6 16:07:14 ib1 kernel: [ 291.721929] mlx4_core 0000:05:00.0: Received reset from slave:4
Jul 6 16:07:14 ib1 kernel: [ 291.767436] mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4
Jul 7 07:52:13 ib1 kernel: [56990.799006] mlx4_core 0000:05:00.0: mlx4_eq_int: slave:2, srq_no:0x41, event: 14(00)
Jul 7 07:52:13 ib1 kernel: [56990.799009] mlx4_core 0000:05:00.0: mlx4_eq_int: sending event 14(00) to slave:2
Jul 7 08:39:31 ib1 kernel: [59828.975516] mlx4_core 0000:05:00.0: Received reset from slave:4
Jul 7 08:39:31 ib1 kernel: [59829.044683] virbr1: port 5(vnet3) entered disabled state
Jul 7 08:39:31 ib1 kernel: [59829.044752] device vnet3 left promiscuous mode
Note the line with “Unknown command”.
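To make the relevant lines easier to spot, here is a rough sketch that filters a kernel log for mlx4_core messages and pulls out the slave (virtual function) number where one is mentioned. The sample lines are copied from the log above; the name mlx4_events and the two regular expressions are my own assumptions about the message layout, based only on these few lines.

```python
import re

# A few lines copied from the kernel log in this post.
KERNEL_LOG = """\
Jul 6 16:07:14 ib1 kernel: [ 291.721929] mlx4_core 0000:05:00.0: Received reset from slave:4
Jul 6 16:07:14 ib1 kernel: [ 291.767436] mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4
Jul 7 07:52:13 ib1 kernel: [56990.799006] mlx4_core 0000:05:00.0: mlx4_eq_int: slave:2, srq_no:0x41, event: 14(00)
Jul 7 08:39:31 ib1 kernel: [59829.044683] virbr1: port 5(vnet3) entered disabled state
"""

# "mlx4_core <pci address>: <message>"
MLX4_RE = re.compile(r"mlx4_core (?P<dev>[0-9a-f:.]+): (?P<msg>.*)$")
SLAVE_RE = re.compile(r"slave:(\d+)")


def mlx4_events(text):
    """Return a list of (slave_number_or_None, message) for mlx4_core lines."""
    events = []
    for line in text.splitlines():
        m = MLX4_RE.search(line)
        if not m:
            continue  # skip lines from other drivers (virbr1, kvm, ...)
        slave = SLAVE_RE.search(m.group("msg"))
        events.append((int(slave.group(1)) if slave else None, m.group("msg")))
    return events


for slave, msg in mlx4_events(KERNEL_LOG):
    print(slave, msg)
```

Grouping by slave number this way shows that both the reset and the "Unknown command" message in this excerpt refer to slave 4.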
I did not update the firmware, at least not recently.
ibstat on the working system reports the following:
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.34.5000
Hardware version: 0
Node GUID: 0xf45214030010a4a0
System image GUID: 0xf45214030010a4a3
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0250486a
Port GUID: 0xf45214030010a4a1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0250486a
Port GUID: 0xf45214030010a4a2
Link layer: InfiniBand
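As a quick way to summarize output like this, here is a small sketch that collects the per-port fields from ibstat text into a dictionary. The sample is a shortened copy of the output above; the function name port_states is hypothetical, and the parser only assumes the "Port N:" and "key: value" layout visible in this post.

```python
# Shortened copy of the ibstat output from this post.
IBSTAT_OUTPUT = """\
CA 'mlx4_0'
Number of ports: 2
Firmware version: 2.34.5000
Port 1:
State: Down
Physical state: Polling
Port GUID: 0xf45214030010a4a1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Port GUID: 0xf45214030010a4a2
Link layer: InfiniBand
"""


def port_states(text):
    """Return {port_number: {field: value}} for each 'Port N:' section."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Port ") and line.endswith(":"):
            # Section header such as "Port 1:"
            current = int(line[len("Port "):-1])
            ports[current] = {}
        elif current is not None and ":" in line:
            # Field inside the current port section
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


for number, fields in sorted(port_states(IBSTAT_OUTPUT).items()):
    print(number, fields["State"], fields["Physical state"])
```

Lines before the first "Port N:" header, such as the firmware version, are ignored by this sketch.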
Could you help me get my card back?