I am trying to attach a Mellanox ConnectX-5 Ex to a KVM/QEMU VM, because the application I need to run inside the VM requires RDMA access.
I followed this guide:
https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet#jive_content_id_Overview
I managed to create 16 VFs (4 per port), and everything looked fine until I tried to attach one of the VFs to the KVM/QEMU domain:
error: Failed to start domain Ubuntu_test
error: internal error: qemu unexpectedly closed the monitor: 2022-12-16T14:28:08.442593Z qemu-system-x86_64: -device vfio-pci,host=0000:81:00.2,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:81:00.2: group 96 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
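After looking further, I noticed that each card sits entirely in a single IOMMU group, together with both of its physical functions, all of its VFs, and the upstream host bridge and PCI bridge. I suspect this is the problem: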
Card 1:
IOMMU Group 96 80:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 96 80:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 96 81:00.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 96 81:00.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 96 81:00.2 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.3 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.4 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.5 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.6 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.7 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:01.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:01.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
Card 2:
IOMMU Group 79 c0:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 79 c0:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 79 c1:00.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 79 c1:00.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 79 c1:00.2 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.3 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.4 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.5 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.6 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.7 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:01.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:01.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
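For anyone who wants to run the same check on their own host, here is a minimal sketch that walks the IOMMU groups and prints the driver each device is currently bound to. It assumes the standard sysfs layout under /sys/kernel/iommu_groups; the function name iommu_groups is my own choice for illustration, and the output does not include the device descriptions shown in the listing above.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to a list of (pci_address, driver) pairs.

    driver is None when the device has no driver bound.
    """
    groups = {}
    if not os.path.isdir(root):
        # No IOMMU groups exposed (IOMMU disabled, or not a Linux host).
        return groups
    for group in os.listdir(root):
        dev_dir = os.path.join(root, group, "devices")
        if not group.isdigit() or not os.path.isdir(dev_dir):
            continue
        devices = []
        for addr in sorted(os.listdir(dev_dir)):
            # Each device directory has a "driver" symlink when a driver is bound.
            link = os.path.join(dev_dir, addr, "driver")
            driver = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
            devices.append((addr, driver))
        groups[int(group)] = devices
    return groups


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        for addr, driver in devices:
            print(f"IOMMU Group {group} {addr} driver={driver or '(none)'}")
```

In a group like 96 above, this makes it easy to see which devices are bound to something other than vfio-pci at the moment QEMU reports the group as not viable.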
Even unbinding the card unfortunately does not get me any further. Do the VFs each need to be in their own IOMMU group, and if so, how do I achieve that? Alternatively, is there another way to pass the card through to the VM with RDMA working?
The host is running Ubuntu 20.04:
uname -a
Linux hostserver1 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Thank you