Mellanox Infiniband ConnectX-5

I am trying to attach a Mellanox ConnectX-5 to a KVM/QEMU VM, because the application I need to run in the virtual machine requires RDMA access.

I followed this guide:
https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet#jive_content_id_Overview

I managed to create 16 VFs (4 per port) and everything looked fine until I tried to add one to the KVM/QEMU VM:

error: Failed to start domain Ubuntu_test
error: internal error: qemu unexpectedly closed the monitor: 2022-12-16T14:28:08.442593Z qemu-system-x86_64: -device vfio-pci,host=0000:81:00.2,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:81:00.2: group 96 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.

After looking further, I noticed that each card, together with all of its VFs, sits entirely in a single IOMMU group, which I suspect is the problem:

Card 1:
IOMMU Group 96 80:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 96 80:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 96 81:00.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 96 81:00.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 96 81:00.2 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.3 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.4 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.5 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.6 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:00.7 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:01.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 96 81:01.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]

Card 2:
IOMMU Group 79 c0:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 79 c0:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 79 c1:00.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 79 c1:00.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 79 c1:00.2 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.3 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.4 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.5 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.6 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:00.7 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:01.0 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
IOMMU Group 79 c1:01.1 Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] [15b3:101a]
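
For anyone reproducing the listing above, here is a minimal sketch that walks the IOMMU groups in sysfs and prints the group number and PCI address of each device. The function name `list_iommu_groups` and its optional root argument are illustrative assumptions (the argument exists only so the function can be pointed at a test directory); the `/sys/kernel/iommu_groups` layout is the one shown later in this thread.

```shell
# Print "IOMMU Group <n> <pci address>" for every device in every group.
# The optional first argument is a hypothetical override of the sysfs root.
list_iommu_groups() (
    iommu_root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$iommu_root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        group="${dev%/devices/*}"      # strip "/devices/<addr>"
        group="${group##*/}"           # keep only the group number
        addr="${dev##*/}"              # PCI address, e.g. 0000:81:00.2
        echo "IOMMU Group $group $addr"
    done
)
```

Calling `list_iommu_groups` with no argument reads the live system. Passing each address to `lspci -nns <address>` should give the device descriptions seen in the listing.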

Even unbinding the card does not get me any further. Do I need to get each of these into its own IOMMU group, and if so, how? Alternatively, is there another way to pass the device through to the VM with RDMA working?

The host is running Ubuntu 20.04:
uname -a
Linux hostserver1 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Thank you

Make sure both the bare-metal host (hypervisor) and the VMs are running our MLNX_OFED driver.

Our driver also embeds the supported FW.

The error you are getting (posted below) points to an issue with the vfio driver and the IOMMU group. It is not related to SR-IOV or NVIDIA (that component is not developed by NVIDIA).

error: Failed to start domain Ubuntu_test

error: internal error : qemu unexpectedly closed the monitor: 2022-12-16T14:28:08.442593Z qemu-system-x86_64: -device vfio-pci,host=0000:81:00.2,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:81:00.2: group 96 is not viable.

Some pointers below:

Please ensure all devices within the iommu_group are bound to their vfio bus driver.

Does the dmesg/syslog file report 0000:81:00.2 being added to group 96?

Do you see this virtual function (0000:81:00.2) under /sys/kernel/iommu_groups/96/devices?

Does your /proc/cmdline have iommu=pt & _iommu=on?

Is IOMMU at the BIOS set to ON, AUTO or DISABLED?

Are you getting the following warning “Warning: Your system has booted with the PCIe ACS Override setting enabled. The below list doesn’t not reflect the way IOMMU would naturally group devices.
To see natural IOMMU groups for your hardware, go to the VM Settings page and set the PCIe ACS Override setting to No”.

Sophie.

You can also configure SRIOV or PCIe passthrough as options.

Sophie.

Hi! Sorry for the late reply, I had a long vacation :) Best wishes for 2023!

Make sure both the bare-metal host (hypervisor) and the VMs are running our MLNX_OFED driver.

c1:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
	Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
	Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.2 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.3 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.4 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.5 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.6 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:00.7 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:01.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
c1:01.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

Our driver also embeds the supported FW.

I am not sure where to find the version information. However, using mlxconfig I am able to set SRIOV_EN and NUM_OF_VFS, followed by:

echo 4 > /sys/class/net/enp193s0f1np1/device/mlx5_num_vfs

to activate the VFs
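
To confirm that the VFs actually came up, the standard kernel SR-IOV attributes in the same device directory can be read back. This is only a sketch: `show_vfs` is a hypothetical helper name, and it relies on the generic `sriov_totalvfs`, `sriov_numvfs` and `virtfnN` entries rather than the `mlx5_num_vfs` file used above.

```shell
# Report how many VFs are active on a PF and which PCI device each maps to.
# dev_dir is the PF's device directory, e.g. /sys/class/net/<iface>/device.
show_vfs() (
    dev_dir="${1:?usage: show_vfs <pf device dir>}"
    total="$(cat "$dev_dir/sriov_totalvfs")"
    active="$(cat "$dev_dir/sriov_numvfs")"
    echo "VFs active: $active of $total"
    # Each active VF appears as a virtfnN symlink to its own PCI device.
    for vf in "$dev_dir"/virtfn*; do
        [ -e "$vf" ] || continue
        echo "${vf##*/} -> $(basename "$(readlink -f "$vf")")"
    done
)
```

For the interface in this post, the call would be `show_vfs /sys/class/net/enp193s0f1np1/device`.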

The error you are getting (posted below) points to an issue with the vfio driver and the IOMMU group. It is not related to SR-IOV or NVIDIA (that component is not developed by NVIDIA).

error: Failed to start domain Ubuntu_test

error: internal error : qemu unexpectedly closed the monitor: 2022-12-16T14:28:08.442593Z qemu-system-x86_64: -device vfio-pci,host=0000:81:00.2,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:81:00.2: group 96 is not viable.

Some pointers below:

Please ensure all devices within the iommu_group are bound to their vfio bus driver.

I will have to take a look into this.


Does the dmesg/syslog file report 0000:81:00.2 being added to group 96?
It does:

Jan 13 11:26:45 gc-hs1 kernel: [   30.351803] pci 0000:81:00.2: Adding to iommu group 96
Jan 13 11:26:45 gc-hs1 kernel: [   30.817903] pci 0000:81:00.3: Adding to iommu group 96
Jan 13 11:26:46 gc-hs1 kernel: [   32.000415] pci 0000:81:00.4: Adding to iommu group 96
Jan 13 11:26:48 gc-hs1 kernel: [   33.988886] pci 0000:81:00.5: Adding to iommu group 96
Jan 13 11:26:50 gc-hs1 kernel: [   35.557701] pci 0000:81:00.6: Adding to iommu group 96
Jan 13 11:26:50 gc-hs1 kernel: [   35.976102] pci 0000:81:00.7: Adding to iommu group 96
Jan 13 11:26:51 gc-hs1 kernel: [   36.429336] pci 0000:81:01.0: Adding to iommu group 96
Jan 13 11:26:51 gc-hs1 kernel: [   36.936331] pci 0000:81:01.1: Adding to iommu group 96
Jan 13 11:26:52 gc-hs1 kernel: [   37.785676] pci 0000:c1:00.6: Adding to iommu group 79
Jan 13 11:26:52 gc-hs1 kernel: [   38.204330] pci 0000:c1:00.7: Adding to iommu group 79
Jan 13 11:26:53 gc-hs1 kernel: [   38.627518] pci 0000:c1:01.0: Adding to iommu group 79
Jan 13 11:26:53 gc-hs1 kernel: [   39.061355] pci 0000:c1:01.1: Adding to iommu group 79
Jan 13 11:26:54 gc-hs1 kernel: [   39.685977] pci 0000:c1:00.2: Adding to iommu group 79
Jan 13 11:26:54 gc-hs1 kernel: [   40.110108] pci 0000:c1:00.3: Adding to iommu group 79
Jan 13 11:26:55 gc-hs1 kernel: [   40.575873] pci 0000:c1:00.4: Adding to iommu group 79
Jan 13 11:26:55 gc-hs1 kernel: [   41.043929] pci 0000:c1:00.5: Adding to iommu group 79

These are all the VFs, spread over 2 ports on each of 2 network adapters (16 VFs in total).

Do you see this virtual function (0000:81:00.2) under /sys/kernel/iommu_groups/96/devices?
I do see them indeed:

 # ls /sys/kernel/iommu_groups/96/devices
 0000:80:01.0  0000:81:00.0  0000:81:00.2  0000:81:00.4	0000:81:00.6  0000:81:01.0
 0000:80:01.1  0000:81:00.1  0000:81:00.3  0000:81:00.5	0000:81:00.7  0000:81:01.1
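
Since the vfio error asks that all devices in the group be bound to their vfio bus driver, it can help to see which driver each device in the group is currently bound to. The following is a rough sketch under that assumption. `group_drivers` is a hypothetical helper name, and it relies on each PCI device directory exposing a `driver` symlink when a driver is bound.

```shell
# Print "<pci address> <bound driver>" for each device in one IOMMU group.
# group_dir is e.g. /sys/kernel/iommu_groups/96
group_drivers() (
    group_dir="${1:?usage: group_drivers <iommu group dir>}"
    for dev in "$group_dir"/devices/*; do
        [ -e "$dev" ] || continue
        if [ -e "$dev/driver" ]; then
            # "driver" is a symlink to the bound driver; take its name
            drv="$(basename "$(readlink -f "$dev/driver")")"
        else
            drv="(none)"
        fi
        echo "${dev##*/} $drv"
    done
)
```

Running `group_drivers /sys/kernel/iommu_groups/96` on this host would presumably show the PFs and the remaining VFs still bound to mlx5_core, which matches the lspci output earlier in this post.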

Does your /proc/cmdline have iommu=pt & _iommu=on?
These are enabled as well:

# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.4.0-136-generic root=UUID=0082baa9-9e77-4756-b183-6993b420bcca ro iommu=pt amd_iommu=on
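
The same check can be scripted when several hosts need verifying. This is a small sketch, not an official tool: `check_iommu_cmdline` is a hypothetical name, and it only looks for the literal tokens `iommu=pt` and `amd_iommu=on` or `intel_iommu=on` on the kernel command line.

```shell
# Succeeds only if both iommu=pt and a vendor "<vendor>_iommu=on" flag are present.
# With no argument, the running kernel's command line is inspected.
check_iommu_cmdline() (
    cmdline="${1:-$(cat /proc/cmdline)}"
    pt=no
    vendor=no
    set -f                              # no globbing while word-splitting
    for arg in $cmdline; do
        case "$arg" in
            iommu=pt) pt=yes ;;
            amd_iommu=on|intel_iommu=on) vendor=yes ;;
        esac
    done
    echo "iommu=pt: $pt, vendor iommu=on: $vendor"
    [ "$pt" = yes ] && [ "$vendor" = yes ]
)
```

The function prints what it found and returns a non-zero status if either flag is missing, so it can be used directly in an `if`.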

Is IOMMU at the BIOS set to ON, AUTO or DISABLED?

Both SR-IOV and IOMMU are explicitly set to ENABLED (not AUTO).

Are you getting the following warning “Warning: Your system has booted with the PCIe ACS Override setting enabled. The below list doesn’t not reflect the way IOMMU would naturally group devices.
To see natural IOMMU groups for your hardware, go to the VM Settings page and set the PCIe ACS Override setting to No”.

I could not find this warning, or anything resembling it, in dmesg or syslog.

One thing to add: I am currently assigning only 1 VF to the VM. I am aware that devices sharing an IOMMU group can cause issues in some cases, and I wonder whether that is the case here.

Anything that could point me in the right direction to get this working is much appreciated!

Since my other post is still under review, this one may appear out of order. Anyway:

I got it working by passing the whole card through to the VM, but that is not what I want.

Since all VFs and the whole card are in the same IOMMU group, this was so far the only way to do it. However, I only need 1 VF in the VM. I assume VF functionality is supposed to allow this?

Additional information: the server uses an AMD EPYC processor. I am aware that AMD platforms do not always produce clean IOMMU groups, but it could also be that I am simply missing a key setting that would put the VFs into separate IOMMU groups.

I hope you can advise on this, thank you!

You can try this solution:
https://enterprise-support.nvidia.com/s/article/PCIe-AER-Advanced-Error-Reporting-and-ACS-Access-Control-Services-BIOS-Settings-for-vGPUs-that-Support-SR-IOV

Hi michaelsav,

Thank you, this looks very promising. I ran some tests and was able to assign 2 VFs separately to the VM. I also assigned IP addresses to test whether I could reach the other side.

The options that made the difference were mainly AER and ACS, as the others were already enabled.

I have not been able to test RDMA yet; I can do that on Friday.

I will mark this as the solution to the main question (how to get the VFs into their own IOMMU groups).

Unfortunately, I have hit a different issue: RDMA does not work.

# rping -c -Vv -C5 -a 10.44.44.162 -p 9999
Segmentation fault (core dumped)

[ 188.565871] rping[2090]: segfault at 0 ip 00007f8300347490 sp 00007f82fffdcbe0 error 4 in librdmacm.so.1.3.43.0[7f8300341000+15000]
[ 188.565883] Code: 00 4c 89 ff 44 8b 18 45 85 db 0f 84 8a 00 00 00 e8 d5 ca ff ff 89 45 14 85 c0 0f 85 8a 00 00 00 48 8b bd 50 01 00 00 48 8b 07 <48> 8b 00 44 8b 50 14 45 85 d2 0f 85 60 fa ff ff e8 8b be ff ff 89

ibv_devices

device          	   node GUID
------          	----------------
mlx5_0          	0000000000000000
mlx5_1          	0000000000000000
mlx5_2          	0000000000000000
mlx5_3          	0000000000000000
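
Judging by the fix at the end of this thread, the all-zero node GUIDs above are the likely cause of the rping crash. Here is a quick sketch for spotting such devices without parsing the `ibv_devices` table. `zero_guid_devices` is a hypothetical name, and the sketch assumes each device exposes its GUID in a `node_guid` file under `/sys/class/infiniband/<device>/`.

```shell
# Print the name of every RDMA device whose node GUID is all zeros.
# The optional first argument is a hypothetical override of the sysfs root.
zero_guid_devices() (
    ib_root="${1:-/sys/class/infiniband}"
    for f in "$ib_root"/*/node_guid; do
        [ -e "$f" ] || continue
        guid="$(tr -d ':\n' < "$f")"    # e.g. 0000000000000000
        case "$guid" in
            *[!0]*) ;;                  # at least one non-zero digit: fine
            *) dev="${f%/node_guid}"; echo "${dev##*/}" ;;
        esac
    done
)
```

An empty result means every device reports a non-zero node GUID.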

I found a related post: "I am trying to setup CX5 RDMA in between two KVM guests (one in each physical node) but failed with rping segfault ... in librdmacm.so.1.1.20.2. Is there any articles here I can follow to resolve the issue?"

That is basically what I am trying to do as well. However, I am running firmware 16.35.2000 (the latest), yet I cannot set the GUID.

As in that post, I also get an "operation not permitted" MFE_CR_ERROR.

Do you have an alternative guide for setting the GUID? The links in that forum post no longer work.

Thank you!

Hi All,

I managed to solve it. The documentation may not be fully up to date.

All I had to do was:
echo 00:11:22:33:44:55:1:0 > /sys/class/net/enp193s0f0np0/device/sriov/0/node

I did this for each VF. The documentation uses a path under /sys/class/infiniband/mlx5_x, but on my system none of the VF-related entries there had infiniband/device/sriov/0/node (only the master/physical ports had it).
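
For several VFs, the step above can be wrapped in a loop. This is only a sketch built on the path from this post: `set_vf_node_guids` is a hypothetical helper, and the GUID scheme (a fixed prefix with the VF index appended as the last field) is an illustrative assumption. Pick GUIDs that are unique on your fabric.

```shell
# Write a node GUID of the form "<prefix>:<vf index>" for every VF of a PF.
# dev_dir is the PF's device directory, e.g. /sys/class/net/<iface>/device.
set_vf_node_guids() (
    dev_dir="${1:?usage: set_vf_node_guids <pf device dir> <guid prefix>}"
    prefix="${2:?missing GUID prefix, e.g. 00:11:22:33:44:55:1}"
    for vf_dir in "$dev_dir"/sriov/*/; do
        [ -e "${vf_dir}node" ] || continue   # skip entries without a node file
        vf="$(basename "$vf_dir")"           # VF index, e.g. 0
        echo "$prefix:$vf" > "${vf_dir}node"
        echo "VF $vf node GUID set to $prefix:$vf"
    done
)
```

With the values from this post, `set_vf_node_guids /sys/class/net/enp193s0f0np0/device 00:11:22:33:44:55:1` would write `00:11:22:33:44:55:1:0` for VF 0, matching the manual command above.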

Thank you again for your support!

Can you please point to the relevant section in the doc?
We will make sure to fix this.