Can't get RDMA working in a KVM VM on AMD EPYC machines.

Hello,

I struggle to get the following setup running:

I have multiple AMD EPYC machines with Mellanox MCX556A-ECAT cards (ConnectX-5).

I want to run a KVM virtual machine on these computers and have InfiniBand communication between different VMs (to get some MPI running).

Installation so far:

Host and guest OS is CentOS 7.7.

A subnet manager with virt_enabled set to 2 is running on one of the host systems.
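For reference, this is what the virtualization setting looks like in OpenSM's configuration (a sketch; the config path and service name are assumptions that may differ on your install):

```shell
# OpenSM virtualization support is controlled by virt_enabled in opensm.conf:
#   0 = ignore, 1 = disabled, 2 = enabled
# Verify the current setting (path assumed; adjust for your install):
grep virt_enabled /etc/opensm/opensm.conf

# After changing it, restart the SM so the setting takes effect
# (service name may vary between distributions):
systemctl restart opensmd
```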

I followed the steps in this guide.

I installed MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64 drivers on the hosts, enabled SR-IOV in the BIOS and set the kernel boot parameters amd_iommu=on and iommu=pt.

SR-IOV is enabled for the cards: I set /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs to 2, /sys/class/infiniband/mlx5_1/device/sriov/XXX/policy is set to Follow, and each VF gets a node and port GUID.

Then I unbind and rebind the driver. This results in a set of VF PCI devices, as expected.
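The VF provisioning steps above look roughly like this (a sketch following the Mellanox SR-IOV guides; the GUIDs and the VF PCI address are placeholders):

```shell
# Create 2 VFs on the PF exposed as mlx5_1
echo 2 > /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs

# Assign node/port GUIDs to VF 0 and set its policy to Follow
# (GUID values are placeholders)
echo 11:22:33:44:00:00:01:00 > /sys/class/infiniband/mlx5_1/device/sriov/0/node
echo 11:22:33:44:00:00:01:01 > /sys/class/infiniband/mlx5_1/device/sriov/0/port
echo Follow > /sys/class/infiniband/mlx5_1/device/sriov/0/policy

# Unbind and rebind the VF so the new GUIDs take effect
# (0000:81:00.1 is a placeholder VF PCI address)
echo 0000:81:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:00.1 > /sys/bus/pci/drivers/mlx5_core/bind
```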

I then attach one of the VF PCI devices to the KVM virtual machine via virsh and boot the VM. I can see the PCI device in the VM and installed MLNX_OFED_LINUX there as well. Then I set up IPoIB, and this works quite well.
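Passing the VF through with virsh looks roughly like this (a sketch; the domain name "myvm" and the PCI address are placeholders):

```shell
# vf0.xml describes the VF by its host PCI address (placeholder values)
cat > vf0.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
  </source>
</hostdev>
EOF

# Attach the VF to the VM's persistent configuration, then start it
virsh attach-device myvm vf0.xml --config
virsh start myvm
```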

Problem:

At this point, I can use IPoIB and I can also successfully use ibping. I can also use a tool like ib_write_bw from one of the hosts to one of the VMs. But when I use these tools from one VM to another host or another VM, I get the errors below. Higher-level tools built on libibverbs don't work either.

Output VM (ib_write_bw client)

ib_write_bw host0 -x 0


RDMA_Write BW Test
 Dual-port       : OFF    Device         : mlx5_0
 Number of qps   : 1      Transport type : IB
 Connection type : RC     Using SRQ      : OFF
 PCIe relax order: ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

ethernet_read_keys: Couldn't read remote address
Unable to read from socket/rdam_cm
Failed to exchange data between server and clients

Output other Host (ib_write_bw server)

ib_write_bw -d mlx5_1


* Waiting for client to connect… *

RDMA_Write BW Test
 Dual-port       : OFF    Device         : mlx5_1
 Number of qps   : 1      Transport type : IB
 Connection type : RC     Using SRQ      : OFF
 PCIe relax order: ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

local address: LID 0x0b QPN 0x090e PSN 0x251319 RKey 0x00256e VAddr 0x007f4e0629a000
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdam_cm
Failed to exchange data between server and clients

Output ibstat on VM

ibstat

CA 'mlx5_0'
    CA type: MT4120
    Number of ports: 1
    Firmware version: 16.27.1016
    Hardware version: 0
    Node GUID: 0x1122334400000100
    System image GUID: 0x506b4b03000cef94
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 4
        LMC: 0
        SM lid: 11
        Capability mask: 0x2651ec48
        Port GUID: 0x1122334400000101
        Link layer: InfiniBand

Any help is appreciated 🙂!

Best Regards

Jonathan

Hi Jonathan,

How many baremetal servers (KVM hosts), and how many VMs per baremetal server?

Is the Mellanox driver on the VMs the same as on the baremetal servers, 5.0?

How many ConnectX-5 HCA cards per baremetal server?

How many VFs per VM?

Can you use ib_write_bw between baremetal servers?

Can you use ib_write_bw between VMs belonging to the same baremetal server?

Can you use ib_write_bw between VMs belonging to different baremetal servers?

Same 3 questions above, but this time with ping?

Can you run any of the tests above adding option -R to ib_write_bw (client/server)?

Do you run these tests using the IPoIB IP addresses from the IB interfaces?

Are you getting the same results with ib_read_bw and ib_write_bw?

Can you confirm if the subnet prefix is the same for both IPs? (as applicable)

Is the firewall disabled?

Is SELINUX disabled?

Is the perftest package the same version/build?

Are all IB interfaces part of the same subnet?

Those are some pointers; however, I would suggest opening a case with Mellanox (supportadmin@mellanox.com) to investigate and troubleshoot further.

Sophie.

One more thing: the utility uses default port 18515. Is this port free? If not, you can use -p to use a different port (ib_write_bw --help).
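A quick sketch of running the test on a non-default port (host name is a placeholder):

```shell
# Server side, listening on TCP port 19000 instead of the default 18515:
ib_write_bw -d mlx5_1 -p 19000

# Client side, connecting to the same non-default port:
ib_write_bw host0 -p 19000
```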

Sophie.

Thank you very much for your detailed answer! I tried to answer all of your questions. I will follow your suggestion and write an email to support.

> How many baremetal servers (KVM hosts), and how many VMs per baremetal server?

Well, at the moment 2 servers with 1 VM per server. If this works, maybe more later on…

> Is the Mellanox driver on the VMs the same as on the baremetal servers, 5.0?

Yes. The output of mst version is "mst, mft 4.14.0-105, built on Feb 27 2020, 13:41:03. Git SHA Hash: 52fcec8" on the VM as well as on the baremetal.

> How many ConnectX-5 HCA cards per baremetal server?

1 card per server.

> How many VFs per VM?

For the moment, 4 VFs.

> Can you use ib_write_bw between baremetal servers?

Yes, this works.

> Can you use ib_write_bw between VMs belonging to the same baremetal server?

No, same problem.

> Can you use ib_write_bw between VMs belonging to different baremetal servers?

No, and that is exactly what I want to achieve.

> Same 3 questions above, but this time with ping?

This works in all configurations. Same with IPoIB.

> Can you run any of the tests above adding option -R to ib_write_bw (client/server)?

  • baremetal → baremetal: works
  • vm → baremetal, vm → vm, baremetal → vm: no connection (hangs indefinitely). A retry on the client side often throws the following error:

ib_write_bw -R 10.0.0.11

Unexpected CM event bl blka 8
Unable to perform rdma_client function
Unable to init the socket connection

> Do you run these tests using the IPoIB IP addresses from the IB interfaces?

Yes, I ran all tests using both the Ethernet IP and the IPoIB IP.

> Are you getting the same results with ib_read_bw and ib_write_bw?

Yes, here are the results again for ib_read_bw:

  • baremetal → baremetal: Works
  • vm → baremetal: hangs (same error as initially described - Unable to read to socket/rdam_cm).
  • baremetal → vm: Works
  • vm → vm: hangs (same error as initially described - Unable to read to socket/rdam_cm)

All tests with -R:

  • baremetal → baremetal with -R: Works
  • vm → baremetal, vm → vm, baremetal → vm: no connection (hangs indefinitely). A retry often throws the same error as with ib_write_bw.

> Can you confirm if the subnet prefix is the same for both IPs? (as applicable)

Yes, the Ethernet IPs share a subnet and the IPoIB IPs share another subnet.

> Is the firewall disabled?

firewalld is disabled, and there is no other firewall on the network.

> Is SELINUX disabled?

Yes, it is disabled.

> Is the perftest package the same version/build?

Yes, they are all on version 5.78.1.

> Are all IB interfaces part of the same subnet?

Yes, all are connected to the same IB switch, and there should be no other subnet present. However, there are two SMs running: one on the switch with the lowest priority, and one on one of the baremetal servers.
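With two SMs on the fabric, sminfo (run from any node) can confirm which one is currently master — only the master SM's virtualization setting matters for the VFs. A sketch:

```shell
# Print the LID, GUID, priority, and state of the fabric's master SM.
# Cross-check the reported SM LID against the host running opensm
# with virt_enabled 2.
sminfo
```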

> One more thing: the utility uses default port 18515. Is this port free? If not, you can use -p to use a different port (ib_write_bw --help).

Yes, the port is free. Using another port does not make a difference.

Hi Jonathan,

> Is the Mellanox driver on the VMs the same as on the baremetal servers, 5.0?

> Yes. The output of mst version is "mst, mft 4.14.0-105, built on Feb 27 2020, 13:41:03. Git SHA Hash: 52fcec8" on the VM as well as on the baremetal.

I was referring to the Mellanox driver, i.e. ofed_info -s. Make sure the Mellanox driver has been installed in the VM guests and is the same version as on the baremetal servers. You want to make sure the perftest utilities are running the same version, due to potential interoperability issues.
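Checking the driver version on both ends would look like this (assuming MLNX_OFED is installed in both places, since ofed_info ships with it):

```shell
# On the baremetal host: print the installed MLNX_OFED version string
ofed_info -s

# Inside the VM guest: run the same command; the two version strings
# should match (e.g. both reporting the 5.0 release)
ofed_info -s
```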

Sophie.