CentOS 7 KVM-SR-IOV Performance?

Hello,

I'm having a serious problem getting the expected performance for KVM virtualization with SR-IOV on top of a CentOS 7 host.

I'm using a ConnectX-3 card with MLNX_OFED_LINUX-3.0-2.0.1. The firmware is up to date (2.34.5000).

I've verified that MPI performance is OK for a CentOS 6.5 bare-metal host and a CentOS 6.5 KVM image (both InfiniBand and Ethernet).

I used exactly the same OFED/firmware versions and the same applications for the CentOS 6.5 and CentOS 7 cases.

However, KVM SR-IOV performance is very poor on top of CentOS 7.

I tried different guest OSes (CentOS 6.5 and CentOS 7 VM images) on top of the CentOS 7 host, but the result is the same.

Between two CentOS 7 bare-metal machines, MPI performance is OK.

Between two CentOS 6.5 or CentOS 7 KVM images on top of CentOS 7 hosts, MPI performance becomes very poor for message sizes <= 32 KB.

For example, I get only 17% of the bandwidth with KVM SR-IOV at a 4 KB MPI message size.

I'm using the 3.10.0-229.4.2.el7.x86_64 kernel.

To improve KVM performance on CentOS 7 I could upgrade the kernel to v4.0.1, but that kernel is not supported by OFED.

I tuned the hypervisor following section 3.12.1 of the "Performance Tuning Guide for Mellanox Network Adapters" (Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf).

The BIOS setup is identical between the CentOS 6.5 and CentOS 7 tests.

Is there any way to get similar performance (bandwidth/latency) between two KVM SR-IOV images on top of CentOS 7 as well? Any help is welcome!

Hi Mikyung,

I don’t see any obvious problems there, but IIRC there is a lot of configuration required to make this work. For completeness, could you show us how you’ve configured your host NICs and drivers, and the same for the guests? Also the relevant flow-control settings on your switch(es), host, and guests (I believe you may need to set QoS parameters in the guest if you are not already)? And are you using the same OFED inside the guests/VMs?

Cheers,

Even with the CPUs at maximum speed (cpufreq scaling_governor set to performance), I’m getting a similar result: ib_write_bw performance from VM to host is poor.

From the output below I would expect the same results in both cases; however, A->B_vm is good but B_vm->A is bad. Getting ~3.5 GB/s in one direction shows there is no issue with the IB communication itself, so maybe it is how the ranks are bound on the VM or the physical host?

mpirun -np 2 -host A,B_vm : 3554.95 MB/s

mpirun -np 2 -host B_vm,A : 804.30 MB/s

Try the ib_read_bw and ib_send_bw utilities before MPI. Also check that your CPUs are running at maximum speed; it looks like they are not (1200 MHz vs. 1995 MHz).
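On CentOS 7 the governor can be inspected and forced to performance roughly like this (a sketch; cpupower ships in the kernel-tools package, and the sysfs paths can vary by system):

```shell
# Show the current governor and clock for each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
grep "cpu MHz" /proc/cpuinfo

# Force the performance governor on all cores (requires kernel-tools)
cpupower frequency-set -g performance
```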

Thanks, Blair! I’ve set up exactly the same OFED/FW/NIC/switch/OS versions on each host/VM (CentOS 6.5/7.1).

Host Driver Version … MLNX_OFED_LINUX-3.0-2.0.1 (OFED-3.0-2.0.0): modules

Firmware version: 2.34.5000

VLAN 1000 is set up on the switch/NIC

Linux xxx 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

6: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000

link/ether e4:1d:2d:01:12:40 brd ff:ff:ff:ff:ff:ff

vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

vf 9 MAC 5a:16:3e:6c:d9:2f, vlan 1000, spoof checking off, link-state auto

vf 15 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

07:01.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
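For reference, per-VF settings like the vf 9 line above are normally applied on the host with ip link; a sketch using the interface name, VF index, MAC, and VLAN from this output:

```shell
# Assign a MAC and VLAN 1000 to VF 9 on PF ens2, with spoof checking off
ip link set ens2 vf 9 mac 5a:16:3e:6c:d9:2f vlan 1000 spoofchk off
# Verify the per-VF state
ip link show ens2
```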

Linux mk-test-inst1.novalocal 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000

link/ether 5a:16:3e:6c:d9:1f brd ff:ff:ff:ff:ff:ff

00:04.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

I’m using the same machines/setup to check the results on both the two CentOS 6.5 hosts and the two CentOS 7.1 hosts. Even after building new VM images (CentOS 6.5/7.1), the bandwidth result is still bad. Is there something more I need to do on the CentOS 7.1 host (that isn’t needed on CentOS 6.5) to get reasonable bandwidth between the two VMs? Could you please explain the QoS parameters in the guest? I tried the same guest/VM (default QoS) on CentOS 6.5 and CentOS 7.1; only the VM on CentOS 7.1 has the problem.

Thanks, alkx!

I already tested with ib_read_bw and ib_send_bw as well; sample output is pasted below. As expected, VM->host is bad for message sizes < 32 KB.

Let me check again while changing the CPU speed.

[1] hostB → vm@hostA

ib_write_bw -R -F $hostA_vm_IP -a

#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]

4096 5000 4313.10 4311.29 1.103690

[2] vm@hostA → hostB

ib_write_bw -R -F $hostB_IP -a

#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]

4096 5000 806.55 802.18 0.205359

Thanks, Blair! I’ve pasted more detailed configuration and results for the hosts/VMs here.

  • mpirun result (bandwidth, 4KB message size)
  • between 2 hosts

mpirun -np 2 -host A,B : 3798.62 MB/s

mpirun -np 2 -host B,A : 3790.37 MB/s

  • between 1 host and the other host’s VM

mpirun -np 2 -host A,B_vm : 3554.95 MB/s

mpirun -np 2 -host B_vm,A : 804.30 MB/s

mpirun -np 2 -host B,A_vm : 3433.93 MB/s

mpirun -np 2 -host A_vm,B : 834.83 MB/s

  • between 2 VMs on different host

mpirun -np 2 -host A_vm,B_vm : 796.67 MB/s

mpirun -np 2 -host B_vm,A_vm : 789.85 MB/s

  • A host

[root@A tmp]# lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 16

On-line CPU(s) list: 0-15

Thread(s) per core: 1

Core(s) per socket: 8

Socket(s): 2

NUMA node(s): 2

Vendor ID: GenuineIntel

CPU family: 6

Model: 45

Model name: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

Stepping: 7

CPU MHz: 1200.000

BogoMIPS: 3993.96

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 20480K

NUMA node0 CPU(s): 0-7

NUMA node1 CPU(s): 8-15

[root@A tmp]# numactl -H

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3 4 5 6 7

node 0 size: 24541 MB

node 0 free: 484 MB

node 1 cpus: 8 9 10 11 12 13 14 15

node 1 size: 24575 MB

node 1 free: 21446 MB

node distances:

node 0 1

0: 10 20

1: 20 10

  • B host

[root@B tmp]# lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 16

On-line CPU(s) list: 0-15

Thread(s) per core: 1

Core(s) per socket: 8

Socket(s): 2

NUMA node(s): 2

Vendor ID: GenuineIntel

CPU family: 6

Model: 45

Model name: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

Stepping: 7

CPU MHz: 1200.000

BogoMIPS: 3993.95

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 20480K

NUMA node0 CPU(s): 0-7

NUMA node1 CPU(s): 8-15

[root@B tmp]# numactl -H

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3 4 5 6 7

node 0 size: 24541 MB

node 0 free: 7483 MB

node 1 cpus: 8 9 10 11 12 13 14 15

node 1 size: 24575 MB

node 1 free: 23911 MB

node distances:

node 0 1

0: 10 20

1: 20 10

  • A host’s VM

[root@A_vm]# lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 8

On-line CPU(s) list: 0-7

Thread(s) per core: 1

Core(s) per socket: 1

Socket(s): 8

NUMA node(s): 1

Vendor ID: GenuineIntel

CPU family: 6

Model: 13

Stepping: 3

CPU MHz: 1995.192

BogoMIPS: 3990.38

Hypervisor vendor: KVM

Virtualization type: full

L1d cache: 32K

L1i cache: 32K

L2 cache: 4096K

NUMA node0 CPU(s): 0-7

[root@A_vm]# numactl -H

available: 1 nodes (0)

node 0 cpus: 0 1 2 3 4 5 6 7

node 0 size: 15624 MB

node 0 free: 14788 MB

node distances:

node 0

0: 10

  • B host’s VM

[root@B_vm]# lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 8

On-line CPU(s) list: 0-7

Thread(s) per core: 1

Core(s) per socket: 1

Socket(s): 8

NUMA node(s): 1

Vendor ID: GenuineIntel

CPU family: 6

Model: 13

Stepping: 3

CPU MHz: 1995.191

BogoMIPS: 3990.38

Hypervisor vendor: KVM

Virtualization type: full

L1d cache: 32K

L1i cache: 32K

L2 cache: 4096K

NUMA node0 CPU(s): 0-7

[root@B_vm]# numactl -H

available: 1 nodes (0)

node 0 cpus: 0 1 2 3 4 5 6 7

node 0 size: 15624 MB

node 0 free: 14791 MB

node distances:

node 0

0: 10

  • libvirt xml

[root@hp4 pt2pt]# cat /tmp/test.xml

<domain type="kvm">
  <uuid>ab14e717-90a9-4085-9a32-f0b24430b2c0</uuid>
  <name>test</name>
  <memory>16000000</memory>
  <vcpu>8</vcpu>
  <sysinfo type="smbios">
    <system>
      <entry name="manufacturer">RDO Project</entry>
      <entry name="product">OpenStack Nova</entry>
      <entry name="version">2014.1.3-2.el7.centos</entry>
      <entry name="serial">16353439-3339-5553-4532-333845585934</entry>
      <entry name="uuid">ab14e717-90a9-4085-9a32-f0b24430b2c0</entry>
    </system>
  </sysinfo>
  <os>
    <type>hvm</type>
  </os>
  ...
</domain>

The patches provided by Red Hat at http://people.redhat.com/~alwillia/bz1299846/ solve the issue.

I downloaded and installed (yum install *.rpm) the three user-space packages (qemu-img, qemu-kvm and qemu-kvm-common) on the hypervisor.

Performance improved by as much as 90% and 65% for the 1 KB and 4 KB message sizes, respectively.

qemu-img-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm

qemu-kvm-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm

qemu-kvm-common-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm
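To confirm the patched builds actually replaced the stock packages on the hypervisor, a quick sketch:

```shell
# The bz1299846 builds should now appear in the package list
rpm -qa | grep qemu
# Restart the affected guests afterwards so they run on the new qemu-kvm binary
```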

This is listed in the MLNX_OFED 3.3-1.0.0.0 release notes under Performance Known Issues, #783496: “When using a VF over RH7.X KVM, low throughput is expected.”

http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_3_3-1_0_0_0.pdf

Hi Mikyung,

Are you sure your issue is not related to other virtualisation factors? E.g., are you pinning your VMs to CPU cores and exposing the host NUMA topology to them? If your VMs have memory accesses that cross NUMA nodes (i.e., have to cross QPI), that would explain performance degradation as the message size grows and the effect of the CPU caches shrinks until memory access dominates.
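For what it’s worth, pinning along those lines looks roughly like this in the libvirt domain XML (a sketch, assuming an 8-vCPU guest kept entirely on host NUMA node 0; adjust the cpusets to your topology):

```xml
<vcpu placement='static' cpuset='0-7'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <!-- ...one vcpupin per vCPU, all within node 0's cores... -->
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
```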

Good luck!

Thanks for your help, Blair!

Yes, I added NUMA information to the libvirt domain XML (libvirtd 1.2.8 / CentOS 7.1).

For a 4 KB message mpirun over 40G Ethernet, the result pattern is as follows:

  • hostA<->hostB (3797.05 MB/s)

  • hostA<->hostB’s VM (3521.03 MB/s)

  • hostA’s VM<->hostB’s VM (830.20 MB/s)

Thanks, Blair. I did check the NUMA topology: there are two NUMA nodes, each with 8 cores. I’ve measured MPI bandwidth while pinning the VM to different cores/NUMA nodes. For 1 B to 32 KB messages the performance is still very bad (<17% of the host result), even though the maximum bandwidth is fine at >=64 KB.

Hi Mikyung,

What does the NUMA topology look like inside your VMs, i.e., are you pinning memory nodes as well as CPUs? Do you have cpu numa elements in your libvirt domain XML, e.g., <numatune> and <cpu><numa> blocks?
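For reference, a guest NUMA topology element looks roughly like this (a sketch; the cpus/memory values are illustrative for an 8-vCPU, ~16 GB guest, with memory given in KiB):

```xml
<cpu>
  <numa>
    <cell cpus='0-7' memory='16000000'/>
  </numa>
</cpu>
```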

Hi Mikyung,

Your figures make it look like there might be a problem with hostB’s VM… do you get the same result (as hostA<->hostB’s VM, 3521.03 MB/s) when reversing to hostA’s VM<->hostB?

It might be useful if you dump more of your config here, e.g., lscpu / numactl -H on the hosts and inside the VMs, the libvirt XML, etc.