ConnectX-3 Pro VXLAN Performance Overhead

Hi,

I’m testing out ConnectX-3 Pro with VXLAN offload in our lab. Using a single-stream iperf performance test, we get ~34 Gbit/s for non-VXLAN transport, but only ~28 Gbit/s with VXLAN encapsulation.

In both cases the bottleneck is the CPU on the receiving side. Looking at a perf dump, these are the top consumers:

Without VXLAN:

  • 24.27% iperf [kernel.kallsyms] [k] copy_user_enhanced_fast_string

  • 6.49% iperf [kernel.kallsyms] [k] mlx4_en_process_rx_cq

  • 5.34% iperf [kernel.kallsyms] [k] tcp_gro_receive

  • 3.43% iperf [kernel.kallsyms] [k] dev_gro_receive

  • 3.28% iperf [kernel.kallsyms] [k] mlx4_en_complete_rx_desc

  • 3.05% iperf [kernel.kallsyms] [k] memcpy

  • 2.88% iperf [kernel.kallsyms] [k] inet_gro_receive

With VXLAN:

  • 20.06% iperf [kernel.kallsyms] [k] copy_user_enhanced_fast_string

  • 6.04% iperf [kernel.kallsyms] [k] mlx4_en_process_rx_cq

  • 5.43% iperf [kernel.kallsyms] [k] inet_gro_receive

  • 3.29% iperf [kernel.kallsyms] [k] dev_gro_receive

  • 3.24% iperf [kernel.kallsyms] [k] tcp_gro_receive

  • 3.08% iperf [kernel.kallsyms] [k] skb_gro_receive

  • 3.02% iperf [kernel.kallsyms] [k] memcpy

  • 2.85% iperf [kernel.kallsyms] [k] mlx4_en_complete_rx_desc

This is CentOS 6.5, kernel 3.15.0, firmware 2.31.5050.
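For reference, the profiles above come from a plain perf run; a minimal sketch of the kind of invocation (the exact options here are an assumption, not a record of what was run):

# Sample the whole system, with call graphs, for 30 s while the iperf stream is running:
perf record -a -g -- sleep 30
# Summarize the hottest symbols, as in the lists above:
perf report --stdio --sort symbol | head -30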

We’re certainly happy with 28 Gbit/s, but I’m wondering whether there are plans to improve this to the point that VXLAN adds no additional CPU overhead at all, or whether there is any tuning I can do toward the same goal?

  • Thorvald

About PlumGrid:

PlumGrid and Mellanox published a new white paper about creating a better network infrastructure for a large-scale OpenStack cloud by using Mellanox’s ConnectX-3 Pro VXLAN HW offload.

The PlumGrid VNI (Virtual Network Infrastructure) running over Mellanox switches and ConnectX-3 Pro adapters is a unique offering targeted for large-scale data centers.

With the ConnectX-3 Pro stateless HW offload, users can achieve:

  • Linear improvement in VM performance, up to near line-rate performance (36 Gbps with eight VM pairs generating traffic at maximum rates).

  • Virtually constant CPU utilization on both the TX and RX ends while throughput grows to 36 Gbps.

The white paper is available on the PlumGrid website: http://www.plumgrid.com/wp-content/uploads/documents/PLUMgrid_Mellanox_WP.pdf

PlumGrid VNI 3.0 is a software networking product for large-scale OpenStack clouds. It provides a network-fabric-agnostic, turnkey solution to build a scalable cloud infrastructure and offer advanced, on-demand network services to cloud tenants. To find out more, see http://www.plumgrid.com/product/overview/

#!/bin/bash
set -x

DEV=mlx4   # the ConnectX-3 Pro port (ethX renamed to mlx4)
NET=21     # per-host octet; set differently on each machine

# Reset the physical port and remove any previous VXLAN device.
ip addr flush dev $DEV
ip link set dev $DEV down
ip link del vxlan0

# Underlay: jumbo frames, address, and route on the physical port.
ip link set dev $DEV mtu 9000
ip addr add 10.224.$NET.27/24 brd + dev $DEV
ip link set dev $DEV up
ip route add 10.224.0.0/12 via 10.224.$NET.1

# Overlay: VXLAN VNI 17, multicast group 239.1.1.17, over the physical port.
ip link add vxlan0 type vxlan id 17 group 239.1.1.17 dev $DEV
ip addr add 172.18.1.$NET/24 brd + dev vxlan0
ip link set dev vxlan0 up

This is run on both machines (with a different NET value on each), bare metal with no VMs. mlx4 is the ethX device, renamed.
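For reference, the single-stream test is plain iperf between the two hosts, roughly along these lines (the exact options are an assumption; the addresses come from the script above):

# On the receiver (NET=21):
iperf -s
# On the sender, against the underlay address for the non-VXLAN baseline:
iperf -c 10.224.21.27 -t 60 -i 10
# ...and against the vxlan0 address for the encapsulated run:
iperf -c 172.18.1.21 -t 60 -i 10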

MTU 9000 is a new addition; with it I get ~38 Gbit/s when doing single-stream TCP testing on the mlx4 device, but VXLAN-encapsulated traffic stays at ~24 Gbit/s, CPU-bound on a single core.
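One thing worth double-checking (my assumption, not something confirmed in this thread) is that the vxlan0 MTU leaves room for the roughly 50 bytes of VXLAN/UDP/IP encapsulation on top of the 9000-byte underlay MTU:

# Show the MTU the kernel assigned to the tunnel device:
ip -d link show vxlan0
# If needed, set it explicitly to the underlay MTU minus ~50 bytes of encapsulation overhead:
ip link set dev vxlan0 mtu 8950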

The performance I am seeing is close to what you show in DOC-1456 for one VM pair. While I can get high aggregate throughput by running multiple streams, I could get similar aggregate performance by bonding four 10 Gbit/s connections. I’m really hoping to improve our single-stream speed.
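(For completeness, the multi-stream runs mentioned above are simply the same client invocation with parallel streams; the exact stream count shown is an assumption:)

# Four parallel TCP streams against the receiver's vxlan0 address:
iperf -c 172.18.1.21 -P 4 -t 60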

Hi Thorvald,

Did you run this test VM to VM or within the hypervisor? I assume VM to VM.

Is this only one flow (one VM) or more (several VMs on the same host)?

Which CPU are you using? How many cores? How much memory?

Do you use PCIe Gen3? (I assume you do)

Do you use MTU=1500?

If possible, try to run 2 or 4 VMs and see how it goes; it should be better.

The performance looks OK, but you could reach better numbers (close to line rate).

See this post: https://community.mellanox.com/s/article/vxlan-considerations-for-connectx-3-pro
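For example, you can verify that the VXLAN offloads are actually active on the port with ethtool (a sketch; the device name is taken from your script, and the exact feature names depend on the kernel):

# Check that UDP tunnel segmentation, RX checksumming, and GRO are enabled on the port:
ethtool -k mlx4 | grep -E 'udp_tnl|rx-checksumming|generic-receive-offload'
# If GRO is off, enable it on both the physical port and the tunnel device:
ethtool -K mlx4 gro on
ethtool -K vxlan0 gro on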

I added a performance slide and a link to the PlumGrid case study:

http://www.plumgrid.com/wp-content/uploads/documents/PLUMgrid_Mellanox_WP.pdf

Thanks,

Ophir.