I’m testing out ConnectX-3 Pro with VXLAN offload in our lab. Using a single-stream iperf performance test, we get ~34 Gbit/s for non-VXLAN transport, but only ~28 Gbit/s with VXLAN encapsulation.
In both cases, the bottleneck is the CPU on the receiving side. Looking at a perf dump, the top usage:
This is CentOS 6.5, kernel 3.15.0, firmware 2.31.5050.
We’re certainly happy with 28 Gbit/s, but I’m wondering if there are plans to improve this to the point that VXLAN adds no additional CPU overhead at all, or if there is any tuning I can do toward the same goal?
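For reference, this is roughly how I confirm the stateless VXLAN offloads are actually engaged on the NIC before testing; the exact ethtool feature names may vary between kernel and ethtool versions, so treat this as a sketch rather than anything Mellanox-specific:

# check that UDP-tunnel segmentation, RX checksumming and GRO are on
ethtool -k mlx4 | grep -E 'tx-udp_tnl-segmentation|rx-checksumming|generic-receive-offload'
# watch where the receive-side CPU time goes during the iperf run
perf top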
PlumGrid and Mellanox published a new white paper about creating a better network infrastructure for a large-scale OpenStack cloud by using Mellanox’s ConnectX-3 Pro VXLAN HW offload.
The PlumGrid VNI (Virtual Network Infrastructure) running over Mellanox switches and ConnectX-3 Pro adapters is a unique offering targeted for large-scale data centers.
With the ConnectX-3 Pro stateless HW offload, users can achieve:
- Linear improvement in VM performance up to near line-rate performance (36 Gbps with eight VM pairs generating traffic at maximum rates).
- CPU utilization remains virtually constant on both the TX and RX ends, while throughput grows to 36 Gbps.
PlumGrid VNI 3.0 is a software networking product for large-scale OpenStack clouds. It provides a network-fabric-agnostic, turnkey solution for building a scalable cloud infrastructure and offering advanced, on-demand network services to cloud tenants. To find out more, see http://www.plumgrid.com/product/overview/
ip link add vxlan0 type vxlan id 17 group 239.1.1.17 dev $DEV
ip addr add 172.18.1.$NET/24 brd + dev vxlan0
ip link set dev vxlan0 up
This is run on both machines (each with a different NET value), on bare metal with no VMs; mlx4 is the ethX device, renamed.
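To make the test itself explicit (the addresses and iperf invocation here are just what I happen to use, not anything prescribed), the only per-host difference is NET, and the single-stream run then goes over the tunnel addresses:

# on the receiving host (NET=1)
iperf -s
# on the sending host (NET=2), one TCP stream to the peer's vxlan0 address
iperf -c 172.18.1.1 -t 60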
MTU 9000 is a new addition; with that I get ~38 Gbit/s when doing single-stream TCP testing on the mlx4 device, but VXLAN-encapsulated traffic stays at ~24 Gbit/s, CPU-bound on a single core.
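One caveat with jumbo frames that may matter here (my own understanding, not from the white paper): VXLAN adds about 50 bytes of encapsulation (outer Ethernet + IP + UDP + VXLAN headers), so the tunnel MTU needs that much headroom below the underlay MTU or the encapsulated packets get fragmented. Roughly:

# jumbo frames on the underlay device
ip link set dev mlx4 mtu 9000
# leave ~50 bytes of headroom for the VXLAN encapsulation
ip link set dev vxlan0 mtu 8950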
The performance I am seeing is close to what you show in DOC-1456 for one VM pair. While I can get high aggregate performance by running multiple streams, I could get similar aggregate performance by bonding four 10 Gbit/s connections. I’m really hoping to improve our single-stream speeds.
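For completeness, these are the receive-side knobs I’ve been experimenting with so far; they are assumptions on my part rather than anything recommended in DOC-1456:

# make sure GRO is enabled on both the physical and the tunnel device
ethtool -K mlx4 gro on
ethtool -K vxlan0 gro on
# use RPS to move protocol processing off the interrupt CPU (mask f = CPUs 0-3)
echo f > /sys/class/net/vxlan0/queues/rx-0/rps_cpus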