Cannot get 40Gbps on Ethernet mode with ConnectX-3 VPI

I have the Mellanox ConnectX-3 VPI IB/Ethernet adapter (MCX354A-FCBT-A4, PCIe Gen3 x8), but I cannot get 40Gbps in Ethernet mode. No matter what I have tried so far, I cannot exceed ~23Gbps when running iperf.

My test setup is as follows:

2 identical HP ProLiant DL360p Gen8 servers, each equipped with two quad-core Intel(R) Xeon(R) E5-2609 0 @ 2.40GHz CPUs and 32GB RAM. The OS is Ubuntu Linux 12.04.5 running kernel 3.2.0-70-generic and Mellanox OFED 2.3-1.0.1. In the BIOS, everything is set for maximum performance.
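(For anyone trying to reproduce this: a quick sanity check before any tuning is to confirm which NUMA node the adapter sits on, that the PCIe link actually trained at Gen3 x8, and that the CPU frequency governor is set to performance. A rough sketch, assuming the interface is eth6 as in my setup and that cpufreq is enabled:)

cat /sys/class/net/eth6/device/numa_node

lspci -s $(basename $(readlink /sys/class/net/eth6/device)) -vv | grep -i lnksta

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor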

The ConnectX-3 cards are connected back to back (no switch) with a Mellanox FDR copper cable (1m long).
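(A quick way to confirm that the link actually negotiated 40GbE on both sides, again assuming eth6, is:)

ethtool eth6 | grep -i -E 'speed|link detected'

(it should report Speed: 40000Mb/s and Link detected: yes when the back-to-back link is up at full speed.)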

The mlx4_core module is probed with the following options:

# The following two lines are added by the driver by default:

options mlx4_core fast_drop=1

options mlx4_core log_num_mgm_entry_size=-1

# The following line is added by me:

options mlx4_core num_vfs=16 port_type_array=2,2 probe_vf=0 enable_sys_tune=1
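(These options only take effect after the driver is reloaded or the machine is rebooted. A minimal sketch of how to reload and verify, assuming nothing else is holding the modules; the read-only module parameters are normally exposed under /sys/module/mlx4_core/parameters/:)

modprobe -r mlx4_en mlx4_ib mlx4_core

modprobe mlx4_core

cat /sys/module/mlx4_core/parameters/port_type_array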

I have read the Performance Tuning Guidelines for Mellanox Network Adapters and have tried countless combinations of the suggested tuning parameters, but I cannot reach 40Gbps.

Here is a list of the typical commands I run on both servers for eth6, which is the port I use for my experiments:

~# ibdev2netdev

mlx4_0 port 1 ==> eth6 (Up)

mlx4_0 port 2 ==> eth7 (Down)

sysctl -w net.ipv4.tcp_timestamps=0

sysctl -w net.ipv4.tcp_sack=1

sysctl -w net.core.netdev_max_backlog=250000

sysctl -w net.core.rmem_max=4194304

sysctl -w net.core.wmem_max=4194304

sysctl -w net.core.rmem_default=4194304

sysctl -w net.core.wmem_default=4194304

sysctl -w net.core.optmem_max=4194304

sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"

sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"

sysctl -w net.ipv4.tcp_low_latency=1

sysctl -w net.ipv4.tcp_adv_win_scale=1

ethtool -K eth6 lro on

service irqbalance stop

NUMA_NODE=$(cat /sys/class/net/eth6/device/numa_node)

set_irq_affinity_bynode.sh $NUMA_NODE eth6

cat /sys/devices/system/node/node${NUMA_NODE}/cpulist

TASKSET_VAR=$(cat /sys/devices/system/node/node${NUMA_NODE}/cpumap | rev | cut -f 1 -d, | rev)
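(To double-check that the IRQ affinity script really moved the interrupts to the NIC-local cores, the per-IRQ affinity can be inspected directly; a sketch, assuming the mlx4_en IRQs are named after eth6 in /proc/interrupts:)

grep eth6 /proc/interrupts

for irq in $(grep eth6 /proc/interrupts | cut -d: -f1); do cat /proc/irq/$irq/smp_affinity_list; done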

And finally, I run iperf like this:

taskset $TASKSET_VAR iperf -s # On server side (40.40.40.8)

taskset $TASKSET_VAR iperf -c 40.40.40.8 -t 43200 -i 2 -P 4 # On client side (40.40.40.7)
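(For reference, the iperf shipped with Ubuntu 12.04 also accepts an explicit per-stream window size, so a variant like the following is possible; the 2M value below is only an illustration, not a recommendation from the tuning guide:)

taskset $TASKSET_VAR iperf -c 40.40.40.8 -t 60 -i 2 -P 4 -w 2M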

Some additional strange behavior I observed is that when I try to use jumbo frames (MTU 9000 or 9600), I get worse performance compared to the default MTU setting (1500).
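(For the jumbo frame tests, the MTU has to be changed to the same value on both ends before re-running iperf, along these lines:)

ip link set dev eth6 mtu 9000 # run on both servers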

Any suggestions on what else I should look for?

Hi,

Happy to see you were able to find the missing performance numbers.

The performance tuning guide I provided above has the --set-priv-flags recommendation; see page 15.

There is also a kernel option mentioned in the procedure; check it out (for IPoIB).

Cheers!

Hi,

I saw you applied a good set of performance tuning parameters, but still go through the recommended ones and make sure you didn't leave anything important behind.

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

Also, try running your iperf client with -P (number of threads) equal to the number of processors in the machine.
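(For example, using nproc to pick up the number of online processors:)

iperf -c 40.40.40.8 -t 60 -i 2 -P $(nproc)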

Hi and thanks for the reply,

I have 8 cores in total on each server, but when I use more than 4 with iperf (e.g. -P 8), I get worse performance. I guess this is because I have two sockets, each with its own NUMA node.
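(One thing that may be worth trying instead of taskset is pinning both the CPUs and the memory of iperf to the NUMA node local to the NIC; a sketch, reusing the NUMA_NODE variable from above:)

numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE iperf -c 40.40.40.8 -t 60 -i 2 -P 4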

I installed CentOS 7 to experiment, and with the same optimizations I was able to get 39.6Gbps! However, these two OSs (Ubuntu 12.04 and CentOS 7) are completely different (kernel 3.2 vs 3.10), and since CentOS 7 was released very recently, it comes with a much more up-to-date set of tools.

MTU 9000 helped a lot in this case, as expected (contrary to what I saw on Ubuntu 12.04), but it was very unstable: across subsequent tests the results could vary anywhere from 20Gbps up to 39Gbps.

CentOS 7 also ships iperf3, and I was able to reach this high speed only when using a single stream (-P 1) and the --zerocopy option (this option does not exist in the old iperf version that comes with Ubuntu 12.04).
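(For reference, the client invocation looks roughly like this; the server side is simply iperf3 -s, and everything other than -P 1 and --zerocopy is just my usual set of parameters from above:)

iperf3 -c 40.40.40.8 -t 43200 -i 2 -P 1 --zerocopy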

Using the following optimization, as described in the tuning guide, solved the instability problem (or at least made it much more stable; maybe 1 run in 29/30 still gives me a speed of only around 20Gbps on iperf):

ethtool --set-priv-flags <eth_if> mlx4_rss_xor_hash_function on

Unfortunately, the version of ethtool on Ubuntu 12.04 does not support the --set-priv-flags option, so I cannot test there.
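(For anyone on a distribution with a newer ethtool, the current state of the private flags can be checked with:)

ethtool --show-priv-flags eth6

(which should list whether mlx4_rss_xor_hash_function is currently on or off.)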