Poor InfiniBand performance on VMware ESXi 5.1

Hello,

I have a really weird problem with the InfiniBand connection between my ESXi hosts.

Here is my setup:

HP C7000 with BL685c G1 blades and an HP 4x DDR IB Switch Module. The blades are running VMware ESXi 5.1.0 U2 (custom HP image). I have also installed the Mellanox drivers (MLNX-OFED-ESX-1.8.1.0) and ib-opensm on each of the hosts, following the "Infiniband@home : votre homelab à 20Gbps" ("your homelab at 20 Gbps") guide on Hypervisor.fr. Here are the vmnics:

# esxcli network nic list | grep 10G

vmnic_ib0 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7d 4092 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

vmnic_ib1 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7e 1500 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

I have created a VMkernel port and a vSwitch; both the port group and the vSwitch are set up with a 4K MTU. I have also configured mlx4_core to support a 4K MTU:

# esxcli system module parameters list -m=mlx4_core | grep mtu_4k

mtu_4k int 1 configure 4k mtu (mtu_4k > 0)
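
For reference, this is roughly how the MTU was set on the vSwitch, the VMkernel interface, and the module (the vSwitch1/vmk1 names below are just placeholders, and the module parameter only takes effect after a reboot):

# esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=4092

# esxcli network ip interface set --interface-name=vmk1 --mtu=4092

# esxcli system module parameters set -m mlx4_core -p "mtu_4k=1"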

And here is the problem. When I am using MTU=1500:

/opt/iperf/bin # ./iperf -s


Server listening on TCP port 5001

TCP window size: 64.0 KByte (default)


[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 61140

[ ID] Interval Transfer Bandwidth

[ 4] 0.0-10.0 sec 3.98 GBytes 3.42 Gbits/sec

[ 5] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 58854

[ 5] 0.0-10.0 sec 4.53 GBytes 3.89 Gbits/sec

[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 51600

[ 4] 0.0-10.0 sec 3.66 GBytes 3.15 Gbits/sec

[ 5] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 60066

[ 5] 0.0-10.0 sec 4.52 GBytes 3.88 Gbits/sec

[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 50728

[ 4] 0.0-10.0 sec 4.71 GBytes 4.04 Gbits/sec

[ 5] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 58792

[ 5] 0.0-10.0 sec 4.54 GBytes 3.90 Gbits/sec

MTU=2000

/opt/iperf/bin # ./iperf -s


Server listening on TCP port 5001

TCP window size: 64.0 KByte (default)


[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 62523

[ ID] Interval Transfer Bandwidth

[ 4] 0.0-10.0 sec 5.35 GBytes 4.59 Gbits/sec

[ 5] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 56491

[ 5] 0.0-10.0 sec 5.43 GBytes 4.66 Gbits/sec

[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 63144

[ 4] 0.0-10.0 sec 4.41 GBytes 3.79 Gbits/sec

[ 5] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 53978

[ 5] 0.0-10.0 sec 4.43 GBytes 3.81 Gbits/sec

[ 4] local 192.168.13.39 port 5001 connected with 192.168.13.36 port 61886

[ 4] 0.0-10.0 sec 5.38 GBytes 4.62 Gbits/sec

MTU=4092

/opt/iperf/bin # ./iperf -c 192.168.13.39


Client connecting to 192.168.13.39, TCP port 5001

TCP window size: 75.5 KByte (default)


[ 3] local 192.168.13.36 port 50673 connected with 192.168.13.39 port 5001

[ ID] Interval Transfer Bandwidth

[ 3] 0.0-79.5 sec 8.00 GBytes 864 Mbits/sec

/opt/iperf/bin # ./iperf -c 192.168.13.39


Client connecting to 192.168.13.39, TCP port 5001

TCP window size: 75.5 KByte (default)


[ 3] local 192.168.13.36 port 49604 connected with 192.168.13.39 port 5001

[ ID] Interval Transfer Bandwidth

[ 3] 0.0-79.5 sec 8.00 GBytes 864 Mbits/sec

/opt/iperf/bin # ./iperf -c 192.168.13.39


Client connecting to 192.168.13.39, TCP port 5001

TCP window size: 35.5 KByte (default)


[ 3] local 192.168.13.36 port 58764 connected with 192.168.13.39 port 5001

[ ID] Interval Transfer Bandwidth

[ 3] 0.0-79.5 sec 8.00 GBytes 864 Mbits/sec

All the testing has been done with iperf. Any suggestions as to why I get lower throughput with MTU=4092 than with MTU=2000? AFAIK throughput should increase with a larger MTU (and that trend is visible going from MTU=1500 to MTU=2000).
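
If it helps, I can also rerun the tests with a larger TCP window and several parallel streams to rule out the default 64 KB window as the limiting factor, e.g. something like:

/opt/iperf/bin # ./iperf -c 192.168.13.39 -w 512k -P 4 -t 30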

Any input is welcome.

  1. I do have a partition defined; my partitions.conf has the following content:

Default=0x7fff,ipoib,mtu=5:ALL=full;

  2. According to the specifications, the HP 4x DDR IB Switch Module supports 4K MTU, but it is not really manageable, so it might be the switch's fault after all. I am waiting for a new Topspin switch, so if that is the problem it will be resolved.

3 and 4. I am going to update the drivers today (a rough install sketch follows below), though I can already see the InfiniBand ports reported as ConnectX family adapters, so I guess the firmware and the HCA itself support 4K. Still, I will do some research in that direction too.
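
For what it's worth, this is roughly how I plan to install the newer bundle (the path is just an example; the host goes into maintenance mode first and gets rebooted afterwards):

# esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.4-10EM-500.0.0.472560.zip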

Thanks a lot. I will keep you guys updated, and if anyone has any other ideas, please shoot. I really want to get this thing going so I can test the virtual storage.

A few ideas on where to look:

  1. Most likely you do not have 4K MTU set on the IB fabric itself. You need to make sure opensm is configured for a 4K MTU; it is most likely still at the default of 2044. If you have just one partition, add the following line to /etc/opensm/partitions.conf and then restart opensm (a quick end-to-end check is sketched after this list):

pkey0=0x7fff,ipoib,mtu=5 : ALL=full;

If your opensm runs on the switch, you will need to upload this file to the switch.

  2. Your switch may or may not support 4K MTU.

  3. Your card (quite old) and driver may or may not support 4K MTU. See page 18 of http://www.mellanox.com/related-docs/prod_software/Mellanox_IB_OFED_Driver_for_VMware_vSphere_User_Manual_Rev_1_8_2_4.pdf :

it says that “maximum value of JF supported by the InfiniBand device is: 2044 bytes for the InfiniHost III family and 4052 / 4092 bytes for ConnectX® IB family”

  4. In any case it also makes sense to go with the latest

driver: http://www.mellanox.com/downloads/Drivers/MLNX-OFED-ESX-1.8.2.4-10EM-500.0.0.472560.zip
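
Once the SM, the switch, and the HCA all agree on 4K, a quick way to check that the large MTU actually works end to end (just a suggestion; the size below assumes a 4092-byte MTU) is a vmkping with the don't-fragment bit set from one host to the other:

# vmkping -d -s 4064 192.168.13.39

4064 bytes of ICMP payload + 8 bytes ICMP header + 20 bytes IP header = 4092. If that ping fails while smaller sizes work, something in the path is still limited to a smaller MTU.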