IPoIB Performance - ESXi 5.1 U1

Hi all

First of all, I’d just like to say I think it’s excellent that Mellanox provides a forum for IB home labs / hobbyists - that’s very good service.

For me, my home lab is a way of testing out “new” solutions before I consider recommending them - I like to have full confidence in my recommendations, and of course it’s a great way to push your skills in ways you may not be able to in a corporate / budget environment.

Anyway, I am very new to IB, but with it becoming prominent in the Big Data market and also in VSANs (storage in the cabinet), I wanted to see what it was all about.

I have purchased the following (I realise it’s not cutting edge, but early next year I will be upgrading to a QDR / 40Gbps Mellanox IB switch with a built-in SM, assuming I can get this working the way I expect):

1 x Voltaire GridDirector ISR 9024D (not the M model)

2 x MHGH28-XTC (Rev X1) HCA cards - I flashed these to firmware version 2.7000

2 x CX4 cables

2 x VMware ESXi custom systems

2 x Intel 335 SSDs (500MB/s each) - in 2 weeks this will become 4 x Intel 335 SSDs (providing roughly 2GB/s of theoretical IO in RAID-0)

OK, I have installed the relevant drivers - for the sake of a simple guide (which can be corrected if you think I have missed / done something wrong), here is what I did:


[ INFINIBAND ]

  1. Install the Mellanox drivers (mlx4_en bundle, OFED and the opensm VIB)

esxcli system module parameters set -m mlx4_core -p mtu_4k=1

esxcli software vib install -d /tmp/mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip --no-sig-check

esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.0.zip

Installation Result

Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.

Reboot Required: true

VIBs Installed: Mellanox_bootbank_net-ib-cm_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-ib-core_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-ib-ipoib_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-ib-mad_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-ib-sa_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-ib-umad_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-mlx4-core_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_net-mlx4-ib_1.8.2.0-1OEM.500.0.0.472560, Mellanox_bootbank_scsi-ib-srp_1.8.2.0-1OEM.500.0.0.472560

VIBs Removed:

VIBs Skipped:

esxcli software acceptance set --level=CommunitySupported

esxcli software vib install -v /tmp/ib-opensm-3.3.16.x86_64.vib --no-sig-check

  2. Reboot

  3. Fix MTU and partitions.conf

vi /tmp/partitions.conf

Default=0x7fff,ipoib,mtu=5:ALL=full;

cp /tmp/partitions.conf /scratch/opensm/0x001a4bffff0c1399/

cp /tmp/partitions.conf /scratch/opensm/0x001a4bffff0c139a/

  4. Flashed both HCA cards to firmware 2.7000

  5. Created a virtual network in ESXi using one port on the HCA in each ESXi system - ESXi recognises this vnic as up at 20Gbps

  6. Tried to set the MTU > 2k but failed - it won’t go higher than 2044 in the vSwitch (see the sanity checks after this list)

  7. Created 2 x WIN7 systems, each with 2 x 4GHz vCPUs, 8GB RAM and 1 x SSD-based HDD (theoretical 500MB/s or slightly less IO - no other VM using this SSD datastore), and configured the NICs with IPs on the IPoIB vSwitch in the same subnet; ping works etc.

  8. Copied a 3.6GB ISO from WIN701 to WIN702 - 289MB/s (15 secs) - that’s fast, but I was expecting more throughput

  9. Created a 4GB RAM disk on each system

  10. Re-copied the above file; result: 360MB/sec
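A quick way to sanity-check the IB side after the reboot (a sketch using standard ESXi 5.x commands; adjust the module and vSwitch names to your own host). Note that mtu=5 in partitions.conf corresponds to a 4K IB MTU, so the 2044 cap appears to be on the ESXi IPoIB side rather than in the subnet manager:

# Confirm the mtu_4k option actually stuck on the mlx4_core module
esxcli system module parameters list -m mlx4_core | grep mtu_4k

# The IPoIB uplinks should show up as vmnic_ib* with the link up at 20000Mbps
esxcfg-nics -l

# Check what MTU the vSwitch ended up with (2044 is the IPoIB cap on ESXi 5.1)
esxcli network vswitch standard list | grep -i mtu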


I was expecting much quicker copy rates than this, especially via the RAMdisk - are there any areas you can suggest I look at, as this is not performing at the level I’d expect?

Thanks

Nobody got any input at all?

If it’s the hardware, HCA cards etc., I can buy newer, but I wanted to test this before considering buying a Voltaire 4036 GD - otherwise I’m kinda stuck as to how to improve this.

Hi Ingvar

Thank you very much for your response - you are getting some nice speeds there. Would you mind sharing the iperf command you are using to test this and also describing your VMware configuration (virtual machine network / vSwitch and VM config), purely so I can emulate your setup?

I was disappointed with the disk speeds, but I will work on that as it was only a first test; first, however, I would like to verify that the setup is correct, and so far the RAMdisk result suggests it is not.

Thanks once again

Hi. We also noted that it is not possible to use an MTU of more than 2044 on the ESXi host in vSphere 5.1 U1.

To compare your performance with ours:

  • 2 Linux VMs on 2 separate ESXi hosts

  • MTU size 2044 on the virtual NICs and on both the vmknics (IPoIB) on the hosts

  • iperf as the testing tool (since it does not involve disk access, it just shuffles data in memory)

  • The speed we got was about 9-9.5 Gbit/s

  • If using MTU=1500, the speed dropped to approx. 7 Gbit/s

A vMotion of a VM typically took 4 sec.

Note that the speed gained is of course dependent on the type of hardware you have (CPU speed, #cores, PCI bus type etc.). In our setup we have QDR speed on the IB switches.

As for IPoIB, you should get at least 10-12 Gb/s on a physical Linux box (4K MTU).

One more thing: if you test reads from a server/NAS using disk access, it is good advice to skip the WRITE operations on the receiving end.

Just output the result to /dev/null like:

(the command will find all files, execute ‘cat’ and redirect the output to /dev/null; point it at a location with LARGE files like ISOs/DVDs to get the best result)

NFS-mount a share from the NAS at /mnt/remote (the full sequence is sketched below)

cd /mnt/remote/

find . -type f -print -exec sh -c 'cat "$1" > /dev/null' sh {} \;
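Putting that together, the whole read test is roughly as follows (the NAS host name and export path are placeholders for your own setup; the dd line is just an alternative way to get a single MB/s figure from one large file):

# NFS-mount the NAS export (placeholder host and export names)
mount -t nfs nas01:/export/media /mnt/remote
cd /mnt/remote/

# read every file and discard the data, so only the read path is measured
find . -type f -print -exec sh -c 'cat "$1" > /dev/null' sh {} \;

# or read one large ISO and let dd report the throughput when it finishes
dd if=./some-large.iso of=/dev/null bs=1M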

I used this measure to verify the speed from my NAS, which could not run iperf.

Regards, Ingvar

In the first iperf test run you got 7.83 Gb/s, which isn’t that bad.

The other two tests, using the “-M” and “-G” switches, I haven’t tried (-M = TCP maximum segment size), but what is the “-G” option?

We have two IB ports from each ESXi host connected to the switches; I’m unsure if that helps to speed things up when just running one VM guest on the host (standard Round Robin set-up on the dvSwitch uplinks). I guess all traffic from a single test/TCP connection will use the same vmknic.

For disk access in this setup/test we actually used an NFS mount to a NAS with 12 disks and 2 x 10Gb Ethernet interfaces, connected via an MX6036 with an Ethernet-to-IB gateway. So it was not connected directly with IB interfaces.

When doing the read test (described in my previous post: ‘find … cat > /dev/null …’) from a VM guest we got about 3-4 Gb/s. I think we hit the performance ceiling of this specific NAS. We could have run IOMeter of course to verify the speed, but currently we have no Windows boxes installed for this test (I have never played with the Linux version of IOMeter).

Another option would be to run an NFS NAS server on IPoIB (don’t have one) or SRP (don’t have that either) to speed things up.

I guess SRP would be the best option since it does not use the IP stack at all, just RDMA directly to the target. I have no experience at all with SRP, but it is mentioned in the release notes for 1.8.1.

Anyone else with experience of SRP on VMware willing to share?

SRP on vSphere 5.1 and 5.5 shows throughput of about 3.2-3.5GB/s with a single QDR CX-1 HCA.

The go-to place for iperf is

http://iperf.fr/

(win/linux/macOS/solaris and source code download)

There is no ESXi version available, so that’s why we ran our perf tests from a VM guest instead.

An iperf test always takes two boxes, one running as the server and the other as the client.

Server side:

iperf -s

Client side

iperf -c <server_ip>

You can add the following options (see the combined example below the list):

-t 100 run for 100 seconds

-i 5 print the result every 5 seconds

-d run a bidirectional test

-P 2 run 2 parallel streams (just pick any number, the default is 1)
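So a typical run combining those options might look like this (the server address is just whichever box you started the server on):

# on the server box
iperf -s

# on the client box: run for 100 seconds, report every 5, with 2 parallel streams
iperf -c <server_ip> -t 100 -i 5 -P 2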

Remember to turn off / adjust the firewall settings to allow incoming traffic on TCP port 5001.
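On a CentOS 6 guest like the ones used here, that could be done, for example, like this (iptables shown purely as an illustration; adjust to your own firewall setup):

# allow iperf's default TCP port through iptables
iptables -I INPUT -p tcp --dport 5001 -j ACCEPT

# or simply stop the firewall for the duration of the test
service iptables stop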

Linux

If you run a CentOS/Fedora-style distro, add the EPEL repository and install iperf from there.

http://www.rackspace.com/knowledge_center/article/installing-rhel-epel-repo-on-centos-5x-or-6x
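Once the EPEL repository has been added as described in that article, the install itself should just be:

# both iperf and nload come from EPEL on CentOS 6
yum install -y iperf nload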

There you can also find “nload”, which is a nice tool for checking the current network I/O.

Start nload with

nload ib0 (if a physical box using IPoIB)

or

nload eth0 (if a vmGuest using eth0 as first nic)

For our VMware setup, we have a blade chassis system with built-in Mellanox InfiniScale IV QDR switches and a 4036 on top.

The cabling is QDR on QSFP.

The HCAs are Mellanox MT26428 [ConnectX VPI - 10GigE / IB QDR, PCIe 2.0 5GT/s]

The ESXi hosts have 2 CPUs with 8 cores / 2 threads each, a total of 32 logical processors (E5-2670).

ESXi 5.1.0 build 799733

IPoIB nics configured as uplinks to a dvSwitch

The VM guests are 2 vCPU / 2GB RAM, installed with CentOS 6.4 64-bit.

Just let me know if there are any other VMware-specific settings/values to compare. We just installed the Mellanox OFED for VMware according to the manual.

Here is another test using Linux VMs:

[root@ib-lnx-01 ~]# rsync -av --progress /tmp/SM-6.3.2.0.632023-e50-00_OVF10.ova root@192.168.0.116:/tmp

root@192.168.0.116's password:

sending incremental file list

SM-6.3.2.0.632023-e50-00_OVF10.ova

3370676736 100% 217.62MB/s 0:00:14 (xfer#1, to-check=0/1)


sent 232340 bytes received 464511 bytes 16019.56 bytes/sec

total size is 3370676736 speedup is 4837.01

So I’m getting less than the performance of a 2Gbps connection - does anyone have any ideas, or is this the limit of the ESXi driver / IPoIB implementation with my hardware?

Hi Ingvar

I really appreciate you taking the time out to post what you have above - thanks, I will try this later once I finish work for the day and come back with my results.

I will use the same distro as you (good choice btw :-) ), but I will not be using a distributed switch, though that should not make any difference.

I will post again later.

Thanks

OK, now that’s interesting - using iperf I am seeing the following performance:

iperf -c 10.0.0.20

Interval: 0.0-10.0 sec

Transfer: 9.12 GBytes

Bandwidth: 7.83 Gbits/sec

iperf -c 10.0.0.20 -G

Interval: 0.0-10.0 sec

Transfer: 9.08 GBytes

Bandwidth: 0.91 GBytes/sec

iperf -c 10.0.0.20 -M

Interval: 0.0-10.0 sec

Transfer: 9330 MBytes

Bandwidth: 933 MBytes/sec

This is between two independent ESXi 5.1 hypervisors hosting two independent VMs:

VM #1 (Server) - CentOS x64 6.4 // 2 vCPUs (4GHz each) // 2GB RAM (DDR3 PC3-10666C9 1333MHz Dual Channel)

VM #2 (Client) - CentOS x64 6.4 // 2 vCPUs (3.2GHz each) // 2GB RAM (DDR3 PC3-10666C9 1333MHz Dual Channel)

ESXi Config:

Dedicated vSwitch with a Virtual Machine Network (no routing), containing vmnic_ib0 (20000 Full, MTU=2044)

ESXi 1 & 2: ConnectX HCAs with CX4 cables into the Voltaire GridDirector ISR9024D (1 x uplink per ESXi host)

Not sure what kind of performance ceiling I should expect from this kind of setup / test, however - it looks about right, but I’m disappointed with the disk performance and need to work more on that. What kind of performance did you get with your NAS? How many spindles / what speeds / sizes etc.? (if you don’t mind me asking) - thanks!