Windows ConnectX-3 High Latency

I am having trouble with latency on a Windows Server IB setup. I have read and tried many troubleshooting steps on various forums and the Mellanox Performance Tuning Guide but have had no luck in getting my setup to the low latency that I imagine the hardware is capable of. I am a complete beginner in the IB space so it is very possible I am missing some very basic config or concept.

I guess it is also possible that 0.2ms to 0.4ms ping latency and an additional 2.5ms of file access latency is to be expected; please let me know if my expectations are off.

Any advice would be greatly appreciated.

==== Hardware/Software Setup ====

HOST01:

Adapter: MCX354A-FCBT

Firmware: 2.40.7000

Motherboard: SuperMicro X8DT3

PCIe Slot: Gen2 8x (No Gen3 available)

OS: Windows Server 2016 DC

Mellanox Port Mode: IB

IP: 10.255.255.10

HOST02:

Adapter: MCX354A-FCBT

Firmware: 2.40.7000

Motherboard: SuperMicro X8DTN+

PCIe Slot: Gen2 8x (No Gen3 available)

OS: Windows Server 2016 DC

Mellanox Port Mode: IB

IP: 10.255.255.20

NETWORK:

Direct Connect (Back to Back): Mellanox MC2207130-002

==== VSTAT ====

Here are the vstat outputs for each host.

HOST01: COMMAND: vstat.exe

RESULT: https://pastebin.com/raw/aCk0Ltp4

HOST02: COMMAND: vstat.exe

RESULT: https://pastebin.com/raw/L0zWr7jk

==== Ping Times ====

The ping times seem very high (0.32ms to 0.45ms); in fact, they are the same as on my 1GbE connection.

HOST01: COMMAND: hrping.exe 10.255.255.20

RESULT: https://pastebin.com/raw/ubV0ZnsF

HOST02: COMMAND: hrping.exe 10.255.255.10

RESULT: https://pastebin.com/raw/QVTYKkXQ

Note: ibping produces slightly better times of 0.19ms to 0.24ms.

==== Throughput ====

The throughput seems “fine”, I guess. I have read various sources saying that a 56Gb/s link is limited to lower real-world throughput for various reasons. In any case, I am not too concerned with throughput since I am focused on IOPS.

HOST01: COMMAND: ntttcp.exe -r -m 28,*,10.255.255.10 -rb 2M -a 16 -t 5

RESULT: https://pastebin.com/raw/RmQSBL2G

HOST02: COMMAND: ntttcp.exe -s -m 28,*,10.255.255.10 -l 512K -a 2 -t 5

RESULT: https://pastebin.com/raw/djsVFs8R

==== IOPS ====

So, finally to my actual problem. I have a disk on HOST01 that locally has ~90K IOPS with 0.3ms latency but over the network it is down to ~10K IOPS and up to 3.0ms+ latency.

HOST01 (Local): COMMAND: diskspd.exe -b8K -d30 -o4 -t8 -h -r -w0 -L -Z1G -c20G x:\share\iotest.dat

RESULT: https://pastebin.com/raw/hikPjDQs

HOST02 (Over IB): COMMAND: diskspd.exe -b8K -d30 -o4 -t8 -h -r -w0 -L -Z1G -c20G \\10.255.255.10\shared\iotest.dat

RESULT: https://pastebin.com/raw/e07Ajx1i

Hi Sam,

At first glance I see two issues with your setup: your PCIe slots and the configuration of your diskspd tests.

Maximum throughput on a PCIe 2.0 x8 is ~32Gbit/s

PCIe 3.0 x8 is roughly double at ~64Gbit/s, so your PCIe 2.0 x8 slots are going to be the bottleneck when trying to push 40Gb/s, let alone 56Gb/s.
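For reference, the rough math behind those numbers: PCIe 2.0 runs 5 GT/s per lane with 8b/10b encoding, while PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding, so:

PCIe 2.0 x8: 8 lanes x 5 GT/s x 0.8 (8b/10b) = 32 Gbit/s usable

PCIe 3.0 x8: 8 lanes x 8 GT/s x 128/130 = ~63 Gbit/s usable

Both figures are before any additional PCIe protocol overhead, so real throughput will be a bit lower again.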

As for latency, 0.3ms on the network is fine - I see similar latency to yours.

But for the diskspd tests you’ll need to provide more information: are jumbo frames enabled? Is RDMA enabled? Also, are you writing to the same type of disk in both tests? It seems like it’s more an issue with the SMB configuration on the host than anything to do with the adaptors.
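If it helps, most of that can be checked from an elevated PowerShell prompt on each host. The "Jumbo Packet" display name below is what I'd expect for the Mellanox driver, so treat it as a guess - yours may be labelled slightly differently:

# Is RDMA enabled per adapter?
Get-NetAdapterRdma

# Current jumbo frame value (display name may differ by driver version)
Get-NetAdapterAdvancedProperty -DisplayName "Jumbo Packet"

# Does SMB see the interfaces as RDMA capable on the client and server side?
Get-SmbClientNetworkInterface
Get-SmbServerNetworkInterface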

If you want to have a look at adaptor performance, I’d recommend using a tool such as iPerf.
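For example, with iperf3 (the flags are standard iperf3; the only assumption is that you can find a Windows build):

HOST01: iperf3.exe -s

HOST02: iperf3.exe -c 10.255.255.10 -P 4 -t 30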

I didn’t realize Gen2 vs Gen3 made that much of a difference to throughput. Thanks for pointing that out, but in this case I think I’m going to have to leave it, since the servers are already in place and upgrading motherboards is out of scope for this project at this point. Still, it is good to know even more performance can be had.

As for your questions and suggestions:

1.) Jumbo Frames are enabled at 4092 on each adapter.

2.) RDMA is enabled and working - verified by using perfmon and adding the RDMA counters (a quick PowerShell check for this is shown after this list).

3.) The disk is the same exact disk on both tests - just shared over SMB on the remote test and accessed via drive letter on the local test.

4.) I did new tests using iPerf, but as far as I can tell it cannot test IOPS or latency. I did my original tests with NTTTCP because that is what was recommended in various Mellanox documents for testing on Windows. In any case, both iPerf and NTTTCP show the same result of just under 30Gbps, which would seem to be the Gen2 x8 limit minus some overhead.
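(For reference on 2.), a quick way to confirm from HOST02 while a test is running that the SMB session is actually using RDMA, rather than having fallen back to plain TCP, is:

# Should list the 10.255.255.x connection and show it as RDMA capable
Get-SmbMultichannelConnection

# Confirms the share and the SMB dialect in use (SMB Direct needs a 3.x dialect)
Get-SmbConnection | Select-Object ServerName,ShareName,Dialect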

On the latency subject, I am a bit confused, and it very well could be that I am just missing something. You say that the 0.3ms latency is fine, but that is the same latency I get with my standard 1GbE NICs. I was under the impression that IB had much lower latency than standard 1Gb Ethernet. Is ping not the correct way to see the lower latency that IB offers?

Hi Sam,

Diskspd is fine for testing IOPS. I actually skipped over your NTTTCP results and only saw them after I posted, but they look as expected. As for testing the lowest latency possible, ping probably isn’t the best tool for this. Something like netperf, if you can find a Windows port of it, is a good tool for getting meaningful numbers. Here is a good article to read: http://www.mellanox.com/related-docs/whitepapers/HP_Mellanox_FSI%20Benchmarking%20Report%20for%2010%20&%2040GbE.pdf
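If you do track down a Windows port, the usual netperf latency-style test is TCP_RR (request/response): run netserver on one host and netperf on the other. The switches below are standard netperf; the .exe naming and the assumption that a Windows port keeps them are mine:

HOST01: netserver.exe

HOST02: netperf.exe -H 10.255.255.10 -t TCP_RR -l 30

TCP_RR reports transactions per second for 1-byte request/response pairs, so average round-trip latency is roughly 1 / (transactions per second).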

I’ve run a diskspd test over SMB to another server with an underlying iSCSI disk (so essentially going out two NICs) and latency is below 1ms at the 99th percentile.