Is this the best our FDR adapters can do?

We have a small test setup, described below. I have run some ib_write_bw tests and got “decent” numbers, but not as fast as I anticipated. First, some background on the setup:

Two 1U storage servers each have an EDR HCA (MCX455A-ECAT). The other four servers each have a ConnectX-3 VPI FDR 40/56Gb/s mezzanine HCA (http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX3_VPI_Card_Dell.pdf), OEMed by Mellanox for Dell. Their firmware version is 2.33.5040. This is not the latest (2.36.5000, according to hca_self_test.ofed), but I am new to IB and still getting up to speed with Mellanox’s firmware update tools. The EDR HCA firmware was updated when MLNX_OFED was installed.

All servers:

CPU: 2 x Intel E5-2620 v3, 2.4 GHz, 6 cores/12 HT

RAM: 8 x 16 GiB DDR4 1866 MHz DIMMs

OS: CentOS 7.2 Linux … 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

OFED: MLNX_OFED_LINUX-3.3-1.0.4.0 (OFED-3.3-1.0.4)

A typical ib_write_bw test:

Server:

[root@fs00 ~]# ib_write_bw -R


* Waiting for client to connect... *

RDMA_Write BW Test
Dual-port       : OFF      Device         : mlx5_0
Number of qps   : 1        Transport type : IB
Connection type : RC       Using SRQ      : OFF
CQ Moderation   : 100
Mtu             : 2048[B]
Link type       : IB
Max inline data : 0[B]
rdma_cm QPs     : ON
Data ex. method : rdma_cm

Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP

local address:  LID 0x03 QPN 0x01aa PSN 0x23156
remote address: LID 0x05 QPN 0x4024a PSN 0x28cd2e

#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
65536   5000         6082.15          6081.07             0.097297


Client:

[root@sc2u0n0 ~]# ib_write_bw -d mlx4_0 -R 192.168.111.150


RDMA_Write BW Test
Dual-port       : OFF      Device         : mlx4_0
Number of qps   : 1        Transport type : IB
Connection type : RC       Using SRQ      : OFF
TX depth        : 128
CQ Moderation   : 100
Mtu             : 2048[B]
Link type       : IB
Max inline data : 0[B]
rdma_cm QPs     : ON
Data ex. method : rdma_cm

local address:  LID 0x05 QPN 0x4024a PSN 0x28cd2e
remote address: LID 0x03 QPN 0x01aa PSN 0x23156

#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
65536   5000         6082.15          6081.07             0.097297


Now, 6082 MB/s ≈ 48.65 Gbps. Even taking into account the 64b/66b encoding overhead, I would expect over 50 Gbps. Is this the best this setup can do, or is there anything I can do to push the speed up further?
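For reference, a rough sketch of that arithmetic (assuming 4x FDR signaling at 14.0625 Gb/s per lane with 64b/66b encoding, and taking the reported MB/sec column at face value):

    # Expected FDR payload rate vs. the ib_write_bw number above.
    fdr_signal_gbps = 14.0625 * 4                  # 4x FDR: 56.25 Gb/s on the wire
    fdr_data_gbps = fdr_signal_gbps * 64.0 / 66.0  # 64b/66b encoding -> ~54.5 Gb/s payload
    measured_gbps = 6082 * 8 / 1000.0              # 6082 MB/s taken literally -> ~48.7 Gb/s
    print("expected %.1f Gb/s vs measured %.1f Gb/s" % (fdr_data_gbps, measured_gbps))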

I look forward to hearing experiences and observations from the more experienced camp. Thanks!

Thanks for sharing your experience. I did the following:

[root@sc2u0n0 ~]# dmidecode |grep PCI
        Designation: PCIe Slot 1
        Type: x8 PCI Express 3 x16
        Designation: PCIe Slot 3
        Type: x8 PCI Express 3

lspci -vv
[...]
02:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
[...]
        LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
                ClockPM- Surprise- LLActRep- BwNot-

So the theoretical speed should be 8 GT/s per lane x 8 lanes x 128b/130b (https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_3.0) ≈ 63 Gbps. In fact, we just did an fio sweep using fio-2.12 (a sketch of the invocation follows the results below). The reads are quite reasonable; we are now investigating why the writes are so low.
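The same arithmetic in a couple of lines (assuming PCIe 3.0 at 8 GT/s per lane, an x8 link, and 128b/130b encoding):

    # Theoretical PCIe 3.0 x8 throughput per direction.
    pcie_payload_gbps = 8.0 * 8 * 128.0 / 130.0   # 8 GT/s x 8 lanes x 128b/130b
    print("PCIe 3.0 x8: %.1f Gb/s" % pcie_payload_gbps)   # ~63.0 Gb/s, roughly 7.9 GB/s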

A. Read test results

  • Chunk size = 2 MiB
  • Num. Jobs = 32
  • IO Depth = 128
  • File size = 500 GiB
  • Test time = 360 seconds
    Mode              Speed, Gbps   IOPS
    psync, direct     47.77         2986
    psync, buffered   24.49         1530
    libaio, direct    49.17         3073

B. Write test results

  • Chunk size = 2 MiB
  • Num. Jobs = 32
  • IO Depth = 128
  • File size = 500 GiB
  • Test time = 360 seconds
    Mode              Speed, Gbps   IOPS
    psync, direct     24.14         1509
    psync, buffered   9.32          583
    libaio, direct    22.51         1407
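For completeness, here is a rough sketch of how such a sweep might be driven. These are not our exact commands; the target path and job names are hypothetical, and only the parameters listed above (2 MiB blocks, 32 jobs, iodepth 128, 500 GiB file, 360 s per run, psync/libaio, direct/buffered) come from the tables:

    #!/usr/bin/env python
    # Hypothetical fio sweep driver -- a sketch only, not the exact commands used.
    import subprocess

    TARGET = "/mnt/test/fio.dat"  # hypothetical test file on the storage under test

    def run_case(rw, ioengine, direct):
        name = "%s-%s-%s" % (rw, ioengine, "direct" if direct else "buffered")
        cmd = [
            "fio",
            "--name=" + name,
            "--filename=" + TARGET,
            "--rw=" + rw,                # read or write
            "--bs=2M",                   # chunk size = 2 MiB
            "--numjobs=32",              # num. jobs = 32
            "--iodepth=128",             # IO depth = 128 (only meaningful for libaio)
            "--ioengine=" + ioengine,    # psync or libaio
            "--direct=%d" % direct,      # 1 = O_DIRECT, 0 = buffered page cache
            "--size=500G",               # file size = 500 GiB
            "--runtime=360",             # test time = 360 seconds
            "--time_based",
            "--group_reporting",
        ]
        subprocess.check_call(cmd)

    for rw in ("read", "write"):
        for ioengine, direct in (("psync", 1), ("psync", 0), ("libaio", 1)):
            run_case(rw, ioengine, direct)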

I think I have the answer now. It comes down to the prevalent and inconsistent use of MB vs. MiB across different software applications.

When I ran ib_write_bw with the --report_gbits flag, I did see over 50 Gbps. That got me curious, so I assumed the MB/sec output was actually MiB/s; then 6082 MiB/s ≈ 51.02 Gbps, as anticipated.
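The conversion that makes the numbers line up, assuming the perftest “MB/sec” column is really MiB/s as concluded above:

    # 6082 "MB/sec" interpreted as MiB/s.
    gbps = 6082 * 1024 * 1024 * 8 / 1e9
    print("%.2f Gb/s" % gbps)   # ~51.02 Gb/s, matching the --report_gbits output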

One thing to keep in mind is that you’ll hit the bandwidth of the PCIe bus.

I’ve not used the ib_write_bw test myself, but I’m fairly sure it’s not actually processing data, just accepting it and tossing it away, so it’s going to give a theoretical maximum.

In real-life situations that bus is going to be handling all the data in and out of the CPU, and on my oldest motherboards that maxes out at 25 Gb/s, which is what I hit with fio tests on QDR links. I’ve heard that with PCIe gen 3 you’ll get up to 35 Gb/s.

Generally, whenever newer networking tech rolls out, there is nothing a single computer can do to saturate the link unless it’s pushing junk data; the only way to really max it out is switch-to-switch (hardware-to-hardware) traffic.

Of course, using IPoIB or anything other than native IB traffic is going to cost you performance. In my case of NFS over IPoIB (with or without RDMA) I quickly slam into the bandwidth limit of my SSDs. The only exception I’ll have is the Oracle DB, where low latency is what I’m after, as the database is small enough to fit in RAM.