Low throughput on ConnectX-3 on both Linux and Windows

Hi all,

I am new to the Mellanox community and would appreciate some help and advice. I have two ConnectX-3 adapters (MCX353A-FCBT) connecting two systems, and I am not getting the speeds I believe I should be getting. Both cards are in Ethernet mode, and I want to use RoCE.

  • The first system is a Dell R620 with 2 x E5-2660 CPUs, 192 GB of RAM, and 2 x 1 TB 850 EVO SSDs in RAID 1.

  • The second system is a custom-built ITX system with an i7-8700K, 32 GB of RAM, 1 x Samsung 970 EVO M.2 NVMe drive, and 1 x 1 TB 850 EVO SSD.

Both systems are dual boot, so I can use Windows Server 2016 or Linux (CentOS 7) at any given point.

For Windows, I used PowerShell to make sure that SMB was enabled on both cards and that both RDMA Capable and RSS Capable reported true. I then followed the Mellanox guide to configuring RoCE below:

https://community.mellanox.com/s/article/howto-configure-smb-direct--roce--over-pfc-on-windows-2012-server

The only change I made was to remove the ETS bandwidth limiter, as it isn’t necessary. Everything else was followed as written and adapted where necessary (e.g., a different name for the card in Windows).

For Linux, I followed the steps on Red Hat’s site for configuring RoCE (I’m using CentOS 7 on both systems, so the steps are the same):

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/networking_guide/index#sec-Tranferring_Data_Using_RoCE

However, I get the same results as I do in Windows.

For the card info:

  • Both cards are the same model number and have the same firmware (2.42.500)
  • Both are in Ethernet mode
  • Both are in PCIe 3.0 x8 slots
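As a rough sanity check on the slot itself, the sketch below estimates the usable bandwidth of a PCIe 3.0 x8 link from the standard figures of 8 GT/s per lane and 128b/130b line encoding. It ignores packet and protocol overhead, so real throughput will be somewhat lower, and the function name is only illustrative.

```python
def pcie3_usable_gbps(lanes):
    """Approximate PCIe 3.0 bandwidth in Gbit/s after 128b/130b encoding.

    Assumption: packet (TLP) and flow-control overhead are ignored,
    so this is an upper bound rather than a measured figure.
    """
    gigatransfers_per_lane = 8.0      # PCIe 3.0 signalling rate per lane
    encoding_efficiency = 128 / 130   # 128b/130b line encoding
    return gigatransfers_per_lane * lanes * encoding_efficiency


link_gbps = pcie3_usable_gbps(8)
print(f"PCIe 3.0 x8: about {link_gbps:.1f} Gbit/s ({link_gbps / 8:.2f} GB/s)")
print(f"Headroom over a 40 Gbit/s Ethernet port: about {link_gbps - 40:.1f} Gbit/s")
```

Under these assumptions an x8 Gen3 slot has more bandwidth than a single 40 Gbit/s port needs, provided the link actually negotiated Gen3 x8.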

I can provide any outputs necessary from commands/screenshots of any testing that can be done, so long as I know what I need to provide. I’m just at a standstill and am trying to figure out where the problem lies or if I have a bottleneck somewhere.

Any help is greatly appreciated.

Thank you in advance.

Hi Martin,

I am not sure which tool or application you are using to test performance, or which protocol is involved.

In General:

Linux-Linux:

To measure TCP/UDP performance between two Mellanox ConnectX-3 adapters on Linux platforms, our recommendation is to use the iperf2 tool. To test RDMA performance, use the “ib_write_bw” test.

Suggestions:

  • Ensure you’ve installed the latest MLNX_OFED driver & latest firmware.

  • Implement the “performance tuning” on driver / firmware / HW parameters, as per Mellanox best practice:

https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters
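For reference, a minimal sketch of how the two Linux tests are typically launched, written as a small helper that only prints the command lines. The address 10.0.0.1, the device name mlx4_0, and the stream count and duration are assumptions for illustration, not values from this thread; check the real device name with ibv_devices before running anything.

```python
def iperf2_commands(server_ip, streams=4, seconds=30):
    """Return (server, client) argument lists for an iperf2 TCP test."""
    server = ["iperf", "-s"]
    client = ["iperf", "-c", server_ip,
              "-P", str(streams),   # number of parallel client streams
              "-t", str(seconds),   # test duration in seconds
              "-i", "5"]            # report interval in seconds
    return server, client


def ib_write_bw_commands(server_ip, device="mlx4_0"):
    """Return (server, client) argument lists for a perftest ib_write_bw run."""
    common = ["ib_write_bw", "-d", device, "--report_gbits"]
    return common, common + [server_ip]


if __name__ == "__main__":
    # 10.0.0.1 is a hypothetical address for the machine running the server side.
    for label, cmds in (("iperf2", iperf2_commands("10.0.0.1", streams=8)),
                        ("ib_write_bw", ib_write_bw_commands("10.0.0.1"))):
        server, client = cmds
        print(f"{label} server: {' '.join(server)}")
        print(f"{label} client: {' '.join(client)}")
```

Start the server command on one host first, then the client command on the other. Several parallel iperf2 streams are usually needed before a 40 Gbit/s link looks anything like saturated.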

Windows-Windows:

To measure performance between two Mellanox ConnectX-3 adapters on Windows platforms, our recommendation is to use the NTttcp tool rather than iperf2.

Suggestions:

  • Ensure you’ve installed the latest WinOF driver & latest firmware.

  • Implement the “performance tuning” on driver / firmware / HW parameters, as per Mellanox best practice, using the guidance in the WinOF User Manual: https://docs.mellanox.com/display/WINOFv55052000/Introduction

  • Run the “NTttcp” test to ensure you have optimum TCP/UDP performance.

https://www.interfacett.com/blogs/performance-testing-and-monitoring-using-free-tool-ntttcp-from-microsoft/
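As an illustration of the NTttcp step, the sketch below builds a matching receiver and sender command pair. The receiver address 10.0.0.1, the thread count, and the duration are assumed values, not settings from this thread. The thread mapping passed to -m should be the same on both sides, and both sides take the receiver's IP address.

```python
def ntttcp_commands(receiver_ip, threads=8, seconds=15):
    """Return (receiver, sender) argument lists for an NTttcp TCP test.

    The -m mapping is "threads,cpu,ip"; "*" leaves the CPU choice to
    the tool, and the IP is the receiver's address on both sides.
    """
    mapping = f"{threads},*,{receiver_ip}"
    receiver = ["ntttcp.exe", "-r", "-m", mapping, "-t", str(seconds)]
    sender = ["ntttcp.exe", "-s", "-m", mapping, "-t", str(seconds)]
    return receiver, sender


if __name__ == "__main__":
    # 10.0.0.1 is a hypothetical address for the receiving host.
    receiver, sender = ntttcp_commands("10.0.0.1")
    print("receiver:", " ".join(receiver))
    print("sender:  ", " ".join(sender))
```

Generating both commands from one mapping string keeps the sender and receiver thread counts in step, which is worth ruling out first if a multi-threaded run appears to stall.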

Best Regards,

Chen

Hi Chen.

Thank you for your response. I apologize that I was not clear enough about which tools I am using for testing and how.

I seem to have partially resolved my issue by resetting the card configuration and removing the settings applied by the PowerShell script from the Windows guide linked above. My speeds are now better and more consistent, but still not where I would consider them acceptable.

I also changed the performance tuning profile in Windows from the default, Balanced, to Single Port Mode on both cards. The difference was negligible, within the margin of error, so there was no noticeable change in performance.

Here is what I am using to test my throughput and transfer speeds:

ATTO Disk Benchmark (across both servers and all drives + RAM disk)

CrystalDiskMark (across both servers and all drives + RAM disk)

iperf2 2.0.9 (on both OSes)

NTttcp (Windows)

ib_send_bw (Linux)

ib_send_lat (Linux)

As you specified, I added the ntttcp test for Windows.

I previously ran the Linux tests and did use ib_send_bw. I recall that ib_send_lat came back with an error, but I haven’t looked into it yet. I’ll add some outputs later in the week for the Linux testing. I will also post CrystalDiskMark results around the same time.

For now, I’ve attached a text file with the ntttcp output from the R620. Single-threaded tests run fine, but the results are effectively 10 Gb/s speeds (which is to be expected, since the cards need multiple threads to reach higher throughput). When I try to run a multi-threaded test, however, it hangs after printing the version information. I’ve tried adjusting some of the values in the command, but I get the same result. That file also includes a second test with modified values (an asynchronous test with a 2m size) for comparison.

The next item is the ATTO benchmarking from both servers. I’ve omitted the RAM disk results, as I ran out of time to test them today. However, I’ve included screenshots of both instances against their respective drives. Going from the ITX to the R620 yielded good results on the RAID 1 SSDs, but much slower results on the RAID 5 SSDs (which is to be expected, as RAID 5 is typically very slow). I’ve yet to test a RAID 10 configuration to see what performance I get (which I would assume falls between the two RAID levels).

The biggest problem I see is going from the R620 to the ITX system: it’s pretty slow, as the screenshot shows. It’s almost as if I’m running the tests locally and getting the same results. I checked that the ITX system’s settings are identical to the R620’s, and they are, so this is where I am stuck, trying to figure out whether I’m running into a bottleneck of some sort.

Again, I’ve attached the output and screenshots for your reference. If there’s anything you would like for me to test, let me know and I’ll be happy to do it.

Thank you.

Hi Chen,

As an update, I was finally able to perform the test on Linux.

Attached is the output of the ib_write_bw test as you requested.

Based on the test, throughput is getting closer to the 40 Gb/s mark. The average is about 32 Gb/s (32 Gb/s / 8 = 4 GB/s). However, I would like to be somewhat above 35 Gb/s (about 4.4 GB/s), if possible.
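The bit-to-byte conversion is easy to get slightly wrong, so here is a small sketch of the arithmetic, assuming decimal units (1 GB/s = 8 Gbit/s) and ignoring protocol overhead. The function name is only illustrative.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte, decimal units)."""
    return gbit_per_s / 8


for rate in (32, 35, 40):
    print(f"{rate} Gbit/s = {gbit_to_gbyte_per_s(rate):.3f} GB/s")
```

On these numbers, 32 Gbit/s is 4 GB/s, 35 Gbit/s is 4.375 GB/s, and the 40 Gbit/s line rate is 5 GB/s.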

I’ve already tuned the OS as best I can. I am using the high throughput profile on both ends; however, my transfer speeds are still relatively low when copying files. I have created a Samba share on both ends in Linux. The ITX server is running CentOS 7 with a GUI, while the R620 is on CentOS 7 minimal.

I’m able to catch a glimpse of the transfer rates while copying, and they are still about 500 MB/s, which is the speed of the SSDs.

So it seems that the bottleneck is still either the drives or maybe something within the configuration.
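A simple way to reason about this is that an end-to-end file copy can go no faster than its slowest stage. The sketch below applies that to the two figures reported in this thread, the roughly 32 Gbit/s ib_write_bw result and the roughly 500 MB/s observed on the SSDs. The stage names and the two-stage model are simplifying assumptions: a real Samba copy also involves the source disk, CPU, and protocol overhead.

```python
def find_bottleneck(stages_mb_per_s):
    """Return (name, rate) of the slowest stage in a transfer pipeline."""
    name = min(stages_mb_per_s, key=stages_mb_per_s.get)
    return name, stages_mb_per_s[name]


stages = {
    "RoCE link (ib_write_bw, ~32 Gbit/s)": 32_000 / 8,  # 4000 MB/s
    "SATA SSD (observed copy rate)": 500,               # MB/s
}

name, rate = find_bottleneck(stages)
print(f"Slowest stage: {name} at about {rate:.0f} MB/s")
```

With these numbers the link is about eight times faster than the disk, which matches the copies levelling off near 500 MB/s. Testing between RAM disks or the NVMe drive would take the SATA SSD out of the path.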