Fluctuating speeds on 40Gb Mellanox

I finally got my MCX353A-FCBT ConnectX-3 cards to link at 40Gb, but I am having some fluctuating speed issues and was hoping I could get some help figuring this out.

System 1 (PC): Threadripper 1950X, 64 GB RAM, X399 motherboard, Samsung 970 Evo Plus M.2, Windows 10 Education (basically Enterprise). The Mellanox card is installed in a PCIe x16 slot.

System 2 (Server): Xeon W-2102, 16 GB RAM, Supermicro X11SRA-F motherboard, Samsung 970 Evo Plus M.2, Windows Server 2019. The Mellanox card is installed in a PCIe x16 slot. LSI RAID10 with eight 6 TB SAS drives.

Points of interest:

I am using Ethernet mode instead of Infiniband.

The latest firmware and drivers are installed.

I have confirmed that RDMA is enabled.
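For reference, here is roughly how I checked it (PowerShell, run on both machines; `Get-SmbServerNetworkInterface` only applies on the server side):

```powershell
# Check the NIC driver reports RDMA as enabled
Get-NetAdapterRdma

# Check SMB sees the interface as RDMA-capable (client side)
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable

# Same check from the SMB server side (on the Server 2019 box)
Get-SmbServerNetworkInterface | Format-Table FriendlyName, RdmaCapable
```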

Going from the server to the PC, I get a consistent ~11 Gbit/s transferring a 20 GB file.

Going from the PC to the server, I get bursts of about 25 Gbit/s for roughly half of the transfer, then it drops to a complete stop, resumes at full speed again, and repeats this pattern until the transfer finishes. Copying from M.2 to M.2, or from the M.2 (PC) to the RAID10 array, shows the same issue.


I have been tweaking some things in the card's configuration (jumbo packets, interrupt moderation, send and receive buffers, large send offload, etc.) without any luck.
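For anyone following along, those tweaks can also be applied from PowerShell instead of Device Manager. A sketch — the adapter name is a placeholder, and the display names can vary by driver version:

```powershell
# List the tunables the driver exposes for this adapter
Get-NetAdapterAdvancedProperty -Name "Ethernet 3"

# Example: enable jumbo packets (set on both machines, same value)
Set-NetAdapterAdvancedProperty -Name "Ethernet 3" `
    -DisplayName "Jumbo Packet" -DisplayValue "9014"
```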

It feels like a buffering issue of some kind, but I am not having any luck pinning it down.

Any help would be greatly appreciated.


It has come to my attention (please correct me if I am wrong) that Windows will not use RDMA with the standard Windows file transfer utility. This might also explain why my LAN transfer graph and server memory graph seem to buckle at the same time. See above.

You seem to mix up a couple of test components in your description, so I'm not sure I fully understand exactly what test you're trying to run:

Is it a network-performance test or a read/write copy test against the RAID10 array?

Is it over RDMA, or over TCP/UDP? What test tool are you running?

Assuming you are not using a switch in between, but a compatible Mellanox fiber or copper cable connecting the ConnectX-3 adapters back-to-back between the PC and the server, here is what I suggest:

  • Start with performance fine-tuning of both Win10 and Win2019 per Mellanox best practices; use the guidance in the WinOF User Manual.


  • Next, run an RDMA test (nd_read_bw) between the CX-3 adapters of the PC and the Win2019 server to first ensure you achieve optimum network performance, ~40 Gb/s, between PC and server. More on the RDMA tests can be found in the User Manual.
  • Run an NTttcp test to ensure you have optimum TCP/UDP network performance, ~35-36 Gb/s (usually lower than RDMA performance).


  • If all of the above tests show good, expected network performance, you should now be able to get optimum copy performance between the initiator (PC) and the RAID10 storage target. Use the IOmeter for Windows tool:
  • http://www.iometer.org/doc/downloads.html
  • Note the difference between network performance and copy performance: copy performance also depends on proper configuration of the storage RAID subsystem.
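As a concrete example, the RDMA bandwidth test is run on the server side first, then from the client against the server's address. The IP here is a placeholder, and the exact flag syntax may differ between WinOF versions, so verify it against the User Manual for your release:

```powershell
# On the Win2019 server (waits for the test to connect):
nd_read_bw -S 192.168.100.2

# On the PC, pointing at the server's address:
nd_read_bw -C 192.168.100.2
```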
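Similarly for NTttcp, start the receiver side first. The IP and thread count are examples; `-m` maps threads to CPUs and the receiver's address, and `-t` is the test duration in seconds:

```powershell
# On the receiver (e.g. the server):
ntttcp.exe -r -m 8,*,192.168.100.2 -t 60

# On the sender (the PC), same mapping, same receiver IP:
ntttcp.exe -s -m 8,*,192.168.100.2 -t 60
```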

Hope this helps

Thanks for responding Avi.

I am learning, albeit slowly, that my standard 1 Gb networking skills are proving mostly useless with Mellanox. That said, I thought it would be a bit easier getting these NICs working.

My previous post was about copying (TCP) from one M.2 to another, directly connected with no switch. Going from an M.2 to my RAID10 made no difference. What I have found out is that my limited RAM caused the 20 GB file transfer to pause/stall. I have since installed 128 GB of RAM, but Windows Server 2019 only allows approximately 50% of RAM for the disk write cache, and Win10 about 10%. So the good news is, I get ultra-fast transfers if the file size is approximately 50 GB or less.
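The math behind that roughly works out, assuming those cache percentages hold (just a back-of-the-envelope sketch, not measured values):

```shell
# Rough write-cache ceilings given the percentages above
ram_gb=128
server_cache=$((ram_gb * 50 / 100))   # Server 2019: ~50% of RAM
client_cache=$((ram_gb * 10 / 100))   # Windows 10: ~10% of RAM
echo "server cache ~${server_cache} GB, client cache ~${client_cache} GB"
# prints: server cache ~64 GB, client cache ~12 GB
```

So a file up to roughly that server-side cache size can land in RAM at line rate before the slower disk flush ever becomes the bottleneck, which matches the ~50 GB cutoff I'm seeing.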

I have yet to get RDMA working, but I am trying to learn how to get it configured.