Bad RoCEv2 throughput with ConnectX-5

My setup is rather simple: Host (ConnectX-5) === (1×100G port) switch (4×25G ports) === 4 NVMe-oF targets

Both the 100G (QSFP28 active fiber) and 25G (SFP28 copper) cables are from Mellanox.

The host has an i9-7940X (14C/28T) with 32 GB DRAM, running Ubuntu 18.04 (kernel 4.15.0-36). I also tried Ubuntu 17.10 with a slightly older kernel on the same PC; the issue remains the same.

When I run FIO against a single target, 4K IOPS is 675K; this is limited by the 25G link connecting that target.
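For context, a rough back-of-envelope calculation of the per-link ceiling (my own numbers, assuming one RoCEv2 packet per 4K I/O with MTU >= 4096 and roughly 90 bytes of per-packet wire overhead, which is an estimate):

```python
# Rough 4K IOPS ceiling of one 25G link, assuming a single RoCEv2 packet
# per 4K read completion and ~90 B of overhead per packet
# (Eth + IP + UDP + BTH + ICRC + FCS + preamble/IFG -- approximate).

LINK_BPS = 25e9   # 25 Gb/s line rate
PAYLOAD = 4096    # 4K block size, bytes
OVERHEAD = 90     # approximate per-packet wire overhead, bytes

iops_ceiling = LINK_BPS / ((PAYLOAD + OVERHEAD) * 8)
print(f"~{iops_ceiling / 1e3:.0f}K IOPS per 25G link")
```

The observed 675K is about 90% of that ceiling, which is consistent with the single-target case being link-limited rather than host-limited.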

When I run FIO against 2 targets in parallel, the aggregated IOPS is 1340K, i.e., it scales linearly. This is still OK.

The problem is that when I run FIO against 3 or 4 targets in parallel, the aggregated IOPS is only 1550-1600K, significantly lower than the expected 2600-2700K.

I also tried taskset so that each of the 4 FIO instances is pinned to 6 CPUs (each FIO runs 6 threads). htop shows the CPUs are only 40-50% loaded most of the time (some do spike to 90%+ for very short periods), and the CPUs are not waiting on I/O.
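For reference, each of the four FIO instances looked roughly like this (a sketch, not my exact job file; the job name and device path are placeholders, and `cpus_allowed` can replace the external taskset):

```ini
; One of the four FIO instances, pinned to 6 CPUs.
; /dev/nvme0n1 is a placeholder for the NVMe-oF namespace.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=6
cpus_allowed=0-5
cpus_allowed_policy=split
group_reporting

[target0]
filename=/dev/nvme0n1
```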

None of the switch ports drops any packets. I also tried a different switch; same result.

I also tried btest instead of FIO on the host; same result.

When I insert the 4 SSDs used in those targets directly into the host, the aggregated IOPS is 2600K+. So the CPU/bus/OS/FIO stack appears able to sustain that IOPS rate.

I tried iperf between the host and the targets; the aggregated throughput is about 92 Gb/s. So there seems to be no issue on this data path for normal TCP/IP traffic. The problem appears to be specific to RoCEv2.

I did set default_roce_mode on the ConnectX-5 to “RoCE v2”. Is there anything else I need to configure for the ConnectX-5 to run RoCEv2 at full speed?
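This is how I set and verified the mode, in case I missed a step (a sketch; the device name mlx5_0 and port 1 are from my setup, so adjust as needed):

```shell
# Query and set the default RoCE mode for RDMA-CM (MLNX_OFED scripts).
cma_roce_mode -d mlx5_0 -p 1        # show current default RoCE mode
cma_roce_mode -d mlx5_0 -p 1 -m 2   # set default to RoCE v2

# Confirm that RoCE v2 GID entries exist for the port.
show_gids
```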