Issues with ConnectX-6 Throughput Under InfiniBand

Hi Guys:

Currently, we are using two Mellanox 100G VPI NICs (MCX653106A-ECAT).

We noticed that under the InfiniBand connection, our RDMA Read/Write could not saturate 100 Gbps, reaching only ~77 Gbps, as shown below:

We use ib_read_bw / ib_write_bw for testing. Interestingly, when we tried the same physical configuration under RoCE, the throughput could reach 93 Gbps, as shown below:

The RDMA source server has PCIe Gen4 x16.
The RDMA request client has PCIe Gen3 x16.

We have tried swapping client and server, but the throughput was still stuck at 77 Gbps. We also tried perf, with the same result.
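For reference, the numbers above come from back-to-back perftest runs along these lines (a minimal sketch; the `mlx5_0` device name and the server hostname are placeholders for our setup):

```shell
# Server side: start the bandwidth test listener on the IB device
ib_read_bw -d mlx5_0 --report_gbits

# Client side: connect to the server and run the read bandwidth test
ib_read_bw -d mlx5_0 --report_gbits <server-hostname>
```

The same pattern applies to ib_write_bw; `--report_gbits` prints throughput in Gb/s instead of MB/s.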

Our cable supports EDR. The iblinkinfo output and the mlxlink info from one of the cards are shown below:

CA: janux-03 HCA-2:
      0xb83fd20300595d01      2    1[  ] ==( 4X      25.78125 Gbps Active/  LinkUp)==>       1    1[  ] "janux-spr1 HCA-2" ( Could be 53.125 Gbps)
CA: janux-spr1 HCA-2:
      0xb83fd20300595fe1      1    1[  ] ==( 4X      25.78125 Gbps Active/  LinkUp)==>       2    1[  ] "janux-03 HCA-2" ( Could be 53.125 Gbps)
Operational Info
State                           : Active
Physical state                  : LinkUp
Speed                           : IB-EDR
Width                           : 4x
FEC                             : Standard LL RS-FEC - RS(271,257)
Loopback Mode                   : No Loopback
Auto Negotiation                : ON

Supported Info
Enabled Link Speed              : 0x00000035 (EDR,FDR,QDR,SDR)
Supported Cable Speed           : 0x0000003f (EDR,FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info
Status Opcode                   : 0
Group Opcode                    : N/A
Recommendation                  : No issue was observed

Tool Information
Firmware Version                : 20.36.1010
amBER Version                   : 2.09
MFT Version                     : mft 4.23.1-7

We updated both NICs to firmware 20.36.1010.

The server is directly connected to the client, with no switches or other components in between.

We have already tried every method we could find in the community and on the internet.

Please offer help if possible. Appreciate your help very much.


  1. Have you tried pinning the process to a core that resides on the same NUMA node as the card?
  • You need to do it on both ends. To see which NUMA node the cards reside on, use (assuming you have the MFT tools installed):
    mst status -v

  • Then run on one of the cores on that NUMA node using (among other methods) taskset -c ib_write_bw…

  2. To configure and optimize the machine for bandwidth performance, use: mlnx_tune -p HIGH_THROUGHPUT

  3. Ensure the machine is configured for high performance (disable C-states in the BIOS, etc.)
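The pinning steps above can be sketched as follows (a sketch under assumptions: `mlx5_0`, the core number, and the hostname are placeholders for your setup):

```shell
# 1. Find the NUMA node of the HCA. 'mst status -v' shows it per device;
#    alternatively, read it straight from sysfs:
cat /sys/class/infiniband/mlx5_0/device/numa_node

# 2. List which cores belong to that NUMA node:
lscpu | grep "NUMA node"

# 3. Server side: pin ib_write_bw to a core on that node:
taskset -c 18 ib_write_bw -d mlx5_0 --report_gbits

# 4. Client side: same pinning, pointing at the server:
taskset -c 18 ib_write_bw -d mlx5_0 --report_gbits <server-hostname>
```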

Ah, I have tried these methods, but none of them worked.

I pinned the server program to CPU 18. The client machine only has one NUMA node, so pinning does not matter there.
mlnx_tune is also already set to HIGH_THROUGHPUT.
Also, the CPU frequency was fixed and C-states were disabled.

What is interesting is that no matter at what frequency we run ib_write_bw, 77 Gbps is always the ceiling. I don't recall 77 Gbps being any sort of speed ceiling: not a PCIe speed limit, not a DDR speed limit.

The MCX653106A is a dual-port 100G EDR card (2×100G); you have to insert it into a PCIe Gen4 x16 slot.


Since the combined bandwidth requirement of 25 GB/s for both 100Gb ports is less than the total available bandwidth of roughly 32 GB/s for a PCIe Gen4 x16 slot, a dual-port 100Gb NIC should be able to operate at full capacity in a PCIe Gen4 x16 slot.
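As a sanity check, the arithmetic can be worked out explicitly. This sketch assumes 128b/130b line encoding (used by PCIe Gen3 and Gen4) and ignores TLP/protocol overhead, so the real usable numbers are somewhat lower:

```python
# Back-of-the-envelope PCIe bandwidth vs. dual-port 100G requirement.
# Assumes 128b/130b encoding; protocol overhead is ignored.

def pcie_bw_gbytes(gt_per_s: float, lanes: int, enc: float = 128 / 130) -> float:
    """Raw PCIe bandwidth in GB/s for a given transfer rate and lane count."""
    return gt_per_s * lanes * enc / 8

gen3_x16 = pcie_bw_gbytes(8.0, 16)    # Gen3: 8 GT/s per lane -> ~15.8 GB/s
gen4_x16 = pcie_bw_gbytes(16.0, 16)   # Gen4: 16 GT/s per lane -> ~31.5 GB/s
dual_port_need = 2 * 100 / 8          # 2 x 100 Gb/s = 25 GB/s

print(f"Gen3 x16:     {gen3_x16:.1f} GB/s")
print(f"Gen4 x16:     {gen4_x16:.1f} GB/s")
print(f"2x100G need:  {dual_port_need:.1f} GB/s")
```

So a single 100G port (12.5 GB/s) fits comfortably in Gen3 x16, but saturating both ports requires a Gen4 x16 slot.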


Hi Xiaofeng:

I am actually only using one port, leaving the other port unconnected. Technically, Gen3 x16 is fast enough to handle one port saturating 100 Gbps.

However, your answer reminds me of something about the two ports: I set one port to InfiniBand and the other to Ethernet. It is possible that, since they are in different modes, they cannot share the PCIe bandwidth in a clever way, so the Mellanox firmware decided to brutally divide the bandwidth into two parts: one for the Ethernet port and one for the InfiniBand port.

I changed the other, unconnected port to InfiniBand, and the problem is solved.
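For reference, the port personality can be switched with mlxconfig (a sketch, assuming the MFT tools are installed; the MST device path below is a placeholder for your card, and for LINK_TYPE_Pn, 1 = InfiniBand and 2 = Ethernet):

```shell
# Query the current port protocols (LINK_TYPE_P1 / LINK_TYPE_P2):
mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE

# Set the second port back to InfiniBand (1 = IB, 2 = ETH):
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P2=1

# Apply the new configuration with a firmware reset (or reboot):
mlxfwreset -d /dev/mst/mt4123_pciconf0 reset
```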

Thank you guys


Splitting the card into dual mode (ETH/IB) affects the per-port buffer management on the card.
I am not sure whether that is the root cause of this phenomenon.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.