RDMA issues with "ConnectX-5 Ex 40GbE Dual-Port QSFP28"

Hey Everyone!

I’m facing a strange issue here that I can’t seem to solve :(

Long story short, I have around 20 servers to set up.

After installing the first 9, I realized that some of them are acting up.

The weird thing is that when setting up RDMA, most of them can communicate nicely (send/receive), but some just won’t send or receive…

I have 2 types of servers, and the second type is the most troublesome one.

The install is identical: they all have the same Ubuntu 20.04 setup, the first 6 are completely identical in terms of hardware components, and the last 3 are identical to each other as well.

They do, however, all have the same NIC: “ConnectX-5 Ex 40GbE Dual-Port QSFP28”.

Here’s an overview I made of which ones are and aren’t able to ping each other (using rping).

[Connectivity overview diagram]

So the MA server can connect to W1, W2, … and vice versa.

Same for the W* ones.

Then the weirdness starts.

C1, C2, and C3 can rping to MA, W1, W3, and W4, but NOT to W2.

Also C1 is able to rping C3 but not the other way around.
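For reference, each pairwise check was a plain rping server/client run, roughly like this (the IP address is just an example following the 192.168.0.x scheme in the outputs further down):

# on the receiving host, start the rping server bound to its RDMA-capable address
rping -s -a 192.168.0.10 -v -C 10

# on the sending host, point the client at that address
rping -c -a 192.168.0.10 -v -C 10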

I hope someone can point me in the right direction on where to troubleshoot, because I’m out of ideas.

I’ve already tried replacing transceiver modules and swapping cables around…

Nothing seemed to have worked.

Edit: iperf3 works (both ways), regular ping works, and I just installed the latest driver; nothing changed…

Any advice would be extremely appreciated!!

Thank you in advance!

What is the error you are getting?

Are you able to run ib_write_bw? ib_write_bw -R?
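Something along these lines, with the device name, GID index, and server IP adjusted to your setup (the values below are only examples):

# server side (passive)
ib_write_bw -d mlx5_0 -x 3

# client side (active), pointing at the server's IP
ib_write_bw -d mlx5_0 -x 3 192.168.0.25

# for the rdma_cm variant, add -R on both sides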

Hey @Aleksey Senin, and thank you for your response.

The first 7 servers can ib_write_bw to and from each other without any problems whatsoever.

The C-series servers can ib_write_bw to and from 3 of those 7 servers, but not the other 4 (even though they are identical and have an identical setup).

Example of a completed test:

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0000 QPN 0x00e5 PSN 0x72f5e5 RKey 0x1c4b6a VAddr 0x007ffff7aea000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:10
 remote address: LID 0000 QPN 0x00b6 PSN 0xef4205 RKey 0x183fee VAddr 0x007ffff7ae9000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:25
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 2973.833000 != 3299.583000. CPU Frequency is not max.
 65536      5000           4667.63            4663.27               0.074612
---------------------------------------------------------------------------------------

And a failed one (sending):

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0000 QPN 0x00d1 PSN 0x52f93f RKey 0x1b482d VAddr 0x007ffff7aea000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
 remote address: LID 0000 QPN 0x00b7 PSN 0xe84afa RKey 0x183fef VAddr 0x007ffff7ae9000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:25
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully

I’ve been trying to figure out why and how, but currently no solution yet :(

The only difference here is the servers themselves.

However, they all have the same exact NICs…

Just checked lsmod and the same modules are loaded as well…
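In case it helps with further suggestions: driver and firmware versions can be compared across the boxes with something like this (the interface and device names are just examples):

# driver and firmware version of the port, as seen by the kernel
ethtool -i enp1s0f0

# firmware version, port state, and link layer, as seen by the verbs stack
ibv_devinfo -d mlx5_0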

Hi Elio,

What type of topology is this?

What switch/switches are in between?

Is PFC/ECN configured? Any other special configuration on the switch?

Is there any major difference between the way these servers are configured (Bonding, etc)?

Different VLANs?

Do we see the same behavior when utilizing RDMA_CM (ib_write_bw -R)?

Can you provide the output of:

show_gids

as well?
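If helpful, the RoCE version and GID for a given index can also be read straight from sysfs (device, port, and index below are examples matching the GID index 3 from your output):

# RoCE version (v1 vs v2) used by GID index 3 on port 1 of mlx5_0
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3

# the GID entry itself
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3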

Thanks,

Mellanox Technical Support

Be sure you are using the latest MOFED v5.3.

Try excluding the switch and connecting the hosts directly to one another.

Validate that there is no firewall blocking the traffic.

Be sure you are using a correct IP configuration and are not using both ports with IPs from the same subnet. Test with only one port up and the other down/disconnected.
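For example, something like this confirms the addressing and takes the second port out of the picture (the interface name is just a placeholder):

# verify each port's IP and subnet
ip -br addr show

# temporarily take the second port down while testing
ip link set enp1s0f1 down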

Concentrate on C1/C3 communication. You might use tcpdump with the latest libpcap to capture the RoCE traffic and see whether packets are transferred between the hosts and whether communication breaks in the middle.
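A sketch of such a capture, assuming a libpcap build with RDMA sniffing support and example device/file names; RoCEv2 traffic appears as UDP port 4791 in the capture:

# capture on the RDMA device itself (requires libpcap built with RDMA sniffing support)
tcpdump -i mlx5_0 -w c1_to_c3.pcap

# read the capture back, filtering for RoCEv2 frames
tcpdump -r c1_to_c3.pcap udp port 4791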

Swap the cards and see if the issue follows the cards or stays with the host

Swap ports on the switch and see if the issue stays with the server or follows the switch port.

Swap the cables to see if the issue follows the cable or stays with the host/switch port.

Further troubleshooting will most likely require collecting and analyzing logs, which requires a valid support contract; if you have one, you may open a support ticket with networking-support@nvidia.com.