Mellanox ConnectX-3 Pro RDMA issues

I’m running an S2D cluster with three Dell PowerEdge R740xd servers, interconnected through two Dell S4048-ON switches. On Windows Server I installed the WinOF driver and configured DCB with PFC for the lossless network.
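For reference, the host-side DCB/PFC configuration on each node follows the usual Windows Server RoCE pattern, roughly as sketched below (priority 3, the 50% ETS share, and the adapter names are placeholders for my actual values):

# Tag SMB Direct (NetDirect port 445) traffic with 802.1p priority 3
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
# Enable PFC only on the SMB priority; keep it disabled everywhere else
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7
# Reserve bandwidth for SMB via ETS and apply DCB on the RDMA adapters
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
Enable-NetAdapterQos -Name "SLOT 2 Port 1","SLOT 2 Port 2"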

The performance of my disk array is poor, and in the SMBServer logs I see the following being logged:

"RDMA connection disconnected.

Transport name: \Device\RdmaSmbIpv4_10.31.1.4

Milliseconds spent closing the connection: 0

Guidance:

Closing an RDMA connection should not take longer than 2 minutes. An RDMA IO that takes an abnormally long time to complete indicates a problem with the RDMA network adapters on this computer or its remote host. Contact your RDMA vendor for an updated driver and further troubleshooting."

And here is the output of vstat:

"hca_idx=0

uplink={BUS=PCI_E Gen3, SPEED=8.0 Gbps, WIDTH=x8, CAPS=8.0*x8}

MSI-X={ENABLED=1, SUPPORTED=128, GRANTED=24, ALL_MASKED=N}

vendor_id=0x02c9

vendor_part_id=4103

hw_ver=0x0

fw_ver=2.42.5000

PSID=MT_1090111023

node_guid=248a:0703:00bb:4210

num_phys_ports=2

port=1

port_guid=268a:07ff:febb:4210

port_state=PORT_ACTIVE (4)

link_speed=NA

link_width=NA

rate=40.00 Gbps

port_phys_state=LINK_UP (5)

active_speed=40.00 Gbps

sm_lid=0x0000

port_lid=0x0000

port_lmc=0x0

transport=RoCE v2.0

rroce_udp_port=0x12b7

max_mtu=2048 (4)

active_mtu=2048 (4)

GID[0]=0000:0000:0000:0000:0000:ffff:0a1f:0104

GID[1]=fe80:0000:0000:0000:3048:ef64:8d42:fbc3

port=2

port_guid=268a:07ff:febb:4211

port_state=PORT_ACTIVE (4)

link_speed=NA

link_width=NA

rate=40.00 Gbps

port_phys_state=LINK_UP (5)

active_speed=40.00 Gbps

sm_lid=0x0000

port_lid=0x0000

port_lmc=0x0

transport=RoCE v2.0

rroce_udp_port=0x12b7

max_mtu=2048 (4)

active_mtu=2048 (4)

GID[0]=0000:0000:0000:0000:0000:ffff:0a1f:0204

GID[1]=fe80:0000:0000:0000:85ba:adfd:5483:2300"

You have not indicated which Windows Server OS your S2D cluster is running, nor which WinOF version you are using, but the “RDMA connection disconnected” log entries indicate that you have a flaky, unstable RDMA connection.
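Before anything else, it is worth confirming on each node that RDMA is enabled on the adapters and that SMB actually sees RDMA-capable interfaces, using the in-box Windows cmdlets:

PS C:\> Get-NetAdapterRdma               # "Enabled" should be True on the RDMA ports
PS C:\> Get-SmbClientNetworkInterface    # the interfaces should report "RDMA Capable : True"
PS C:\> Get-SmbMultichannelConnection    # live SMB connections should be using the RDMA interfaces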

  1. I would suggest that you first make sure you are running the latest GA-release WinOF driver, and that you pick the correct driver for your OS.

Use the link below to download the WinOF driver:

https://www.mellanox.com/products/adapter-software/ethernet/windows/winof-2

  2. Once the driver is installed and the server is rebooted, verify that you are on the new driver by running:

PS C:\> Get-MLNXPCIDevice | findstr "Driver Version"

DriverVersion : 5.50.14688.0

FirmwareVersion :

(the firmware version 2.42.5000 that you are using is fine and is the latest)

  3. As for your comment that disk performance is poor: I would suggest that you first check and confirm that the raw RDMA networking performance between the bare-metal ConnectX-3 Pro adapters is ~40 Gb/s, using the nd_write_bw benchmark utility (see the Mellanox WinOF User Manual); a sample invocation follows.
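A sketch of a typical run, using the 10.31.1.4 address from your transport name as a stand-in (the exact flags may differ per WinOF release, so check the User Manual; -a sweeps all message sizes, and the server side must be started first):

PS C:\> nd_write_bw -a -f 2 -S 10.31.1.4    # on the receiving node, listening on its RDMA IP
PS C:\> nd_write_bw -a -f 2 -C 10.31.1.4    # on the sending node, connecting to that IP

You should see bandwidth approaching line rate (~40 Gb/s) at the larger message sizes; anything far below that points at the network rather than the storage stack.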

  4. Assuming you have configured RDMA/SMB Direct with PFC per the Mellanox best practices (see the link below), the most important step is to check with Dell and get their confirmation that PFC settings equivalent to the ones we document for Mellanox switches are implemented on their S4048-ON switches as well. A host-side check is sketched after the link.

https://community.mellanox.com/s/article/howto-configure-roce-v2-for-connectx-3-pro-using-mellanox-switchx-switches
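On the Windows side, the host PFC/QoS state can be read back with the in-box NetQos cmdlets (priority 3 below is the value commonly assigned to SMB Direct traffic; substitute whatever your deployment uses):

PS C:\> Get-NetQosFlowControl    # PFC should be enabled only on the SMB priority (typically 3)
PS C:\> Get-NetQosPolicy         # the SMB policy should match NetDirect port 445 to that priority
PS C:\> Get-NetAdapterQos        # shows the operational DCB/ETS/PFC state on each adapter

The host and switch must agree on the priority carrying RDMA traffic; otherwise PFC pauses are ignored and the fabric is no longer lossless.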

  5. Use an fio test to check whether the storage performance (write/read throughput, IOPS, etc.) is as expected; an example run is sketched below.
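A minimal sketch of such a run against a cluster-shared volume (the file path, size, block size, and queue depth are illustrative placeholders; note the escaped drive colon that fio requires on Windows):

PS C:\> fio --name=s2dtest --filename=C\:\ClusterStorage\Volume1\fio.dat --size=10G --ioengine=windowsaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --group_reporting

Compare the resulting IOPS and bandwidth against what the raw nd_write_bw numbers suggest the network can carry; a large gap points back at the storage stack rather than at RDMA.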

Hi Avi,

I use Windows Server 2019, fully updated with the latest patches from Microsoft, and the WinOF 5.50.52000 driver.