RDMA_CM connection setup issues

ajameson · December 9, 2015, 10:35pm

Hi Mellanox RDMA community,

I’ve come across a (painfully intermittent) problem establishing RDMA_CM connections between processes running on Linux machines.

The application involves 8 servers and 40 clients. Each of the 8 servers listens for 40 connections from the clients, and each client opens connections to the 8 servers. After connection setup the clients perform RDMA writes to the servers.

The servers perform the following operations (in 40 POSIX threads) during connection setup:

rdma_create_event_channel ()

rdma_create_id ()

rdma_bind_addr ()

rdma_listen ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_CONNECT_REQUEST]

rdma_ack_cm_event ()

rdma_create_qp ()

setup PDs, CQs, etc

rdma_accept ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ESTABLISHED]

and the clients perform the following (in 8 separate POSIX threads):

rdma_create_event_channel ()

rdma_create_id ()

rdma_resolve_addr ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ADDR_RESOLVED]

rdma_ack_cm_event ()

rdma_resolve_route ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ROUTE_RESOLVED]

rdma_ack_cm_event ()

rdma_create_qp ()

setup PDs, CQs, etc

rdma_connect ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ESTABLISHED]
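As a debugging aid, the client sequence above can be written down as a table of event-producing calls and the events expected after each, so that a log line can name the exact step at which something else arrives. This is only a minimal sketch: `cm_step`, `client_steps` and `first_unexpected_step` are hypothetical names, and events are compared as strings, assuming the names come from something like librdmacm's rdma_event_str().

```c
#include <stddef.h>
#include <string.h>

/* One event-producing step of the client sequence listed above. */
struct cm_step {
    const char *call;      /* librdmacm call made by the client */
    const char *expected;  /* event expected from rdma_get_cm_event() */
};

static const struct cm_step client_steps[] = {
    { "rdma_resolve_addr",  "RDMA_CM_EVENT_ADDR_RESOLVED"  },
    { "rdma_resolve_route", "RDMA_CM_EVENT_ROUTE_RESOLVED" },
    { "rdma_connect",       "RDMA_CM_EVENT_ESTABLISHED"    },
};

#define CLIENT_STEP_COUNT (sizeof(client_steps) / sizeof(client_steps[0]))

/* Hypothetical helper: returns the index of the first step whose observed
 * event name differs from the expected one, or -1 if all of them match. */
int first_unexpected_step(const char *const *observed, size_t n)
{
    for (size_t i = 0; i < n && i < CLIENT_STEP_COUNT; i++) {
        if (strcmp(observed[i], client_steps[i].expected) != 0)
            return (int)i;
    }
    return -1;
}
```

Feeding it the events one thread actually saw gives an index into `client_steps`, so the failing call and the thread that made it can be logged together.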

In the vast majority of trials this works fine, but very occasionally a client receives an RDMA_CM_EVENT_CONNECT_ERROR event after calling rdma_connect() and the server receives a corresponding RDMA_CM_EVENT_REJECTED event after the rdma_accept() call.

I thought that the RDMA library would be thread safe, but now I’m not so sure. I’ve not been able to glean any further information about why the connection setup occasionally fails, other than the event->status field on the server being 28.
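For reference, a minimal sketch of how that status value might be decoded. It rests on two assumptions: that the status of an RDMA_CM_EVENT_REJECTED event on InfiniBand carries the IB CM reject reason, and that the numbering matches the Linux kernel's `enum ib_cm_rej_reason`. Only a subset of codes is listed, and `rej_reason_name` is a hypothetical helper, not a librdmacm function.

```c
#include <stddef.h>

/* Assumed numbering: InfiniBand CM reject reasons as in the Linux
 * kernel's enum ib_cm_rej_reason. Only a subset is listed here. */
struct rej_reason {
    int code;
    const char *name;
};

static const struct rej_reason rej_reasons[] = {
    { 1,  "no QP available" },
    { 3,  "no resources available" },
    { 4,  "timeout" },
    { 8,  "invalid service ID" },
    { 10, "stale connection" },
    { 28, "consumer defined" },
};

/* Hypothetical helper: returns a short name for a reject status,
 * or "unknown" for any code missing from the table above. */
const char *rej_reason_name(int status)
{
    size_t n = sizeof(rej_reasons) / sizeof(rej_reasons[0]);
    for (size_t i = 0; i < n; i++) {
        if (rej_reasons[i].code == status)
            return rej_reasons[i].name;
    }
    return "unknown";
}
```

Under these assumptions, a status of 28 maps to "consumer defined", which would suggest the rejection was issued by the CM consumer on the remote side rather than caused by a timeout or a missing QP.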

I’d appreciate any tips on how to proceed in debugging this connection problem.

The hardware in use is Mellanox ConnectX-3 FDR10 cards connected to a SwitchX FDR10 switch, with the Subnet Manager running on a server. The software stack is MLNX_OFED_LINUX-3.0-2.0.1-rhel6.6-x86_64.

Thanks!

Andrew

alekseys1 · December 14, 2015, 4:39pm

Did you try running a smaller job size?

Maybe the status field in the event will tell more about the error?

ajameson · April 26, 2017, 11:20pm

For the record, the issue has been largely resolved by ensuring the ibacm service (InfiniBand Assistant Communication Manager) was running on all servers.
