RDMA_CM connection setup issues

Hi Mellanox RDMA community,

I’ve come across a (painfully intermittent) problem establishing RDMA_CM connections between processes running on Linux machines.

The application involves 8 servers and 40 clients. Each of the 8 servers listens for 40 connections from the clients, and each client opens connections to the 8 servers. After connection setup the clients perform RDMA writes to the servers.

The servers perform the following operations (in 40 POSIX threads) during connection setup:

rdma_create_event_channel ()

rdma_create_id ()

rdma_bind_addr ()

rdma_listen ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_CONNECT_REQUEST]

rdma_ack_cm_event ()

rdma_create_qp ()

set up PDs, CQs, etc.

rdma_accept ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ESTABLISHED]

and the clients perform the following (in 8 separate POSIX threads):

rdma_create_event_channel ()

rdma_create_id ()

rdma_resolve_addr ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ADDR_RESOLVED]

rdma_ack_cm_event ()

rdma_resolve_route ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ROUTE_RESOLVED]

rdma_ack_cm_event ()

rdma_create_qp ()

set up PDs, CQs, etc.

rdma_connect ()

rdma_get_cm_event () [receives RDMA_CM_EVENT_ESTABLISHED]

In the vast majority of trials this works fine, but very occasionally a client receives an RDMA_CM_EVENT_CONNECT_ERROR event after calling rdma_connect(), and the server receives a corresponding RDMA_CM_EVENT_REJECTED event after the rdma_accept() call.

I thought that the rdma library would be thread safe, but now I’m not so sure. I’ve not been able to glean any further information about why the connection setup occasionally fails, other than the event->status field on the server being 28.

I’d appreciate any tips on how to proceed in debugging this connection problem.

The hardware in use is Mellanox ConnectX-3 FDR10 cards connected to a SwitchX FDR10 switch, with the Subnet Manager running on a server. The software stack is MLNX_OFED_LINUX-3.0-2.0.1-rhel6.6-x86_64.

Thanks!

Andrew

Did you try running a smaller job size?

Maybe the status field in the event will tell you more about the error?

For the record, the issue has been largely resolved by ensuring the ibacm service (InfiniBand Assistant Communication Manager) was running on all servers.