I am hosting an InfiniBand server on a Linux machine, and I have also created a client that connects to that service on the same machine. This works fine most of the time. But in one instance, when I tried to connect to the server from the same client (with no prior connection to that server), it threw RDMA_CM_EVENT_ROUTE_ERROR and the connection couldn’t be established.

I don’t know the root cause of this error, and it is not 100% reproducible. This makes my application unreliable. I want to know the root cause of it…

Without a reproduction it is impossible to resolve this. However, RDMA_CM_EVENT_ROUTE_ERROR usually means that there is no route to the specified host. You can verify this with a simple ‘ping’ command. Try analyzing the routing table on your host (it may have duplicate entries). If you are using a dual-port card, check whether disconnecting one port makes the issue go away.
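
The routing-table check can be partly automated. Here is a minimal Python sketch, assuming the table is available as the text printed by ‘ip route show’ and that the destination is the first field of each top-level line. The function name find_duplicate_routes and the addresses in the usage note are made up for illustration. It only flags destinations that are listed more than once. Such entries deserve a closer look but are not necessarily wrong.

```python
from collections import defaultdict


def find_duplicate_routes(route_text):
    """Return {destination: [lines]} for destinations listed more than once.

    Assumption: `route_text` looks like `ip route show` output, where the
    first whitespace-separated field of each top-level line is the
    destination (e.g. "default" or "192.168.10.0/24").
    """
    by_dest = defaultdict(list)
    for line in route_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[:1].isspace():
            continue  # skip indented continuation lines (e.g. "nexthop ...")
        by_dest[line.split()[0]].append(line.strip())
    # Keep only destinations that appear on more than one line.
    return {dest: lines for dest, lines in by_dest.items() if len(lines) > 1}
```

For example, if both ports of a dual-port card sit on the same subnet, the text "192.168.10.0/24 dev ib0 ..." followed by "192.168.10.0/24 dev ib1 ..." would be reported under the key "192.168.10.0/24" with both lines attached.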

rdma_resolve_route depends on the OS kernel routing; if that doesn’t work, RDMA route resolution will fail.

Verify that you are using the latest version of the Mellanox OFED stack.

Be sure to use the latest firmware version.

Be sure you are using the subnet manager that comes with the Mellanox OFED stack.

Check the output of the ibv_devinfo command and make sure that the ‘guid’ values of the node are not ‘0’ (all zeros).
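
If you want to script the GUID check, something like the following Python sketch may help. It assumes the usual ibv_devinfo text layout, with an "hca_id:" line per device followed by indented fields such as "node_guid:" and "sys_image_guid:" whose values are colon-separated hex groups. The function name find_zero_guids is hypothetical, and the parsing is based on that assumed layout rather than on any documented format guarantee.

```python
import re

# Matches lines such as "    node_guid:    0002:c903:0009:f8ae" (assumed layout).
GUID_LINE = re.compile(r"^\s*(\w*guid):\s*([0-9a-fA-F:]+)\s*$")


def find_zero_guids(devinfo_text):
    """Return a list of (device, field) pairs whose GUID is all zeros."""
    zero = []
    device = None  # name from the most recent "hca_id:" line, if any
    for line in devinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("hca_id:"):
            device = stripped.split(":", 1)[1].strip()
            continue
        match = GUID_LINE.match(line)
        if not match:
            continue
        field, value = match.groups()
        # All-zero GUID: only "0" digits and ":" separators.
        if "0" in value and set(value) <= {"0", ":"}:
            zero.append((device, field))
    return zero
```

Feed it the captured output of ibv_devinfo as a single string. An empty list means no zero GUID was found in the fields it recognizes.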

Check ‘dmesg’ and the syslog files; you may see additional information there that will help.

As an additional diagnostic, test with the ‘ib_read_lat’ application, for example, using the ‘-R’ flag (which makes it connect through rdma_cm). If it works and your application doesn’t, check the application code.