Intermittent timeout when modifying QP to RTR with RoCE

I am using CX6 cards (4123).
I create 4 QPs, each connected to a different remote QP. The first two QPs were set up successfully, but the third one timed out (errno=110).
At first I thought it was a network issue where the NIC could not resolve the remote IP’s MAC address, so I printed the MAC entry before and after the ibv_modify_qp(RTR) call.
Before the call there was no MAC entry; right after the call the MAC entry existed, yet ibv_modify_qp(RTR) had already failed.
The problem is intermittent and hard to reproduce.
In the Mellanox driver, what causes ibv_modify_qp(RTR) to fail with errno=110?
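
For anyone who wants to repeat the before/after MAC check described above, here is a minimal sketch in C. It is not the code used in this report: it assumes IPv4 addressing and the Linux /proc/net/arp table, and the helper names arp_line_match() and arp_lookup() are hypothetical. With IPv6 GIDs the neighbour table would have to be inspected another way (for example with `ip -6 neigh`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 and copies the MAC into mac_out if `line`
 * (one row of /proc/net/arp) is a completed entry for `ip`, otherwise 0. */
int arp_line_match(const char *line, const char *ip,
                   char *mac_out, size_t mac_len)
{
    char addr[64], mac[32];
    unsigned int hw_type, flags;

    assert(line && ip && mac_out);
    /* Row layout: IP address, HW type, Flags, HW address, Mask, Device */
    if (sscanf(line, "%63s %x %x %31s", addr, &hw_type, &flags, mac) != 4)
        return 0;                       /* header row or malformed line */
    if (strcmp(addr, ip) != 0 || !(flags & 0x2))
        return 0;                       /* other IP, or entry not complete (ATF_COM unset) */
    snprintf(mac_out, mac_len, "%s", mac);
    return 1;
}

/* Hypothetical helper: scans an ARP table file (normally /proc/net/arp).
 * Returns 1 if a completed entry exists, 0 if not, -1 if the file cannot be opened. */
int arp_lookup(const char *path, const char *ip, char *mac_out, size_t mac_len)
{
    char line[256];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof line, f))
        found = arp_line_match(line, ip, mac_out, mac_len);
    fclose(f);
    return found;
}
```

Calling arp_lookup("/proc/net/arp", remote_ip, mac, sizeof mac) immediately before and after ibv_modify_qp() shows whether the neighbour entry was already complete when the call started.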

Thanks for any suggestions for solving this problem.

By the way, two years ago, there was a similar question:

But it was closed, and I could not find an answer in that thread.

Hi @ctang1207,

The errno value 110 commonly means “Connection timed out”.

Due to a lack of relevant information, I can’t investigate further.

It would be helpful if you could provide more information about your test environment, such as:

  • Your test environment system information (run the sysinfo-snapshot.py script after installing OFED or DOCA; the generated file, named sysinfo-snapshot-xxx.tgz, will be in /tmp/)
  • Your testing environment topology.
  • Your testing code and relevant compile method.
  • The operating steps to reproduce the issue.

Best regards.

I am unable to provide that information. The failure happens once every few days, on different machines.
I looked at the Linux kernel source code to figure out why ibv_modify_qp(RTR) returns ETIMEDOUT. Here is the related driver code:
_ib_modify_qp() --> ib_resolve_eth_mac() --> ib_resolve_unicast_gid_mac() --> rdma_addr_find_l2_eth_by_grh() --> rdma_resolve_ip() --> process_one_req():

	if (req->status == -ENODATA) {
		src_in = (struct sockaddr *)&req->src_addr;
		dst_in = (struct sockaddr *)&req->dst_addr;
		req->status = addr_resolve(src_in, dst_in, req->addr,
					   true, req->resolve_by_gid_attr,
					   req->seq);
		if (req->status && time_after_eq(jiffies, req->timeout)) {
			req->status = -ETIMEDOUT;
		} else if (req->status == -ENODATA) {
			/* requeue the work for retrying again */
			spin_lock_bh(&lock);
			if (!list_empty(&req->list))
				set_timeout(req, req->timeout);
			spin_unlock_bh(&lock);
			return;
		}
	}

This is the only place that sets -ETIMEDOUT, which is then returned to user space. However, I don’t know why addr_resolve() keeps returning a non-zero code. The call chain inside addr_resolve() is very deep, and I can’t figure out the reason without actually tracing the code (which also requires a kernel tracing environment).
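
To make the decision in the excerpt above easier to reason about, here is a small user-space model of it in C. It is only a sketch under stated assumptions: fake_addr_resolve() and simulate_resolve() are hypothetical names, time is counted in abstract ticks instead of jiffies, and the retry happens at a fixed interval, which is a simplification of how the kernel requeues the work.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for addr_resolve(): the neighbour entry becomes
 * usable at tick `ready_at`; before that the lookup reports -ENODATA. */
static int fake_addr_resolve(long now, long ready_at)
{
    return now >= ready_at ? 0 : -ENODATA;
}

/* Simplified model of the quoted process_one_req() branch.
 * Assumption: the request is retried every `retry_interval` ticks. */
int simulate_resolve(long ready_at, long deadline, long retry_interval)
{
    long now = 0;

    assert(retry_interval > 0);
    for (;;) {
        int status = fake_addr_resolve(now, ready_at);

        /* Same test as in the excerpt: any failure at or past the
         * deadline is converted into -ETIMEDOUT. */
        if (status && now >= deadline)
            return -ETIMEDOUT;
        /* Success (or any error other than -ENODATA) ends the request. */
        if (status != -ENODATA)
            return status;
        /* -ENODATA before the deadline: requeue and try again later. */
        now += retry_interval;
    }
}
```

In this model, simulate_resolve(5, 10, 2) returns 0, while simulate_resolve(11, 10, 2) returns -ETIMEDOUT even though the entry would have been ready one tick after the deadline. That would be consistent with seeing the MAC entry right after the failed call, but it does not explain why resolution took that long in the first place.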

Do you know why addr_resolve() fails intermittently? Note that process_one_req() is a work-queue function, and addr_resolve() has already been called once inside rdma_resolve_ip() (outside the work queue).

Hi @ctang1207 ,

Thanks for your response.

Due to a lack of relevant information, it’s not possible to make a valid judgment.

It would be helpful if you could provide more information about your test environment, such as:

  • Your test environment system information (run the sysinfo-snapshot.py script after installing OFED or DOCA; the generated file, named sysinfo-snapshot-xxx.tgz, will be in /tmp/)
  • Your testing environment topology.
  • Your testing code and relevant compile method.
  • The operating steps to reproduce the issue.

You mentioned that it happens every few days on different machines. In that case, you can provide the information from any one of those machines.

Best regards.

I think the driver code I provided is clear enough. Your driver engineers should be able to figure out why the driver returns ETIMEDOUT to user space.