I have a problem porting my RDMA application from InfiniBand (Mellanox ConnectX-3 40Gb IB) to RoCE (ConnectX-4 100GbE).

I have a small application written in C that tests RDMA write. It works perfectly on a Mellanox ConnectX-3 40Gb IB NIC. We got new Mellanox ConnectX-4 100GbE hardware, which supports RoCE (testing with the ‘ib_send_bw’ tool shows throughput close to 98Gbps, which is exciting). I modified the code that moves the queue pair to the RTR/RTS states:

  1. set queue pair attribute: attr->ah_attr.grh fields

  2. set attr->ah_attr.is_global to 1

The problem shows up at ibv_poll_cq() after the RDMA write requests are posted. The work completion object (struct ibv_wc) reports a failure with status=10 (IBV_WC_REM_ACCESS_ERR). I double-checked my ibv_reg_mr() call, and it does set all of the access flags:

===============================================================

ctxt.mr = ibv_reg_mr(ctxt.pd, ctxt.pages, page_size * MAX_PAGE,
                     IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);

===============================================================

I wondered what was happening, so I printed the vendor_err field of the ibv_wc object:

status=10, qp_num=323, vendor_err=136

I can’t find a reference explaining vendor_err=136, but I do have some information reported by the driver (it should come from ./libmlx5-1.0.2mlnx1/src/cq.c):

================================================================

mlx5: compute28: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 00008813 08000143 0000fed0

================================================================

I guess those numbers mean something to Mellanox people. I hope you can help me out with this problem. BTW, the OFED version I use is 3.2-2.0.0.0 for ubuntu12.04-x86_64.
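
The last row of the driver dump can be cross-checked against the values printed from ibv_wc. The sketch below assumes the dump is the 64-byte error CQE printed as big-endian 32-bit words, and that its tail follows the `struct mlx5_err_cqe` layout in libmlx5 (vendor error syndrome, syndrome, then opcode and QP number packed in one word). The helper and struct names are made up for illustration; treat the field positions as an assumption to verify against the libmlx5 headers you actually have installed.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the fields of interest (not a libmlx5 type). */
struct err_cqe_tail {
    uint8_t  vendor_err;  /* assumed: byte 54 of the CQE */
    uint8_t  syndrome;    /* assumed: byte 55 of the CQE */
    uint8_t  wqe_opcode;  /* assumed: top byte of the opcode/QPN word */
    uint32_t qpn;         /* assumed: low 24 bits of the opcode/QPN word */
};

/* w13 and w14 are the 2nd and 3rd words of the last dump row
 * (0x00008813 and 0x08000143 in the dump above). */
static struct err_cqe_tail decode_err_cqe_tail(uint32_t w13, uint32_t w14)
{
    struct err_cqe_tail t;
    t.vendor_err = (uint8_t)((w13 >> 8) & 0xff);
    t.syndrome   = (uint8_t)(w13 & 0xff);
    t.wqe_opcode = (uint8_t)(w14 >> 24);
    t.qpn        = w14 & 0xffffffu;
    return t;
}

/* Print the decoded fields in the same style as the ibv_wc printout. */
static void print_err_cqe_tail(const struct err_cqe_tail *t)
{
    printf("vendor_err=%u, syndrome=0x%02x, wqe_opcode=0x%02x, qp_num=%u\n",
           (unsigned)t->vendor_err, (unsigned)t->syndrome,
           (unsigned)t->wqe_opcode, (unsigned)t->qpn);
}
```

Decoding 0x00008813 and 0x08000143 this way gives vendor_err=136 and qp_num=323, which match the ibv_wc values above, and syndrome 0x13, which libmlx5 maps to a remote access error. So the dump appears to restate the same failure rather than add a new clue; the meaning of vendor_err=136 itself is not documented anywhere I can point to.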

Thanks Erez. Actually, I found the solution later. It was a bug in my code: I didn’t set up the GID correctly in the global routing header (GRH) of the queue pair attributes. After I fixed that bug, it works perfectly.

Hi,

I recommend opening a ticket with support@mellanox.com and providing a code snippet of what you’re trying to do, so it gets routed to the appropriate resource.