How to troubleshoot/diagnose IB completion errors?

Setup is a ConnectX-5 Ex (MCX516A-CDAT) NIC with firmware v16.30.1004 and a basic installation of Mellanox OFED 5.3-1.0.0.1, on Ubuntu 20.04.2 LTS (kernel 5.4.0-74-generic, x86_64).

The primary application catches work completion errors in ibv_wc->status after the [ibv_poll_cq](https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/) call. The QPs are set up with IBV_QPT_RAW_PACKET, and the application receives UDP packets. There is also a printout that seems to come from the driver itself; everything is posted below.

In an example ping-pong application (from the examples) no such errors occur. I’ve seen a post that attributed a similar error to the format of the packets themselves.

  1. If this is the cause here, what do the packets need to comply with?
  2. If that’s not the case, where is the documentation on the vendor error codes and on the larger error dump that is printed out?

Error printout below. The completion error code is 0x4 first, then 0x10 twice, then 0x5 (this last one repeats indefinitely). The vendor error jumps around too: 0x32, then 0x99 twice, then 0xf9 indefinitely. The last 8 hex characters of the mlx5 completion dump change each time.
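For reference, the numeric status codes can be turned into names. The sketch below keeps a local copy of the `enum ibv_wc_status` names (through `IBV_WC_GENERAL_ERR`) so that it compiles without libibverbs; in the real application `ibv_wc_status_str()` does the same job. The helper names `wc_status_name` and `first_root_cause` are made up for illustration. The second helper reflects the usual pattern that flush errors (0x5) follow once a QP has entered the error state, so the first non-flush error is the one worth chasing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names of enum ibv_wc_status values, indexed by their numeric value.
 * Local copy for illustration; libibverbs offers ibv_wc_status_str(). */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",            /* 0x00 */
    "IBV_WC_LOC_LEN_ERR",        /* 0x01 */
    "IBV_WC_LOC_QP_OP_ERR",      /* 0x02 */
    "IBV_WC_LOC_EEC_OP_ERR",     /* 0x03 */
    "IBV_WC_LOC_PROT_ERR",       /* 0x04 */
    "IBV_WC_WR_FLUSH_ERR",       /* 0x05 */
    "IBV_WC_MW_BIND_ERR",        /* 0x06 */
    "IBV_WC_BAD_RESP_ERR",       /* 0x07 */
    "IBV_WC_LOC_ACCESS_ERR",     /* 0x08 */
    "IBV_WC_REM_INV_REQ_ERR",    /* 0x09 */
    "IBV_WC_REM_ACCESS_ERR",     /* 0x0a */
    "IBV_WC_REM_OP_ERR",         /* 0x0b */
    "IBV_WC_RETRY_EXC_ERR",      /* 0x0c */
    "IBV_WC_RNR_RETRY_EXC_ERR",  /* 0x0d */
    "IBV_WC_LOC_RDD_VIOL_ERR",   /* 0x0e */
    "IBV_WC_REM_INV_RD_REQ_ERR", /* 0x0f */
    "IBV_WC_REM_ABORT_ERR",      /* 0x10 */
    "IBV_WC_INV_EECN_ERR",       /* 0x11 */
    "IBV_WC_INV_EEC_STATE_ERR",  /* 0x12 */
    "IBV_WC_FATAL_ERR",          /* 0x13 */
    "IBV_WC_RESP_TIMEOUT_ERR",   /* 0x14 */
    "IBV_WC_GENERAL_ERR",        /* 0x15 */
};

/* Hypothetical helper: name for a status value, or "unknown". */
const char *wc_status_name(unsigned status)
{
    size_t n = sizeof wc_status_names / sizeof wc_status_names[0];
    return status < n ? wc_status_names[status] : "unknown";
}

/* Hypothetical helper: index of the first completion that is neither
 * success (0x0) nor a flush error (0x5), or -1 if there is none. */
int first_root_cause(const unsigned *statuses, size_t n)
{
    assert(statuses != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (statuses[i] != 0x0 && statuses[i] != 0x5)
            return (int)i;
    return -1;
}
```

Applied to the sequence described above, 0x4 is `IBV_WC_LOC_PROT_ERR`, 0x10 is `IBV_WC_REM_ABORT_ERR` and 0x5 is `IBV_WC_WR_FLUSH_ERR`, and `first_root_cause` points at the very first completion.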

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20009232 00000000 0000203a

000006c1 920c3204 00000000 000030e0

0: got completion error 0x4 vendor error 0x32 (wr_id 0 qp_num 0)

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20000099 00000000 0000203a

000006c1 000c9922 00000000 000116e0

0: got completion error 0x10 vendor error 0x99 (wr_id 1 qp_num 0)

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20000099 00000000 0000203a

000006c1 000c9922 00000000 000216e0

0: got completion error 0x10 vendor error 0x99 (wr_id 2 qp_num 0)

0: got completion error 0x5 vendor error 0xf9 (wr_id 3 qp_num 4665)
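On the larger dump: as far as I can tell, the four rows are the 64-byte CQE printed as sixteen 32-bit words (converted from big-endian). The sketch below assumes the layout of `struct mlx5_err_cqe` in rdma-core's mlx5 provider, which is worth checking against the installed version. Under that assumption the second word of the last row carries the vendor error syndrome and the hardware syndrome, and the last word carries the WQE counter, a signature byte and the opcode/owner byte. The names `err_cqe_fields` and `decode_err_cqe` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields taken from the tail of a 64-byte mlx5 error CQE.
 * Byte offsets assume struct mlx5_err_cqe from rdma-core (an assumption). */
struct err_cqe_fields {
    uint8_t  vendor_err_synd; /* byte 54, printed as "vendor error" */
    uint8_t  syndrome;        /* byte 55, hardware syndrome */
    uint16_t wqe_counter;     /* bytes 60-61 */
    uint8_t  opcode;          /* high nibble of byte 63 (op_own) */
};

/* Hypothetical helper: decode the 16 words exactly as printed in the dump,
 * row by row, left to right. */
struct err_cqe_fields decode_err_cqe(const uint32_t words[16])
{
    struct err_cqe_fields f;

    assert(words != NULL);
    f.vendor_err_synd = (uint8_t)((words[13] >> 8) & 0xff);
    f.syndrome        = (uint8_t)(words[13] & 0xff);
    f.wqe_counter     = (uint16_t)(words[15] >> 16);
    f.opcode          = (uint8_t)((words[15] & 0xff) >> 4);
    return f;
}
```

For the first dump above, the word 920c3204 decodes to vendor error 0x32 and syndrome 0x04, which matches the printed "completion error 0x4 vendor error 0x32". For the second, 000c9922 decodes to 0x99 and syndrome 0x22, which the provider appears to report as status 0x10. The changing last word would then be the WQE counter (0, 1, 2, in step with wr_id) plus the signature byte, which would explain why it differs each time.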

When I use the sender executable of the ping-pong example as the source for the primary application’s packets, the errors are hardly different (this time the packets are not UDP):

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20009232 00000000 00000062

00001ed3 920b3204 00000000 000045e0

0: got completion error 0x4 vendor error 0x32 (wr_id 0 qp_num 0)

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20000099 00000000 00000062

00001ed3 000c9922 00000000 000164e0

0: got completion error 0x10 vendor error 0x99 (wr_id 1 qp_num 0)

mlx5: seti-node4: got completion with error:

00000000 00000000 00000000 00000000

00000000 000067ba 07000000 00000000

00000000 20000099 00000000 00000062

00001ed3 000d9922 00000000 000265e0

The initial completion error of 0x4 indicates the important issue (from RDMAmojo):

  • `IBV_WC_LOC_PROT_ERR` (4) - Local Protection Error: the locally posted Work Request’s buffers in the scatter/gather list does not reference a Memory Region that is valid for the requested operation.

This was rectified by setting the lkey of each scatter/gather entry (the sge_buffers) to the lkey of the registered memory region that contains its buffer.
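A minimal sketch of that check follows. It uses stand-in structs that mirror only the relevant fields of `struct ibv_mr` and `struct ibv_sge`, so that it compiles without libibverbs; `mr_view`, `sge_view`, `make_sge` and `sge_matches_mr` are hypothetical names, not part of any library. In the real code the equivalent step is assigning `sge.lkey = mr->lkey`, with `mr` being the region returned by `ibv_reg_mr()` for that buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the ibv_mr and ibv_sge fields that matter here. */
struct mr_view  { uint64_t addr; size_t length;   uint32_t lkey; };
struct sge_view { uint64_t addr; uint32_t length; uint32_t lkey; };

/* Build an SGE for a buffer inside mr, taking the lkey from the region
 * instead of from a separately tracked value. */
struct sge_view make_sge(const struct mr_view *mr, uint64_t buf, uint32_t len)
{
    struct sge_view sge;

    assert(mr != NULL);
    sge.addr = buf;
    sge.length = len;
    sge.lkey = mr->lkey;
    return sge;
}

/* Return 1 if the SGE carries the region's lkey and lies fully inside it,
 * 0 otherwise. A 0 here is the situation that would be expected to end in
 * a local protection error when the work request is processed. */
int sge_matches_mr(const struct sge_view *sge, const struct mr_view *mr)
{
    uint64_t offset;

    assert(sge != NULL && mr != NULL);
    if (sge->lkey != mr->lkey || sge->addr < mr->addr)
        return 0;
    /* Compare via the offset so addr + length cannot overflow. */
    offset = sge->addr - mr->addr;
    return offset <= mr->length && sge->length <= mr->length - offset;
}
```

Running such a check over every entry before posting the receive work requests would have flagged the mismatched lkey up front.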