Completion errors

During scaled-up runs of the DAOS stack on the Frontera/TACC cluster, we occasionally see completion errors from the Mellanox (mlx5) driver that look like this:

c121-063.frontera.tacc.utexas.edu ERROR 2021/04/05 17:38:12 daos_engine:0 mlx5: c121-063.frontera.tacc.utexas.edu: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

0000222f 00000000 00000000 00000000

00000000 00008813 080096cd 0351fad2

Is there a way to decode those errors to understand/debug what went wrong?

Driver: MLNX_OFED_LINUX-5.3-1.0.0.1

Provider: verbs;ofi_rxm

OFI: v1.12.0

A few other completion errors from a different cluster running the same reproducer:

wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: ERROR: daos_engine:1 mlx5: wolf-118.: got completion with error:

wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00000000 00000000 00000000

wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00000000 00000000 00000000

wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 0000014c 00000000 00000000 00000000

wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00008813 10002cce 00449dd3

wolf-119: -- Logs begin at Fri 2021-06-18 21:07:29 UTC, end at Mon 2021-06-21 14:12:30 UTC. --

wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: ERROR: daos_engine:1 mlx5: wolf-119.: got completion with error:

wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00000000 00000000 00000000

wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00000000 00000000 00000000

wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 0000016e 00000000 00000000 00000000

wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00008813 100029a0 002851d2

wolf-120: -- Logs begin at Fri 2021-06-18 21:07:00 UTC, end at Mon 2021-06-21 14:12:30 UTC. --

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: ERROR: daos_engine:1 mlx5: wolf-120.: got completion with error:

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000289 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002a9b 00589bd2

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: mlx5: wolf-120.: got completion with error:

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000299 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002aab 005da1d2

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: mlx5: wolf-120.: got completion with error:

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 000002a4 00000000 00000000 00000000

wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002ab6 004994d2

wolf-121: -- Logs begin at Fri 2021-06-18 21:07:29 UTC, end at Mon 2021-06-21 14:12:31 UTC. --

wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: ERROR: daos_engine:0 mlx5: wolf-121.: got completion with error:

wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00000000 00000000 00000000

wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00000000 00000000 00000000

wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000162 00000000 00000000 00000000

wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00008813 10002b7f 003520d2

In our experience, this is an application/software error and not an Nvidia issue.

Please check the “7.12.7 Completion With Error” section in the Ethernet Adapters Programming Manual: https://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf

Syndrome 0x13 is “Remote Access Error”, which can be caused by an invalid R_Key, for example. It is the last byte of the 00008813 word in the final row of each dump above.
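To make the decoding question more concrete, here is a minimal Python sketch that pulls the interesting fields out of one four-row dump. The byte offsets and the syndrome/opcode names are assumptions based on my reading of struct mlx5_err_cqe and the MLX5_CQE_SYNDROME_* / MLX5_OPCODE_* constants in rdma-core; only the 0x13 meaning is confirmed in this thread, so please check the rest against your installed headers and the programming manual. The function name decode_error_cqe is made up for illustration.

```python
import struct

# ASSUMPTION: names taken from rdma-core's MLX5_CQE_SYNDROME_* constants.
# Only 0x13 ("Remote access error") is confirmed by this thread.
SYNDROMES = {
    0x01: "Local length error",
    0x02: "Local QP operation error",
    0x04: "Local protection error",
    0x05: "Work request flushed error",
    0x06: "Memory window bind error",
    0x10: "Bad response error",
    0x11: "Local access error",
    0x12: "Remote invalid request error",
    0x13: "Remote access error",
    0x14: "Remote operation error",
    0x15: "Transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
    0x22: "Remote aborted error",
}

# ASSUMPTION: a few send-WQE opcodes from rdma-core's MLX5_OPCODE_* constants.
WQE_OPCODES = {0x08: "RDMA_WRITE", 0x0A: "SEND", 0x10: "RDMA_READ"}


def decode_error_cqe(dump_lines):
    """Decode the four rows printed after 'got completion with error:'.

    Each row may carry a log prefix; only its last four tokens are used.
    """
    words = [int(tok, 16) for line in dump_lines for tok in line.split()[-4:]]
    if len(words) != 16:
        raise ValueError("expected 4 rows of 4 hex words, got %d words" % len(words))

    # The driver prints the 64-byte CQE as sixteen big-endian 32-bit words.
    raw = struct.pack(">16I", *words)

    # ASSUMED offsets: vendor syndrome at byte 54, syndrome at byte 55,
    # opcode/QPN word at bytes 56-59, WQE counter at bytes 60-61,
    # op_own at byte 63 (CQE opcode in its high nibble).
    vendor_err_synd = raw[54]
    syndrome = raw[55]
    (opcode_qpn,) = struct.unpack_from(">I", raw, 56)
    (wqe_counter,) = struct.unpack_from(">H", raw, 60)
    wqe_opcode = opcode_qpn >> 24

    return {
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "vendor_err_synd": vendor_err_synd,
        "wqe_opcode": wqe_opcode,
        "wqe_opcode_name": WQE_OPCODES.get(wqe_opcode, "unknown"),
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "cqe_opcode": raw[63] >> 4,
    }


if __name__ == "__main__":
    frontera = [
        "00000000 00000000 00000000 00000000",
        "00000000 00000000 00000000 00000000",
        "0000222f 00000000 00000000 00000000",
        "00000000 00008813 080096cd 0351fad2",
    ]
    for key, value in decode_error_cqe(frontera).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

Under these assumptions, the Frontera dump decodes to syndrome 0x13 with vendor syndrome 0x88 on WQE opcode 0x08, while the wolf-* and ib_read_bw dumps carry opcode 0x10. The QP number and WQE counter can help match the failure to a specific connection and posted work request in your own logs.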

Please add more debug information to your application to verify all data. Most likely, it is some kind of memory corruption.
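As an illustration of the kind of check that debug information could feed, here is a small Python sketch that validates an RDMA request against what the target registered before the request is posted. Everything in it (RegisteredRegion, check_remote_access, the field names) is hypothetical and not part of DAOS, libfabric, or verbs; it only assumes that a remote access error typically means the R_Key or the address range did not match what the target registered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredRegion:
    """Hypothetical record of what the target registered and advertised."""
    addr: int     # start address of the registered buffer
    length: int   # size of the registered buffer in bytes
    rkey: int     # remote key handed to the initiator


def check_remote_access(region, remote_addr, length, rkey):
    """Return a list of problems; an empty list means the request looks sane."""
    problems = []
    if rkey != region.rkey:
        problems.append(
            "rkey mismatch: request uses 0x%x, region has 0x%x" % (rkey, region.rkey)
        )
    if length < 0:
        problems.append("negative length %d" % length)
    start, end = remote_addr, remote_addr + max(length, 0)
    if start < region.addr or end > region.addr + region.length:
        problems.append(
            "range [0x%x, 0x%x) is outside registered [0x%x, 0x%x)"
            % (start, end, region.addr, region.addr + region.length)
        )
    return problems


if __name__ == "__main__":
    # Made-up values: a 128-byte region and a 65535-byte read against it.
    region = RegisteredRegion(addr=0x7F0000001000, length=128, rkey=0x1234)
    for problem in check_remote_access(region, region.addr, 65535, 0x1234):
        print("would fail:", problem)
```

Logging the address, length, and key on both sides at the point where they are exchanged, and again just before posting, should show whether one of them was corrupted in between.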

A similar issue can be observed by running the ib_read_bw application with different message sizes on the two sides.

On one side:

ib_read_bw -d mlx5_0 -s 128

On the other:

ib_read_bw -x mlx5_0 -s 65535

Result:

One side:


#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]

ethernet_read_keys: Couldn't read remote address

Unable to read to socket/rdma_cm

Failed to exchange data between server and clients

Other side:

#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]

mlx5: b-csi-0527s: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 00008813 10000123 000084d2

Completion with error at client

Failed status 10: wr_id 0 syndrom 0x88

scnt=128, ccnt=0

If, after debugging, you find an issue with an Nvidia component, don’t hesitate to open a standard support ticket, as your organization has a support contract with us.