During scaled-up runs of the DAOS stack on the Frontera/TACC cluster, we occasionally see completion errors from the Mellanox mlx5 driver that look like this:
c121-063.frontera.tacc.utexas.edu ERROR 2021/04/05 17:38:12 daos_engine:0 mlx5: c121-063.frontera.tacc.utexas.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000222f 00000000 00000000 00000000
00000000 00008813 080096cd 0351fad2
Is there a way to decode those errors to understand/debug what went wrong?
OFED: MLNX_OFED_LINUX-5.3-1.0.0.1
Provider: verbs;ofi_rxm
OFI: v1.12.0
A few more completion errors from a different cluster running the same reproducer:
wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: ERROR: daos_engine:1 mlx5: wolf-118.: got completion with error:
wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00000000 00000000 00000000
wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00000000 00000000 00000000
wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 0000014c 00000000 00000000 00000000
wolf-118: Jun 19 00:12:33 wolf-118. srv[14520]: 00000000 00008813 10002cce 00449dd3
wolf-119: -- Logs begin at Fri 2021-06-18 21:07:29 UTC, end at Mon 2021-06-21 14:12:30 UTC. --
wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: ERROR: daos_engine:1 mlx5: wolf-119.: got completion with error:
wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00000000 00000000 00000000
wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00000000 00000000 00000000
wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 0000016e 00000000 00000000 00000000
wolf-119: Jun 19 00:12:33 wolf-119. srv[14655]: 00000000 00008813 100029a0 002851d2
wolf-120: -- Logs begin at Fri 2021-06-18 21:07:00 UTC, end at Mon 2021-06-21 14:12:30 UTC. --
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: ERROR: daos_engine:1 mlx5: wolf-120.: got completion with error:
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000289 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002a9b 00589bd2
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: mlx5: wolf-120.: got completion with error:
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000299 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002aab 005da1d2
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: mlx5: wolf-120.: got completion with error:
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 000002a4 00000000 00000000 00000000
wolf-120: Jun 19 00:12:33 wolf-120. srv[14654]: 00000000 00008813 10002ab6 004994d2
wolf-121: -- Logs begin at Fri 2021-06-18 21:07:29 UTC, end at Mon 2021-06-21 14:12:31 UTC. --
wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: ERROR: daos_engine:0 mlx5: wolf-121.: got completion with error:
wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00000000 00000000 00000000
wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00000000 00000000 00000000
wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000162 00000000 00000000 00000000
wolf-121: Jun 19 00:12:33 wolf-121. srv[14622]: 00000000 00008813 10002b7f 003520d2
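
As a starting point for decoding, here is a rough Python sketch. It assumes the 16 words are a raw dump of the 64-byte error CQE, printed as big-endian 32-bit values and laid out like struct mlx5_err_cqe in rdma-core's mlx5dv.h. The function name decode_err_cqe and the lookup tables are made up for illustration. The tables cover only the values I believe match the rdma-core headers, so please check them against the headers for your OFED version.

```python
import re

# Assumed layout (struct mlx5_err_cqe): byte 32..35 srqn, byte 54 vendor_err_synd,
# byte 55 syndrome, bytes 56..59 s_wqe_opcode_qpn, bytes 60..61 wqe_counter,
# byte 62 signature, byte 63 op_own.
SYNDROMES = {
    0x01: "LOCAL_LENGTH_ERR", 0x02: "LOCAL_QP_OP_ERR", 0x04: "LOCAL_PROT_ERR",
    0x05: "WR_FLUSH_ERR", 0x06: "MW_BIND_ERR", 0x10: "BAD_RESP_ERR",
    0x11: "LOCAL_ACCESS_ERR", 0x12: "REMOTE_INVAL_REQ_ERR",
    0x13: "REMOTE_ACCESS_ERR", 0x14: "REMOTE_OP_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR", 0x16: "RNR_RETRY_EXC_ERR",
    0x22: "REMOTE_ABORTED_ERR",
}
# Partial table of send WQE opcodes (assumption: only the common ones are listed).
WQE_OPCODES = {0x08: "RDMA_WRITE", 0x0A: "SEND", 0x10: "RDMA_READ"}
CQE_OPCODES = {0xD: "REQ_ERR", 0xE: "RESP_ERR"}

# A dump row is four 8-digit hex words at the end of a line; log prefixes are ignored.
ROW = re.compile(r"((?:[0-9a-fA-F]{8} ){3}[0-9a-fA-F]{8})\s*$", re.MULTILINE)


def decode_err_cqe(dump: str) -> dict:
    """Decode one 'got completion with error' dump (4 rows of 4 words)."""
    rows = ROW.findall(dump)
    if len(rows) < 4:
        raise ValueError("expected 4 rows of 4 hex words")
    raw = bytes.fromhex("".join(rows[-4:]).replace(" ", ""))
    opcode_qpn = int.from_bytes(raw[56:60], "big")
    syndrome = raw[55]
    wqe_opcode = opcode_qpn >> 24
    cqe_opcode = raw[63] >> 4
    return {
        "srqn": int.from_bytes(raw[32:36], "big"),
        "vendor_err_synd": raw[54],
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "wqe_opcode": wqe_opcode,
        "wqe_opcode_name": WQE_OPCODES.get(wqe_opcode, "unknown"),
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": int.from_bytes(raw[60:62], "big"),
        "cqe_opcode": cqe_opcode,
        "cqe_opcode_name": CQE_OPCODES.get(cqe_opcode, "unknown"),
    }


SAMPLE = """\
mlx5: c121-063.frontera.tacc.utexas.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000222f 00000000 00000000 00000000
00000000 00008813 080096cd 0351fad2
"""

if __name__ == "__main__":
    for key, value in decode_err_cqe(SAMPLE).items():
        print(f"{key}: {value:#x}" if isinstance(value, int) else f"{key}: {value}")
```

If that layout is right, every dump above has syndrome 0x13 (remote access error) with vendor syndrome 0x88 on a requester-side error CQE. The Frontera one would be for an RDMA write (opcode 0x08) and the wolf ones for RDMA reads (opcode 0x10). The vendor syndrome is firmware-specific, so its meaning would still need confirming with Mellanox.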